calute.llms.base#
Base LLM interface for all providers.
This module provides the abstract base classes and configuration dataclasses for integrating Large Language Model (LLM) providers into the Calute framework. It defines a standardized interface that all provider implementations must follow, ensuring consistent behavior across different LLM backends.
The module supports features like: - Synchronous and asynchronous completion generation - Streaming responses with function call detection - Provider-specific configuration with sensible defaults - Automatic model metadata fetching - Context manager support for resource management - Tool/function call parsing from various formats
Supported provider implementations (in separate modules): - OpenAI (GPT-4, GPT-3.5, etc.) - Anthropic (Claude models) - Google (Gemini models) - vLLM (local deployment) - LiteLLM (unified interface)
- Typical usage example:
from calute.llms.openai_llm import OpenAILLM from calute.llms.base import LLMConfig
- config = LLMConfig(
model=”gpt-4”, temperature=0.7, max_tokens=2048, api_key=”your-api-key”
)
llm = OpenAILLM(config) response = await llm.generate_completion(“Hello, world!”) content = llm.extract_content(response)
- class calute.llms.base.BaseLLM(config: calute.llms.base.LLMConfig | None = None, **kwargs)[source]#
Bases:
ABCAbstract base class for all LLM provider implementations.
BaseLLM defines the standard interface that all LLM provider implementations must follow within the Calute framework. It provides common functionality for configuration management, message formatting, and resource handling, while requiring subclasses to implement provider-specific completion logic.
This class supports both synchronous and asynchronous operations, streaming responses with function call detection, and automatic model metadata fetching. It is designed to be used as an async context manager for proper resource cleanup.
- Subclasses must implement:
_initialize_client(): Set up the provider-specific client
generate_completion(): Generate completions from prompts
extract_content(): Extract text from provider responses
process_streaming_response(): Handle streaming with callbacks
stream_completion(): Synchronous streaming with function detection
astream_completion(): Asynchronous streaming with function detection
- config#
LLMConfig instance containing provider configuration.
Example
- class MyProviderLLM(BaseLLM):
- def _initialize_client(self) -> None:
self.client = MyProviderClient(api_key=self.config.api_key)
- async def generate_completion(self, prompt, **kwargs):
return await self.client.complete(prompt)
- def extract_content(self, response) -> str:
return response.text
# Usage async with MyProviderLLM(config) as llm:
response = await llm.generate_completion(“Hello!”) print(llm.extract_content(response))
- abstract async astream_completion(response: Any, agent: Any | None = None) AsyncIterator[dict[str, Any]][source]#
Async stream completion chunks with function call detection.
- Parameters
response – The async streaming response from the provider
agent – Optional agent for function detection
- Yields
Dictionary with streaming chunk information
- async close() None[source]#
Close any open connections and release resources.
This method should be called when done using the LLM provider to properly clean up resources. It is called automatically when using the provider as an async context manager.
Override in subclasses to implement provider-specific cleanup such as closing HTTP sessions or releasing connection pools.
- abstract extract_content(response: Any) str[source]#
Extract text content from provider response.
- Parameters
response – The raw response from the provider
- Returns
The extracted text content
- fetch_model_info() dict[str, Any][source]#
Fetch model metadata from provider API.
Override in subclasses to implement provider-specific fetching of model capabilities and limits. This information can be used to optimize token usage and prevent context overflow.
- Common metadata fields include:
max_model_len: Maximum context window size in tokens
context_window: Alias for max_model_len
supports_function_calling: Whether model supports tools
supports_vision: Whether model can process images
input_token_limit: Maximum input tokens
output_token_limit: Maximum output tokens
- Returns
Dictionary with model metadata. Empty dict if metadata cannot be fetched or is not supported by the provider.
Note
This method is called by _auto_fetch_model_info() during client initialization. Errors are silently ignored to prevent initialization failures.
- format_messages(messages: list[dict[str, str]], system_prompt: str | None = None) list[dict[str, str]][source]#
Format messages for the provider.
Prepares a list of messages for the LLM API call by optionally prepending a system prompt. The default implementation simply adds a system message at the beginning if provided.
Override in subclasses for provider-specific message formatting requirements (e.g., role name mappings, message structure).
- Parameters
messages – List of message dictionaries, each containing ‘role’ and ‘content’ keys. Roles are typically ‘user’, ‘assistant’, or ‘system’.
system_prompt – Optional system prompt to prepend as the first message with role=’system’.
- Returns
Formatted list of message dictionaries ready for the API.
Example
messages = [{“role”: “user”, “content”: “Hello”}] formatted = llm.format_messages(messages, “You are helpful.”) # Returns: # [ # {“role”: “system”, “content”: “You are helpful.”}, # {“role”: “user”, “content”: “Hello”} # ]
- abstract async generate_completion(prompt: str | list[dict[str, str]], model: str | None = None, temperature: float | None = None, max_tokens: int | None = None, top_p: float | None = None, stop: list[str] | None = None, stream: bool | None = None, **kwargs) Any[source]#
Generate a completion from the LLM.
- Parameters
prompt – The prompt string or list of messages
model – Model to use (overrides config)
temperature – Temperature for sampling (overrides config)
max_tokens – Maximum tokens to generate (overrides config)
top_p – Top-p sampling parameter (overrides config)
stop – Stop sequences (overrides config)
stream – Whether to stream the response (overrides config)
**kwargs – Additional provider-specific parameters
- Returns
The completion response (format varies by provider)
- get_model_info() dict[str, Any][source]#
Get information about the current model configuration.
Returns a dictionary containing the current configuration and provider information. Useful for debugging, logging, and displaying model status in UIs.
- Returns
provider: Provider name derived from class name
model: Model identifier from config
temperature: Current temperature setting
max_tokens: Maximum tokens for generation
max_model_len: Maximum context length (if known)
stream: Whether streaming is enabled
- Return type
Dictionary with model information containing
- parse_tool_calls(raw_data: Any) list[dict[str, Any]][source]#
Parse tool/function calls from provider-specific format.
- Parameters
raw_data – Provider-specific tool call data
- Returns
Standardized list of tool calls
- abstract async process_streaming_response(response: Any, callback: Callable[[str, Any], None]) str[source]#
Process a streaming response from the provider.
- Parameters
response – The streaming response object
callback – Function to call for each chunk (content, raw_chunk)
- Returns
The complete accumulated content
- abstract stream_completion(response: Any, agent: Any | None = None) Iterator[dict[str, Any]][source]#
Stream completion chunks with function call detection.
- Parameters
response – The streaming response from the provider
agent – Optional agent for function detection
- Yields
Dictionary with streaming chunk information –
content: Text content in this chunk
buffered_content: Accumulated content so far
function_calls: List of detected function calls
tool_calls: Raw tool call data from provider
is_final: Whether this is the final chunk
- validate_config() None[source]#
Validate the configuration for the provider.
Checks that all configuration values are within valid ranges. This method can be called explicitly or automatically by provider implementations during initialization.
- Raises
ValueError – If model name is empty or missing.
ValueError – If temperature is not between 0 and 2.
ValueError – If max_tokens is not positive.
ValueError – If top_p is not between 0 and 1.
- class calute.llms.base.LLMConfig(model: str, temperature: float = 0.7, max_tokens: int = 2048, top_p: float = 0.95, top_k: int | None = None, min_p: float = 0.0, frequency_penalty: float = 0.0, presence_penalty: float = 0.0, repetition_penalty: float = 1.0, stop: list[str] | None = None, stream: bool = False, api_key: str | None = None, base_url: str | None = None, timeout: float = 60.0, retry_attempts: int = 3, extra_params: dict[str, typing.Any] = <factory>, max_model_len: int | None = None, model_metadata: dict[str, typing.Any] = <factory>)[source]#
Bases:
objectConfiguration dataclass for LLM providers.
LLMConfig provides a standardized way to configure LLM provider instances with common parameters like model selection, sampling settings, and API credentials. All provider implementations accept this configuration, though some parameters may be provider-specific.
- model#
The model identifier to use (e.g., ‘gpt-4’, ‘claude-3-opus’).
- Type
str
- temperature#
Controls randomness in sampling. Higher values (0.8-1.0) make output more random, lower values (0.1-0.3) more deterministic.
- Type
float
- max_tokens#
Maximum number of tokens to generate in the response.
- Type
int
- top_p#
Nucleus sampling parameter. Only tokens comprising the top_p probability mass are considered for sampling.
- Type
float
- top_k#
Top-k sampling parameter. Only the top k most likely tokens are considered. Set to None to disable.
- Type
int | None
- min_p#
Minimum probability threshold for sampling.
- Type
float
- frequency_penalty#
Penalizes tokens based on their frequency in the text so far, reducing repetition. Range: -2.0 to 2.0.
- Type
float
- presence_penalty#
Penalizes tokens that have appeared at all in the text so far, encouraging topic diversity. Range: -2.0 to 2.0.
- Type
float
- repetition_penalty#
Multiplicative penalty for token repetition.
- Type
float
- stop#
List of sequences where the model should stop generating.
- Type
list[str] | None
- stream#
Whether to stream the response token by token.
- Type
bool
- api_key#
API key for the provider. Can also be set via environment.
- Type
str | None
- base_url#
Custom base URL for API requests (useful for proxies or self-hosted instances).
- Type
str | None
- timeout#
Request timeout in seconds.
- Type
float
- retry_attempts#
Number of retry attempts for failed requests.
- Type
int
- extra_params#
Dictionary for provider-specific parameters not covered by the standard configuration options.
- Type
dict[str, Any]
- max_model_len#
Maximum context length supported by the model. Auto-populated by fetch_model_info() when available.
- Type
int | None
- model_metadata#
Dictionary storing additional model information fetched from the provider API.
- Type
dict[str, Any]
Example
- config = LLMConfig(
model=”gpt-4-turbo”, temperature=0.5, max_tokens=4096, stream=True, api_key=”sk-…”
)
- api_key: str | None = None#
- base_url: str | None = None#
- extra_params: dict[str, Any]#
- frequency_penalty: float = 0.0#
- max_model_len: int | None = None#
- max_tokens: int = 2048#
- min_p: float = 0.0#
- model: str#
- model_metadata: dict[str, Any]#
- presence_penalty: float = 0.0#
- repetition_penalty: float = 1.0#
- retry_attempts: int = 3#
- stop: list[str] | None = None#
- stream: bool = False#
- temperature: float = 0.7#
- timeout: float = 60.0#
- top_k: int | None = None#
- top_p: float = 0.95#