calute.llms.base

calute.llms.base#

Base LLM interface for all providers.

This module provides the abstract base classes and configuration dataclasses for integrating Large Language Model (LLM) providers into the Calute framework. It defines a standardized interface that all provider implementations must follow, ensuring consistent behavior across different LLM backends.

The module supports features like: - Synchronous and asynchronous completion generation - Streaming responses with function call detection - Provider-specific configuration with sensible defaults - Automatic model metadata fetching - Context manager support for resource management - Tool/function call parsing from various formats

Supported provider implementations (in separate modules): - OpenAI (GPT-4, GPT-3.5, etc.) - Anthropic (Claude models) - Google (Gemini models) - vLLM (local deployment) - LiteLLM (unified interface)

Typical usage example:

from calute.llms.openai_llm import OpenAILLM from calute.llms.base import LLMConfig

config = LLMConfig(: model=”gpt-4”, temperature=0.7, max_tokens=2048, api_key=”your-api-key”

)

llm = OpenAILLM(config) response = await llm.generate_completion(“Hello, world!”) content = llm.extract_content(response)

class calute.llms.base.BaseLLM(config: calute.llms.base.LLMConfig | None = None, **kwargs)[source]#

Bases: ABC

Abstract base class for all LLM provider implementations.

BaseLLM defines the standard interface that all LLM provider implementations must follow within the Calute framework. It provides common functionality for configuration management, message formatting, and resource handling, while requiring subclasses to implement provider-specific completion logic.

This class supports both synchronous and asynchronous operations, streaming responses with function call detection, and automatic model metadata fetching. It is designed to be used as an async context manager for proper resource cleanup.

Subclasses must implement:

_initialize_client(): Set up the provider-specific client
generate_completion(): Generate completions from prompts
extract_content(): Extract text from provider responses
process_streaming_response(): Handle streaming with callbacks
stream_completion(): Synchronous streaming with function detection
astream_completion(): Asynchronous streaming with function detection

config#: LLMConfig instance containing provider configuration.

Example

class MyProviderLLM(BaseLLM):

def _initialize_client(self) -> None:: self.client = MyProviderClient(api_key=self.config.api_key)
async def generate_completion(self, prompt, **kwargs):: return await self.client.complete(prompt)
def extract_content(self, response) -> str:: return response.text

# Usage async with MyProviderLLM(config) as llm:

response = await llm.generate_completion(“Hello!”) print(llm.extract_content(response))

abstract async astream_completion(response: Any, agent: Any | None = None) → AsyncIterator[dict[str, Any]][source]#

Async stream completion chunks with function call detection.

Parameters

response – The async streaming response from the provider
agent – Optional agent for function detection

Yields

Dictionary with streaming chunk information

async close() → None[source]#

Close any open connections and release resources.

This method should be called when done using the LLM provider to properly clean up resources. It is called automatically when using the provider as an async context manager.

Override in subclasses to implement provider-specific cleanup such as closing HTTP sessions or releasing connection pools.

abstract extract_content(response: Any) → str[source]#

Extract text content from provider response.

Parameters: response – The raw response from the provider
Returns: The extracted text content

fetch_model_info() → dict[str, Any][source]#

Fetch model metadata from provider API.

Override in subclasses to implement provider-specific fetching of model capabilities and limits. This information can be used to optimize token usage and prevent context overflow.

Common metadata fields include:

max_model_len: Maximum context window size in tokens
context_window: Alias for max_model_len
supports_function_calling: Whether model supports tools
supports_vision: Whether model can process images
input_token_limit: Maximum input tokens
output_token_limit: Maximum output tokens

Returns: Dictionary with model metadata. Empty dict if metadata cannot be fetched or is not supported by the provider.

Note

This method is called by _auto_fetch_model_info() during client initialization. Errors are silently ignored to prevent initialization failures.

format_messages(messages: list[dict[str, str]], system_prompt: str | None = None) → list[dict[str, str]][source]#

Format messages for the provider.

Prepares a list of messages for the LLM API call by optionally prepending a system prompt. The default implementation simply adds a system message at the beginning if provided.

Override in subclasses for provider-specific message formatting requirements (e.g., role name mappings, message structure).

Parameters

messages – List of message dictionaries, each containing ‘role’ and ‘content’ keys. Roles are typically ‘user’, ‘assistant’, or ‘system’.
system_prompt – Optional system prompt to prepend as the first message with role=’system’.

Returns

Formatted list of message dictionaries ready for the API.

Example

messages = [{“role”: “user”, “content”: “Hello”}] formatted = llm.format_messages(messages, “You are helpful.”) # Returns: # [ # {“role”: “system”, “content”: “You are helpful.”}, # {“role”: “user”, “content”: “Hello”} # ]

Generate a completion from the LLM.

Parameters

prompt – The prompt string or list of messages
model – Model to use (overrides config)
temperature – Temperature for sampling (overrides config)
max_tokens – Maximum tokens to generate (overrides config)
top_p – Top-p sampling parameter (overrides config)
stop – Stop sequences (overrides config)
stream – Whether to stream the response (overrides config)
**kwargs – Additional provider-specific parameters

Returns

The completion response (format varies by provider)

get_model_info() → dict[str, Any][source]#

Get information about the current model configuration.

Returns a dictionary containing the current configuration and provider information. Useful for debugging, logging, and displaying model status in UIs.

Returns

provider: Provider name derived from class name
model: Model identifier from config
temperature: Current temperature setting
max_tokens: Maximum tokens for generation
max_model_len: Maximum context length (if known)
stream: Whether streaming is enabled

Return type

Dictionary with model information containing

parse_tool_calls(raw_data: Any) → list[dict[str, Any]][source]#

Parse tool/function calls from provider-specific format.

Parameters: raw_data – Provider-specific tool call data
Returns: Standardized list of tool calls

abstract async process_streaming_response(response: Any, callback: Callable[[str, Any], None]) → str[source]#

Process a streaming response from the provider.

Parameters

response – The streaming response object
callback – Function to call for each chunk (content, raw_chunk)

Returns

The complete accumulated content

abstract stream_completion(response: Any, agent: Any | None = None) → Iterator[dict[str, Any]][source]#

Stream completion chunks with function call detection.

Parameters

response – The streaming response from the provider
agent – Optional agent for function detection

Yields

Dictionary with streaming chunk information –

content: Text content in this chunk
buffered_content: Accumulated content so far
function_calls: List of detected function calls
tool_calls: Raw tool call data from provider
is_final: Whether this is the final chunk

validate_config() → None[source]#

Validate the configuration for the provider.

Checks that all configuration values are within valid ranges. This method can be called explicitly or automatically by provider implementations during initialization.

Raises

ValueError – If model name is empty or missing.
ValueError – If temperature is not between 0 and 2.
ValueError – If max_tokens is not positive.
ValueError – If top_p is not between 0 and 1.

class calute.llms.base.LLMConfig(model: str, temperature: float = 0.7, max_tokens: int = 2048, top_p: float = 0.95, top_k: int | None = None, min_p: float = 0.0, frequency_penalty: float = 0.0, presence_penalty: float = 0.0, repetition_penalty: float = 1.0, stop: list[str] | None = None, stream: bool = False, api_key: str | None = None, base_url: str | None = None, timeout: float = 60.0, retry_attempts: int = 3, extra_params: dict[str, typing.Any] = <factory>, max_model_len: int | None = None, model_metadata: dict[str, typing.Any] = <factory>)[source]#

Bases: object

Configuration dataclass for LLM providers.

LLMConfig provides a standardized way to configure LLM provider instances with common parameters like model selection, sampling settings, and API credentials. All provider implementations accept this configuration, though some parameters may be provider-specific.

model#

The model identifier to use (e.g., ‘gpt-4’, ‘claude-3-opus’).

Type: str

temperature#

Controls randomness in sampling. Higher values (0.8-1.0) make output more random, lower values (0.1-0.3) more deterministic.

Type: float

max_tokens#

Maximum number of tokens to generate in the response.

Type: int

top_p#

Nucleus sampling parameter. Only tokens comprising the top_p probability mass are considered for sampling.

Type: float

top_k#

Top-k sampling parameter. Only the top k most likely tokens are considered. Set to None to disable.

Type: int | None

min_p#

Minimum probability threshold for sampling.

Type: float

frequency_penalty#

Penalizes tokens based on their frequency in the text so far, reducing repetition. Range: -2.0 to 2.0.

Type: float

presence_penalty#

Penalizes tokens that have appeared at all in the text so far, encouraging topic diversity. Range: -2.0 to 2.0.

Type: float

repetition_penalty#

Multiplicative penalty for token repetition.

Type: float

stop#

List of sequences where the model should stop generating.

Type: list[str] | None

stream#

Whether to stream the response token by token.

Type: bool

api_key#

API key for the provider. Can also be set via environment.

Type: str | None

base_url#

Custom base URL for API requests (useful for proxies or self-hosted instances).

Type: str | None

timeout#

Request timeout in seconds.

Type: float

retry_attempts#

Number of retry attempts for failed requests.

Type: int

extra_params#

Dictionary for provider-specific parameters not covered by the standard configuration options.

Type: dict[str, Any]

max_model_len#

Maximum context length supported by the model. Auto-populated by fetch_model_info() when available.

Type: int | None

model_metadata#

Dictionary storing additional model information fetched from the provider API.

Type: dict[str, Any]

Example

config = LLMConfig(: model=”gpt-4-turbo”, temperature=0.5, max_tokens=4096, stream=True, api_key=”sk-…”

)

api_key: str | None = None#

base_url: str | None = None#

extra_params: dict[str, Any]#

frequency_penalty: float = 0.0#

max_model_len: int | None = None#

max_tokens: int = 2048#

min_p: float = 0.0#

model: str#

model_metadata: dict[str, Any]#

presence_penalty: float = 0.0#

repetition_penalty: float = 1.0#

retry_attempts: int = 3#

stop: list[str] | None = None#

stream: bool = False#

temperature: float = 0.7#

timeout: float = 60.0#

top_k: int | None = None#

top_p: float = 0.95#

calute.llms.base

Contents

calute.llms.base#