calute.llms.base#

Base LLM interface for all providers.

This module provides the abstract base classes and configuration dataclasses for integrating Large Language Model (LLM) providers into the Calute framework. It defines a standardized interface that all provider implementations must follow, ensuring consistent behavior across different LLM backends.

The module supports features like: - Synchronous and asynchronous completion generation - Streaming responses with function call detection - Provider-specific configuration with sensible defaults - Automatic model metadata fetching - Context manager support for resource management - Tool/function call parsing from various formats

Supported provider implementations (in separate modules): - OpenAI (GPT-4, GPT-3.5, etc.) - Anthropic (Claude models) - Google (Gemini models) - vLLM (local deployment) - LiteLLM (unified interface)

Typical usage example:

from calute.llms.openai_llm import OpenAILLM from calute.llms.base import LLMConfig

config = LLMConfig(

model=”gpt-4”, temperature=0.7, max_tokens=2048, api_key=”your-api-key”

)

llm = OpenAILLM(config) response = await llm.generate_completion(“Hello, world!”) content = llm.extract_content(response)

class calute.llms.base.BaseLLM(config: calute.llms.base.LLMConfig | None = None, **kwargs)[source]#

Bases: ABC

Abstract base class for all LLM provider implementations.

BaseLLM defines the standard interface that all LLM provider implementations must follow within the Calute framework. It provides common functionality for configuration management, message formatting, and resource handling, while requiring subclasses to implement provider-specific completion logic.

This class supports both synchronous and asynchronous operations, streaming responses with function call detection, and automatic model metadata fetching. It is designed to be used as an async context manager for proper resource cleanup.

Subclasses must implement:
  • _initialize_client(): Set up the provider-specific client

  • generate_completion(): Generate completions from prompts

  • extract_content(): Extract text from provider responses

  • process_streaming_response(): Handle streaming with callbacks

  • stream_completion(): Synchronous streaming with function detection

  • astream_completion(): Asynchronous streaming with function detection

config#

LLMConfig instance containing provider configuration.

Example

class MyProviderLLM(BaseLLM):
def _initialize_client(self) -> None:

self.client = MyProviderClient(api_key=self.config.api_key)

async def generate_completion(self, prompt, **kwargs):

return await self.client.complete(prompt)

def extract_content(self, response) -> str:

return response.text

# Usage async with MyProviderLLM(config) as llm:

response = await llm.generate_completion(“Hello!”) print(llm.extract_content(response))

abstract async astream_completion(response: Any, agent: Any | None = None) AsyncIterator[dict[str, Any]][source]#

Async stream completion chunks with function call detection.

Parameters
  • response – The async streaming response from the provider

  • agent – Optional agent for function detection

Yields

Dictionary with streaming chunk information

async close() None[source]#

Close any open connections and release resources.

This method should be called when done using the LLM provider to properly clean up resources. It is called automatically when using the provider as an async context manager.

Override in subclasses to implement provider-specific cleanup such as closing HTTP sessions or releasing connection pools.

abstract extract_content(response: Any) str[source]#

Extract text content from provider response.

Parameters

response – The raw response from the provider

Returns

The extracted text content

fetch_model_info() dict[str, Any][source]#

Fetch model metadata from provider API.

Override in subclasses to implement provider-specific fetching of model capabilities and limits. This information can be used to optimize token usage and prevent context overflow.

Common metadata fields include:
  • max_model_len: Maximum context window size in tokens

  • context_window: Alias for max_model_len

  • supports_function_calling: Whether model supports tools

  • supports_vision: Whether model can process images

  • input_token_limit: Maximum input tokens

  • output_token_limit: Maximum output tokens

Returns

Dictionary with model metadata. Empty dict if metadata cannot be fetched or is not supported by the provider.

Note

This method is called by _auto_fetch_model_info() during client initialization. Errors are silently ignored to prevent initialization failures.

format_messages(messages: list[dict[str, str]], system_prompt: str | None = None) list[dict[str, str]][source]#

Format messages for the provider.

Prepares a list of messages for the LLM API call by optionally prepending a system prompt. The default implementation simply adds a system message at the beginning if provided.

Override in subclasses for provider-specific message formatting requirements (e.g., role name mappings, message structure).

Parameters
  • messages – List of message dictionaries, each containing ‘role’ and ‘content’ keys. Roles are typically ‘user’, ‘assistant’, or ‘system’.

  • system_prompt – Optional system prompt to prepend as the first message with role=’system’.

Returns

Formatted list of message dictionaries ready for the API.

Example

messages = [{“role”: “user”, “content”: “Hello”}] formatted = llm.format_messages(messages, “You are helpful.”) # Returns: # [ # {“role”: “system”, “content”: “You are helpful.”}, # {“role”: “user”, “content”: “Hello”} # ]

abstract async generate_completion(prompt: str | list[dict[str, str]], model: str | None = None, temperature: float | None = None, max_tokens: int | None = None, top_p: float | None = None, stop: list[str] | None = None, stream: bool | None = None, **kwargs) Any[source]#

Generate a completion from the LLM.

Parameters
  • prompt – The prompt string or list of messages

  • model – Model to use (overrides config)

  • temperature – Temperature for sampling (overrides config)

  • max_tokens – Maximum tokens to generate (overrides config)

  • top_p – Top-p sampling parameter (overrides config)

  • stop – Stop sequences (overrides config)

  • stream – Whether to stream the response (overrides config)

  • **kwargs – Additional provider-specific parameters

Returns

The completion response (format varies by provider)

get_model_info() dict[str, Any][source]#

Get information about the current model configuration.

Returns a dictionary containing the current configuration and provider information. Useful for debugging, logging, and displaying model status in UIs.

Returns

  • provider: Provider name derived from class name

  • model: Model identifier from config

  • temperature: Current temperature setting

  • max_tokens: Maximum tokens for generation

  • max_model_len: Maximum context length (if known)

  • stream: Whether streaming is enabled

Return type

Dictionary with model information containing

parse_tool_calls(raw_data: Any) list[dict[str, Any]][source]#

Parse tool/function calls from provider-specific format.

Parameters

raw_data – Provider-specific tool call data

Returns

Standardized list of tool calls

abstract async process_streaming_response(response: Any, callback: Callable[[str, Any], None]) str[source]#

Process a streaming response from the provider.

Parameters
  • response – The streaming response object

  • callback – Function to call for each chunk (content, raw_chunk)

Returns

The complete accumulated content

abstract stream_completion(response: Any, agent: Any | None = None) Iterator[dict[str, Any]][source]#

Stream completion chunks with function call detection.

Parameters
  • response – The streaming response from the provider

  • agent – Optional agent for function detection

Yields

Dictionary with streaming chunk information

  • content: Text content in this chunk

  • buffered_content: Accumulated content so far

  • function_calls: List of detected function calls

  • tool_calls: Raw tool call data from provider

  • is_final: Whether this is the final chunk

validate_config() None[source]#

Validate the configuration for the provider.

Checks that all configuration values are within valid ranges. This method can be called explicitly or automatically by provider implementations during initialization.

Raises
  • ValueError – If model name is empty or missing.

  • ValueError – If temperature is not between 0 and 2.

  • ValueError – If max_tokens is not positive.

  • ValueError – If top_p is not between 0 and 1.

class calute.llms.base.LLMConfig(model: str, temperature: float = 0.7, max_tokens: int = 2048, top_p: float = 0.95, top_k: int | None = None, min_p: float = 0.0, frequency_penalty: float = 0.0, presence_penalty: float = 0.0, repetition_penalty: float = 1.0, stop: list[str] | None = None, stream: bool = False, api_key: str | None = None, base_url: str | None = None, timeout: float = 60.0, retry_attempts: int = 3, extra_params: dict[str, typing.Any] = <factory>, max_model_len: int | None = None, model_metadata: dict[str, typing.Any] = <factory>)[source]#

Bases: object

Configuration dataclass for LLM providers.

LLMConfig provides a standardized way to configure LLM provider instances with common parameters like model selection, sampling settings, and API credentials. All provider implementations accept this configuration, though some parameters may be provider-specific.

model#

The model identifier to use (e.g., ‘gpt-4’, ‘claude-3-opus’).

Type

str

temperature#

Controls randomness in sampling. Higher values (0.8-1.0) make output more random, lower values (0.1-0.3) more deterministic.

Type

float

max_tokens#

Maximum number of tokens to generate in the response.

Type

int

top_p#

Nucleus sampling parameter. Only tokens comprising the top_p probability mass are considered for sampling.

Type

float

top_k#

Top-k sampling parameter. Only the top k most likely tokens are considered. Set to None to disable.

Type

int | None

min_p#

Minimum probability threshold for sampling.

Type

float

frequency_penalty#

Penalizes tokens based on their frequency in the text so far, reducing repetition. Range: -2.0 to 2.0.

Type

float

presence_penalty#

Penalizes tokens that have appeared at all in the text so far, encouraging topic diversity. Range: -2.0 to 2.0.

Type

float

repetition_penalty#

Multiplicative penalty for token repetition.

Type

float

stop#

List of sequences where the model should stop generating.

Type

list[str] | None

stream#

Whether to stream the response token by token.

Type

bool

api_key#

API key for the provider. Can also be set via environment.

Type

str | None

base_url#

Custom base URL for API requests (useful for proxies or self-hosted instances).

Type

str | None

timeout#

Request timeout in seconds.

Type

float

retry_attempts#

Number of retry attempts for failed requests.

Type

int

extra_params#

Dictionary for provider-specific parameters not covered by the standard configuration options.

Type

dict[str, Any]

max_model_len#

Maximum context length supported by the model. Auto-populated by fetch_model_info() when available.

Type

int | None

model_metadata#

Dictionary storing additional model information fetched from the provider API.

Type

dict[str, Any]

Example

config = LLMConfig(

model=”gpt-4-turbo”, temperature=0.5, max_tokens=4096, stream=True, api_key=”sk-…”

)

api_key: str | None = None#
base_url: str | None = None#
extra_params: dict[str, Any]#
frequency_penalty: float = 0.0#
max_model_len: int | None = None#
max_tokens: int = 2048#
min_p: float = 0.0#
model: str#
model_metadata: dict[str, Any]#
presence_penalty: float = 0.0#
repetition_penalty: float = 1.0#
retry_attempts: int = 3#
stop: list[str] | None = None#
stream: bool = False#
temperature: float = 0.7#
timeout: float = 60.0#
top_k: int | None = None#
top_p: float = 0.95#