calute.api_server.completion_service#

Chat completion service for handling Calute agent interactions.

This module provides the completion service infrastructure for Calute, including: - Non-streaming chat completions with full response generation - Streaming chat completions with server-sent events - Request parameter application to agents - Integration with Calute’s agent execution system

The service follows the OpenAI-compatible API format for chat completions and supports both synchronous and asynchronous response generation.

class calute.api_server.completion_service.CompletionService(calute: Calute, can_overide_samplings: bool = False)[source]#

Bases: object

Service for handling chat completions with Calute agents.

Provides the core functionality for processing chat completion requests, including both streaming and non-streaming responses. This service wraps the Calute agent execution system and formats responses according to the OpenAI-compatible API specification.

The service is used internally by the ChatRouter and UnifiedChatRouter to delegate completion logic away from the HTTP routing layer.

calute#

The Calute instance used for running agent completions.

can_overide_samplings#

Flag indicating whether request parameters can override agent sampling settings. When True, parameters like temperature, top_p, max_tokens, stop, presence_penalty, frequency_penalty, repetition_penalty, top_k, and min_p from the request will be applied to the agent before execution.

Example

>>> from calute.api_server.completion_service import CompletionService
>>> service = CompletionService(calute_instance, can_overide_samplings=True)
apply_request_parameters(agent: Agent, request: ChatCompletionRequest) None[source]#

Apply sampling parameters from the request to the agent.

Conditionally transfers sampling parameters from the incoming request to the agent configuration. This only takes effect when can_overide_samplings is True. Each parameter is applied only if it is explicitly set (not None) in the request.

The following parameters are supported:
  • max_tokens: Maximum number of tokens to generate.

  • temperature: Sampling temperature.

  • top_p: Nucleus sampling threshold.

  • top_k: Top-k sampling parameter.

  • min_p: Minimum probability threshold.

  • stop: Stop sequences for generation.

  • presence_penalty: Presence penalty value.

  • frequency_penalty: Frequency penalty value.

  • repetition_penalty: Repetition penalty value.

Parameters
  • agent – The Agent instance whose sampling settings will be modified in-place.

  • request – The ChatCompletionRequest containing the sampling parameters to apply.

async create_completion(agent: Agent, messages: MessagesHistory, request: ChatCompletionRequest) ChatCompletionResponse[source]#

Create a non-streaming chat completion.

Executes the Calute agent with the provided messages and returns a complete response. The synchronous calute.run() call is offloaded to a thread executor via loop.run_in_executor to avoid blocking the async event loop.

The response includes the full generated text, usage statistics (prompt tokens, completion tokens, processing time, etc.), and a finish reason of "stop".

Parameters
  • agent – The Agent instance to use for generating the completion.

  • messages – The MessagesHistory containing the conversation context to process.

  • request – The original ChatCompletionRequest, used to extract the model name for the response object.

Returns

A ChatCompletionResponse containing a single choice with the assistant’s full response message, usage information (token counts, processing time, tokens per second), and finish reason "stop".

async create_streaming_completion(agent: Agent, messages: MessagesHistory, request: ChatCompletionRequest) AsyncIterator[str][source]#

Create a streaming chat completion using server-sent events.

Executes the Calute agent in streaming mode and yields response chunks as SSE-formatted strings. Each chunk is serialized as a ChatCompletionStreamResponse JSON object prefixed with "data: " and followed by double newlines, conforming to the SSE protocol.

The method yields an asyncio.sleep(0) after each chunk to allow the event loop to process other tasks, enabling cooperative multitasking during long-running generations.

After all content chunks have been yielded, a final chunk with an empty delta and finish_reason="stop" is emitted, followed by a "data: [DONE]" sentinel to signal stream completion.

Parameters
  • agent – The Agent instance to use for generating the streaming completion.

  • messages – The MessagesHistory containing the conversation context to process.

  • request – The original ChatCompletionRequest, used to extract the model name for each streamed response chunk.

Yields

Byte-encoded and plain string SSE events. Each content event is a bytes object containing "data: {json}\n\n". The final "[DONE]" event is a plain string.