calute.api_server.completion_service#
Chat completion service for handling Calute agent interactions.
This module provides the completion service infrastructure for Calute, including: - Non-streaming chat completions with full response generation - Streaming chat completions with server-sent events - Request parameter application to agents - Integration with Calute’s agent execution system
The service follows the OpenAI-compatible API format for chat completions and supports both synchronous and asynchronous response generation.
- class calute.api_server.completion_service.CompletionService(calute: Calute, can_overide_samplings: bool = False)[source]#
Bases:
objectService for handling chat completions with Calute agents.
Provides the core functionality for processing chat completion requests, including both streaming and non-streaming responses. This service wraps the Calute agent execution system and formats responses according to the OpenAI-compatible API specification.
The service is used internally by the
ChatRouterandUnifiedChatRouterto delegate completion logic away from the HTTP routing layer.- calute#
The
Caluteinstance used for running agent completions.
- can_overide_samplings#
Flag indicating whether request parameters can override agent sampling settings. When
True, parameters liketemperature,top_p,max_tokens,stop,presence_penalty,frequency_penalty,repetition_penalty,top_k, andmin_pfrom the request will be applied to the agent before execution.
Example
>>> from calute.api_server.completion_service import CompletionService >>> service = CompletionService(calute_instance, can_overide_samplings=True)
- apply_request_parameters(agent: Agent, request: ChatCompletionRequest) None[source]#
Apply sampling parameters from the request to the agent.
Conditionally transfers sampling parameters from the incoming request to the agent configuration. This only takes effect when
can_overide_samplingsisTrue. Each parameter is applied only if it is explicitly set (notNone) in the request.- The following parameters are supported:
max_tokens: Maximum number of tokens to generate.temperature: Sampling temperature.top_p: Nucleus sampling threshold.top_k: Top-k sampling parameter.min_p: Minimum probability threshold.stop: Stop sequences for generation.presence_penalty: Presence penalty value.frequency_penalty: Frequency penalty value.repetition_penalty: Repetition penalty value.
- Parameters
agent – The
Agentinstance whose sampling settings will be modified in-place.request – The
ChatCompletionRequestcontaining the sampling parameters to apply.
- async create_completion(agent: Agent, messages: MessagesHistory, request: ChatCompletionRequest) ChatCompletionResponse[source]#
Create a non-streaming chat completion.
Executes the Calute agent with the provided messages and returns a complete response. The synchronous
calute.run()call is offloaded to a thread executor vialoop.run_in_executorto avoid blocking the async event loop.The response includes the full generated text, usage statistics (prompt tokens, completion tokens, processing time, etc.), and a finish reason of
"stop".- Parameters
agent – The
Agentinstance to use for generating the completion.messages – The
MessagesHistorycontaining the conversation context to process.request – The original
ChatCompletionRequest, used to extract the model name for the response object.
- Returns
A
ChatCompletionResponsecontaining a single choice with the assistant’s full response message, usage information (token counts, processing time, tokens per second), and finish reason"stop".
- async create_streaming_completion(agent: Agent, messages: MessagesHistory, request: ChatCompletionRequest) AsyncIterator[str][source]#
Create a streaming chat completion using server-sent events.
Executes the Calute agent in streaming mode and yields response chunks as SSE-formatted strings. Each chunk is serialized as a
ChatCompletionStreamResponseJSON object prefixed with"data: "and followed by double newlines, conforming to the SSE protocol.The method yields an
asyncio.sleep(0)after each chunk to allow the event loop to process other tasks, enabling cooperative multitasking during long-running generations.After all content chunks have been yielded, a final chunk with an empty delta and
finish_reason="stop"is emitted, followed by a"data: [DONE]"sentinel to signal stream completion.- Parameters
agent – The
Agentinstance to use for generating the streaming completion.messages – The
MessagesHistorycontaining the conversation context to process.request – The original
ChatCompletionRequest, used to extract the model name for each streamed response chunk.
- Yields
Byte-encoded and plain string SSE events. Each content event is a
bytesobject containing"data: {json}\n\n". The final"[DONE]"event is a plain string.