calute.tools.ai_tools#
AI and machine learning tools for text processing and analysis.
This module provides a comprehensive set of AI-powered text processing tools for the Calute framework. It includes: - Text embedding generation using TF-IDF, sentence-transformers, or OpenAI - Text similarity calculation with multiple metrics (cosine, Jaccard, Levenshtein, semantic) - Text classification with keyword, sentiment, language, and topic detection - Text summarization using extractive and keyword-based methods - Named entity extraction for emails, URLs, phone numbers, dates, and more
All tools are implemented as AgentBaseFn subclasses for seamless integration with Calute agents and support context_variables for runtime configuration.
Example
>>> from calute.tools.ai_tools import TextSummarizer, TextSimilarity
>>> summary = TextSummarizer.static_call("Long article text...", method="extractive")
>>> similarity = TextSimilarity.static_call("text one", "text two", method="cosine")
- class calute.tools.ai_tools.EntityExtractor(name, bases, namespace, /, **kwargs)[source]#
Bases:
AgentBaseFnExtract named entities from text.
Uses regex patterns to identify and extract various entity types including emails, URLs, phone numbers, dates, times, currency values, hashtags, mentions, and proper names.
- Inherits from AgentBaseFn for agent integration.
- static static_call(text: str, entity_types: list[str] | None = None, **context_variables) dict[str, Any][source]#
Extract named entities from text using regex pattern matching.
Scans the input text for various entity types using predefined regular expression patterns. Returns deduplicated matches for each requested entity type.
- Parameters
text – The text to extract entities from.
entity_types – List of entity types to extract. If None, extracts all supported types. Supported types: - “emails”: Email addresses. - “urls”: HTTP/HTTPS URLs. - “phone_numbers”: Phone numbers in various formats. - “dates”: Dates in common formats (YYYY-MM-DD, MM/DD/YYYY, etc.). - “times”: Time expressions (HH:MM, HH:MM:SS, with optional AM/PM). - “numbers”: Integer and decimal numbers. - “hashtags”: Hashtag expressions (#word). - “mentions”: At-mention expressions (@user). - “currency”: Currency values ($, EUR, GBP, JPY prefixed). - “names”: Proper names (capitalized multi-word sequences).
**context_variables – Runtime context from the agent (unused).
- Returns
entities (dict[str, list[str]]): Mapping of entity type to a list of unique extracted values (max 20 per type).
total_entities (int): Total number of extracted entities across all types.
- Return type
A dictionary containing
Example
>>> result = EntityExtractor.static_call( ... "Contact john@example.com or visit https://example.com", ... entity_types=["emails", "urls"] ... ) >>> print(result["entities"]["emails"]) ['john@example.com']
- class calute.tools.ai_tools.TextClassifier(name, bases, namespace, /, **kwargs)[source]#
Bases:
AgentBaseFnClassify text into categories using various methods.
Supports keyword-based classification, sentiment analysis, language detection, and topic classification. Uses simple heuristic methods that work without external ML dependencies.
- Inherits from AgentBaseFn for agent integration.
- static static_call(text: str, categories: list[str] | None = None, method: str = 'keyword', **context_variables) dict[str, Any][source]#
Classify text into categories using heuristic methods.
Applies the selected classification method to determine the category, sentiment, language, or topic of the input text. All methods are lightweight and do not require external ML models.
- Parameters
text – The text to classify.
categories – List of candidate category labels. Required when method is “keyword”; ignored for other methods.
method –
Classification method to use. Options: - “keyword”: Match category labels against text content.
Requires the
categoriesargument.”sentiment”: Simple lexicon-based sentiment analysis returning positive, negative, or neutral.
”language”: Detect the language of the text using common word indicators (supports English, Spanish, French, German, Italian).
”topic”: Classify into predefined topics (technology, business, science, health, education) using keyword matching.
**context_variables – Runtime context from the agent (unused).
- Returns
- For “keyword”:
category (str): Best matching category.
confidence (float): Confidence score (0 to 1).
scores (dict): Per-category match counts.
- For “sentiment”:
sentiment (str): “positive”, “negative”, or “neutral”.
confidence (float): Confidence score (0 to 1).
positive_score (int): Count of positive word matches.
negative_score (int): Count of negative word matches.
- For “language”:
language (str): Detected language name.
confidence (float): Confidence score.
scores (dict): Per-language match counts.
- For “topic”:
topic (str): Detected topic label.
confidence (float): Confidence score (0 to 1).
scores (dict): Per-topic match counts.
error (str): Error message if the operation failed.
- Return type
A dictionary containing method-specific results
Example
>>> result = TextClassifier.static_call( ... "The algorithm processes data efficiently", ... method="topic" ... ) >>> print(result["topic"]) 'technology'
- class calute.tools.ai_tools.TextEmbedder(name, bases, namespace, /, **kwargs)[source]#
Bases:
AgentBaseFnGenerate text embeddings using various methods.
Supports multiple embedding backends including TF-IDF, sentence-transformers, and OpenAI embeddings. Falls back to simple word frequency vectors when sklearn is not available.
- Inherits from AgentBaseFn for agent integration.
- static static_call(text: str | list[str], method: str = 'tfidf', model_name: str | None = None, max_length: int = 512, **context_variables) dict[str, Any][source]#
Generate text embeddings using the specified method.
Converts one or more text strings into numerical vector representations. Supports TF-IDF (with sklearn fallback to word frequency), sentence-transformers for dense semantic embeddings, and OpenAI embedding API.
- Parameters
text – A single text string or a list of text strings to embed.
method –
Embedding method to use. Options: - “tfidf”: TF-IDF vectorization via sklearn (falls back to word
frequency vectors if sklearn is not installed).
”sentence-transformers”: Dense semantic embeddings using the sentence-transformers library.
”openai”: Embeddings via the OpenAI API (requires an OpenAI client in context_variables).
model_name – Model identifier for the embedding backend. Used by sentence-transformers (default: “all-MiniLM-L6-v2”) and OpenAI (default: “text-embedding-ada-002”). Ignored for TF-IDF.
max_length – Maximum number of characters per text. Texts longer than this are truncated before embedding.
**context_variables – Runtime context from the agent. For the “openai” method, must contain an “openai_client” key with an initialized OpenAI client instance.
- Returns
embeddings: List of embedding vectors (list of lists of floats).
shape: Tuple of (num_texts, embedding_dimension).
features: Top feature names (for TF-IDF method).
model: Model name used (for sentence-transformers and OpenAI).
usage: Token usage information (for OpenAI method).
error: Error message if the operation failed.
- Return type
A dictionary containing
Example
>>> result = TextEmbedder.static_call("Hello world", method="tfidf") >>> print(result["shape"]) (1, 2)
- class calute.tools.ai_tools.TextSimilarity(name, bases, namespace, /, **kwargs)[source]#
Bases:
AgentBaseFnCalculate text similarity using various metrics.
Provides multiple similarity calculation methods including cosine similarity, Jaccard index, Levenshtein distance, and semantic similarity using sentence embeddings.
- Inherits from AgentBaseFn for agent integration.
- static static_call(text1: str, text2: str, method: str = 'cosine', **context_variables) dict[str, Any][source]#
Calculate the similarity between two text strings.
Computes a similarity score using the chosen metric. All methods produce a normalized score in the range [0, 1] (or [-1, 1] for semantic), along with a human-readable interpretation of the result.
- Parameters
text1 – The first text to compare.
text2 – The second text to compare.
method –
Similarity metric to use. Options: - “cosine”: Cosine similarity on word frequency vectors.
Scale: 0 to 1 (1 = identical).
”jaccard”: Jaccard index on word sets (intersection / union). Scale: 0 to 1. Also returns common words.
”levenshtein”: Normalized Levenshtein edit distance. Scale: 0 to 1 (1 = identical). Also returns raw distance.
”semantic”: Cosine similarity on sentence-transformer embeddings. Scale: -1 to 1. Requires the sentence-transformers package.
**context_variables – Runtime context from the agent (unused).
- Returns
similarity (float): The computed similarity score.
method (str): The method used for comparison.
scale (str): Description of the score range.
interpretation (str): Human-readable strength label (“Very high”, “High”, “Moderate”, “Low”, “Very low”).
common_words (list[str]): Shared words (Jaccard only).
distance (int): Raw edit distance (Levenshtein only).
model (str): Embedding model used (semantic only).
error (str): Error message if the operation failed.
- Return type
A dictionary containing
Example
>>> result = TextSimilarity.static_call("hello world", "hello there") >>> print(result["similarity"]) 0.5
- class calute.tools.ai_tools.TextSummarizer(name, bases, namespace, /, **kwargs)[source]#
Bases:
AgentBaseFnSummarize text using various techniques.
Provides extractive summarization, keyword extraction, and statistical analysis of text. Uses sentence scoring based on word frequency for extractive summaries.
- Inherits from AgentBaseFn for agent integration.
- static static_call(text: str, method: str = 'extractive', max_sentences: int = 3, max_length: int | None = None, **context_variables) dict[str, Any][source]#
Generate a summary of the input text.
Supports extractive summarization (selecting important sentences), keyword extraction (identifying key terms and phrases), and statistical analysis (computing text metrics).
- Parameters
text – The text to summarize.
method –
Summarization method to use. Options: - “extractive”: Select the most important sentences based
on word frequency scoring. Returns a condensed version of the original text.
”keywords”: Extract the most frequent meaningful words and bigram phrases from the text.
”statistics”: Compute text statistics including word count, sentence count, vocabulary richness, and sentence length metrics.
max_sentences – Maximum number of sentences to include in an extractive summary. Defaults to 3.
max_length – Maximum character length for the summary output. If the summary exceeds this, it is truncated with “…”. Only applies to the “extractive” method. None means no limit.
**context_variables – Runtime context from the agent (unused).
- Returns
- For “extractive”:
summary (str): The extracted summary text.
original_length (int): Character count of original text.
summary_length (int): Character count of summary.
compression_ratio (float): Summary length / original length.
- For “keywords”:
keywords (list[str]): Top 10 most frequent words.
key_phrases (list[str]): Top 5 bigram phrases.
summary (str): Brief description of key topics.
- For “statistics”:
summary (dict): Dictionary with total_characters, total_words, unique_words, vocabulary_richness, total_sentences, avg_sentence_length, longest_sentence, shortest_sentence.
error (str): Error message if the operation failed.
- Return type
A dictionary containing method-specific results
Example
>>> result = TextSummarizer.static_call( ... "Long article text here...", ... method="extractive", ... max_sentences=2 ... ) >>> print(result["summary"])