# Agno (/docs/agno) Headroom integrates with [Agno](https://github.com/agno-agi/agno) (formerly Phidata) to compress context for AI agents. Wrap any Agno model for automatic optimization, and use hooks for observability. Installation [#installation] ```bash pip install "headroom-ai[agno]" agno ``` Quick start [#quick-start] ```python from agno.agent import Agent from agno.models.openai import OpenAIChat from headroom.integrations.agno import HeadroomAgnoModel model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o")) agent = Agent(model=model) response = agent.run("What's the capital of France?") print(f"Tokens saved: {model.total_tokens_saved}") print(model.get_savings_summary()) # {'total_requests': 1, 'total_tokens_saved': 245, 'average_savings_percent': 12.3} ``` Works with any Agno provider: ```python from agno.models.anthropic import Claude from agno.models.google import Gemini claude_model = HeadroomAgnoModel(Claude(id="claude-sonnet-4-20250514")) gemini_model = HeadroomAgnoModel(Gemini(id="gemini-2.0-flash")) ``` Observability hooks [#observability-hooks] Use hooks for detailed tracking without modifying your model: ```python from headroom.integrations.agno import ( HeadroomAgnoModel, HeadroomPreHook, HeadroomPostHook, ) model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o")) pre_hook = HeadroomPreHook() post_hook = HeadroomPostHook(token_alert_threshold=10000) agent = Agent( model=model, pre_hooks=[pre_hook], post_hooks=[post_hook], ) response = agent.run("Analyze this large dataset...") # Check for alerts if post_hook.alerts: print(f"{len(post_hook.alerts)} requests exceeded threshold") ``` Or use the convenience factory: ```python from headroom.integrations.agno import create_headroom_hooks pre_hook, post_hook = create_headroom_hooks( token_alert_threshold=5000, log_level="DEBUG", ) ``` Tool-heavy agents [#tool-heavy-agents] Tool outputs (JSON, logs, search results) see the biggest compression gains at 70-90% reduction: ```python from agno.tools.duckduckgo import DuckDuckGoTools model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o")) agent = Agent( model=model, tools=[DuckDuckGoTools()], show_tool_calls=True, ) response = agent.run("Research the latest AI developments") print(f"Tokens saved: {model.total_tokens_saved}") ``` Async support [#async-support] ```python import asyncio async def process(): model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o")) response = await model.aresponse(messages) async for chunk in model.aresponse_stream(messages): print(chunk, end="", flush=True) asyncio.run(process()) ``` Standalone message optimization [#standalone-message-optimization] Optimize messages without wrapping a model: ```python from headroom.integrations.agno import optimize_messages optimized, metrics = optimize_messages(messages, model="gpt-4o") print(f"Tokens saved: {metrics['tokens_saved']}") ``` Session management [#session-management] Reset metrics between sessions: ```python model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o")) # Session 1 agent.run("First conversation...") print(model.get_savings_summary()) # Reset for new session model.reset() # Session 2 starts fresh agent.run("Second conversation...") ``` Supported providers [#supported-providers] | Provider | Agno Model | Auto-Detected | | --------- | -------------------------- | ------------- | | OpenAI | `OpenAIChat`, `OpenAILike` | Yes | | Anthropic | `Claude`, `AwsBedrock` | Yes | | Google | `Gemini`, `VertexAI` | Yes | | Groq | `Groq` | Yes | | Mistral | `Mistral` | Yes | | Ollama | `Ollama` | Yes | # Anthropic SDK (/docs/anthropic-sdk) 
Headroom wraps the Anthropic TypeScript SDK to automatically compress messages before every `messages.create()` call. All other methods pass through unchanged. Installation [#installation] ```bash npm install headroom-ai @anthropic-ai/sdk ``` The TypeScript SDK sends messages to a local Headroom proxy for compression. Start the proxy before using the SDK: ```bash pip install "headroom-ai[proxy]" headroom proxy ``` Quick start [#quick-start] ```ts twoslash import { withHeadroom } from 'headroom-ai/anthropic'; import Anthropic from '@anthropic-ai/sdk'; const client = withHeadroom(new Anthropic()); const response = await client.messages.create({ model: 'claude-sonnet-4-5-20250929', messages: longConversation, max_tokens: 1024, }); ``` Every call to `client.messages.create()` compresses messages first. The response format is identical to the unwrapped client. How it works [#how-it-works] `withHeadroom()` returns a proxy around your Anthropic client that intercepts `messages.create()`: 1. Converts Anthropic-format messages to OpenAI format (the compression engine's native format) 2. Sends them to the Headroom proxy's `/v1/compress` endpoint 3. Converts the compressed messages back to Anthropic format 4. Forwards the request to Anthropic as normal Message format conversion [#message-format-conversion] The adapter handles the full Anthropic message format including content blocks: | Anthropic format | OpenAI format | | ----------------------------------------------- | --------------------------------------------------------- | | `{ type: "text", text: "..." }` | `{ role: "user", content: "..." }` | | `{ type: "tool_use", id, name, input }` | `{ tool_calls: [{ id, function: { name, arguments } }] }` | | `{ type: "tool_result", tool_use_id, content }` | `{ role: "tool", tool_call_id, content }` | This conversion is lossless. Your request and response behave identically to an unwrapped client. Options [#options] Pass compression options as the second argument: ```ts twoslash import { withHeadroom } from 'headroom-ai/anthropic'; import Anthropic from '@anthropic-ai/sdk'; const client = withHeadroom(new Anthropic(), { model: 'claude-sonnet-4-5-20250929', baseUrl: 'http://localhost:8787', }); ``` Streaming [#streaming] Streaming works normally. Compression happens before the request: ```ts twoslash import { withHeadroom } from 'headroom-ai/anthropic'; import Anthropic from '@anthropic-ai/sdk'; const client = withHeadroom(new Anthropic()); const stream = await client.messages.create({ model: 'claude-sonnet-4-5-20250929', messages: longConversation, max_tokens: 1024, stream: true, }); ``` Tool use [#tool-use] Tool results are where compression has the biggest impact. Large JSON payloads from tool calls are compressed automatically: ```ts twoslash import { withHeadroom } from 'headroom-ai/anthropic'; import Anthropic from '@anthropic-ai/sdk'; const client = withHeadroom(new Anthropic()); const response = await client.messages.create({ model: 'claude-sonnet-4-5-20250929', max_tokens: 1024, messages: [ { role: 'user', content: 'What went wrong?' 
}, { role: 'assistant', content: [ { type: 'tool_use', id: 'toolu_1', name: 'get_logs', input: { service: 'api' } }, ], }, { role: 'user', content: [ { type: 'tool_result', tool_use_id: 'toolu_1', content: hugeLogOutput, // Compressed automatically }, ], }, ], tools: [{ name: 'get_logs', description: 'Get logs', input_schema: { type: 'object', properties: {} } }], }); ``` # API Reference (/docs/api-reference) Complete API reference for the Headroom Python and TypeScript SDKs. Core [#core] HeadroomClient [#headroomclient] The main entry point for the Headroom SDK. ```ts twoslash import { HeadroomClient } from 'headroom-ai'; const client = new HeadroomClient({ baseUrl: 'http://localhost:8787', apiKey: 'your-api-key', timeout: 30_000, fallback: true, retries: 2, }); ``` **Constructor Parameters** ```python from headroom import HeadroomClient, OpenAIProvider from openai import OpenAI client = HeadroomClient( original_client=OpenAI(), provider=OpenAIProvider(), default_mode="optimize", ) ``` chat.completions.create() [#chatcompletionscreate] Create a chat completion with optional optimization. The TypeScript SDK uses `compress()` to optimize messages before sending them to your LLM client: ```ts twoslash import { compress } from 'headroom-ai'; const result = await compress(messages, { model: 'gpt-4o', tokenBudget: 100_000, }); // Then pass result.messages to your LLM client ``` Accepts all standard OpenAI/Anthropic parameters plus Headroom-specific overrides: ```python response = client.chat.completions.create( model="gpt-4o", messages=[...], headroom_mode="optimize", headroom_keep_turns=5, headroom_tool_profiles={ "important_tool": {"skip_compression": True}, }, ) ``` chat.completions.simulate() [#chatcompletionssimulate] Preview optimization without making an API call. ```python plan = client.chat.completions.simulate( model="gpt-4o", messages=[...], ) print(f"Tokens: {plan.tokens_before} -> {plan.tokens_after}") print(f"Savings: {plan.savings_percent:.1f}%") print(f"Transforms: {plan.transforms_applied}") ``` **Returns:** `SimulationResult` compress() (TypeScript) [#compress-typescript] Top-level function to compress messages via the Headroom proxy. ```ts twoslash import { compress } from 'headroom-ai'; const result = await compress(messages, { model: 'gpt-4o', baseUrl: 'http://localhost:8787', timeout: 15_000, fallback: true, retries: 2, tokenBudget: 100_000, }); ``` get_stats() [#get_stats] Quick stats for the current session (no database query). ```python stats = client.get_stats() # Returns dict with "session", "config", and "transforms" keys ``` get_metrics() [#get_metrics] Query stored metrics from the database. ```python from datetime import datetime, timedelta metrics = client.get_metrics( start_time=datetime.utcnow() - timedelta(hours=1), limit=100, ) ``` get_summary() [#get_summary] Aggregate statistics across all stored metrics. ```python summary = client.get_summary() # Returns dict with total_requests, total_tokens_saved, # avg_compression_ratio, total_cost_saved_usd ``` validate_setup() [#validate_setup] Validate that the client is configured correctly. 
```python result = client.validate_setup() if not result["valid"]: for issue in result["issues"]: print(f" - {issue}") ``` *** Configuration [#configuration] SmartCrusherConfig [#smartcrusherconfig] ```python from headroom import SmartCrusherConfig config = SmartCrusherConfig( min_tokens_to_crush=200, max_items_after_crush=50, keep_first=3, keep_last=2, relevance_threshold=0.3, anomaly_std_threshold=2.0, preserve_errors=True, ) ``` CacheAlignerConfig [#cachealignerconfig] ```python from headroom import CacheAlignerConfig config = CacheAlignerConfig( enabled=True, extract_dates=True, normalize_whitespace=True, stable_prefix_min_tokens=100, ) ``` RollingWindowConfig [#rollingwindowconfig] ```python from headroom import RollingWindowConfig config = RollingWindowConfig( max_tokens=100000, preserve_system=True, preserve_recent_turns=5, drop_oldest_first=True, ) ``` IntelligentContextConfig [#intelligentcontextconfig] ```python from headroom.config import IntelligentContextConfig, ScoringWeights config = IntelligentContextConfig( enabled=True, keep_system=True, keep_last_turns=2, output_buffer_tokens=4000, use_importance_scoring=True, scoring_weights=ScoringWeights(), toin_integration=True, ) ``` ScoringWeights [#scoringweights] Weights are automatically normalized to sum to 1.0. HeadroomConfig [#headroomconfig] The top-level config object that contains all sub-configurations: ```python from headroom import HeadroomConfig config = HeadroomConfig() config.smart_crusher.min_tokens_to_crush = 100 config.cache_aligner.enabled = True config.rolling_window.preserve_recent_turns = 3 ``` RelevanceScorerConfig [#relevancescorerconfig] *** Results [#results] CompressResult (TypeScript) [#compressresult-typescript] SimulationResult (Python) [#simulationresult-python] WasteSignals (Python) [#wastesignals-python] RequestMetrics (Python) [#requestmetrics-python] *** Providers [#providers] OpenAIProvider [#openaiprovider] ```python from headroom import OpenAIProvider provider = OpenAIProvider( enable_prefix_caching=True, ) counter = provider.get_token_counter("gpt-4o") tokens = counter.count_text("Hello, world!") limit = provider.get_context_limit("gpt-4o") # 128000 cost = provider.estimate_cost(input_tokens=1000, output_tokens=500, model="gpt-4o") ``` AnthropicProvider [#anthropicprovider] ```python from headroom import AnthropicProvider from anthropic import Anthropic provider = AnthropicProvider( client=Anthropic(), enable_cache_control=True, ) counter = provider.get_token_counter("claude-3-5-sonnet-latest") tokens = counter.count_messages(messages) # Accurate count via API ``` GoogleProvider [#googleprovider] ```python from headroom import GoogleProvider provider = GoogleProvider( enable_context_caching=True, ) ``` *** Relevance Scoring [#relevance-scoring] create_scorer() [#create_scorer] Factory function to create scorers: ```python from headroom import create_scorer # Auto-select best available scorer scorer = create_scorer() # Explicitly choose type scorer = create_scorer(scorer_type="hybrid", alpha=0.7) ``` BM25Scorer [#bm25scorer] Fast keyword-based scoring (zero dependencies): ```python from headroom import BM25Scorer scorer = BM25Scorer() scores = scorer.score_items(items=["item 1", "item 2"], query="search query") ``` EmbeddingScorer [#embeddingscorer] Semantic similarity scoring (requires `headroom-ai[relevance]`): ```python from headroom import EmbeddingScorer, embedding_available if embedding_available(): scorer = EmbeddingScorer(model="all-MiniLM-L6-v2") scores = scorer.score_items(items, 
query) ``` HybridScorer [#hybridscorer] Combines BM25 and embeddings: ```python from headroom import HybridScorer scorer = HybridScorer(alpha=0.5) # 50% BM25, 50% embedding scores = scorer.score_items(items, query) ``` *** Transforms (Direct Use) [#transforms-direct-use] SmartCrusher [#smartcrusher] ```python from headroom import SmartCrusher crusher = SmartCrusher() result = crusher.crush(data={"results": [...]}, query="user query") ``` CacheAligner [#cachealigner] ```python from headroom import CacheAligner aligner = CacheAligner() result = aligner.align(messages) ``` RollingWindow [#rollingwindow] ```python from headroom import RollingWindow window = RollingWindow(config) result = window.apply(messages, max_tokens=100000) ``` IntelligentContextManager [#intelligentcontextmanager] ```python from headroom.transforms import IntelligentContextManager from headroom.config import IntelligentContextConfig config = IntelligentContextConfig( keep_system=True, keep_last_turns=2, use_importance_scoring=True, ) manager = IntelligentContextManager(config, toin=toin) result = manager.apply(messages, tokenizer, model_limit=128000) ``` TransformPipeline [#transformpipeline] ```python from headroom import TransformPipeline pipeline = TransformPipeline([ SmartCrusher(), CacheAligner(), RollingWindow(), ]) result = pipeline.transform(messages) ``` *** Errors [#errors] | Exception | Meaning | | ------------------------- | ------------------------------------------------------- | | `HeadroomError` | Base class for all errors | | `HeadroomConnectionError` | Cannot reach proxy | | `HeadroomAuthError` | 401 from proxy | | `HeadroomCompressError` | Compression failed (includes `statusCode`, `errorType`) | | `ConfigurationError` | Invalid configuration | | `ProviderError` | Provider issues | | `StorageError` | Storage failures | | `TokenizationError` | Token counting failed | | `CacheError` | Cache operations failed | | `ValidationError` | Validation failures | | `TransformError` | Transform execution failed | Use `mapProxyError(status, type, message)` to convert proxy error responses to the correct class. | Exception | Meaning | | -------------------- | ------------------------------------ | | `HeadroomError` | Base class for all Headroom errors | | `ConfigurationError` | Invalid config values | | `ProviderError` | Provider issue (unknown model, etc.) | | `StorageError` | Database issue | | `CompressionError` | Compression failed (rare) | | `ValidationError` | Setup validation failed | All exceptions include a `details` dict with additional context. *** Utilities [#utilities] Tokenizer [#tokenizer] ```python from headroom import Tokenizer, count_tokens_text, count_tokens_messages # Quick counting tokens = count_tokens_text("Hello, world!", model="gpt-4o") # With tokenizer instance tokenizer = Tokenizer(model="gpt-4o") tokens = tokenizer.count_text("Hello") tokens = tokenizer.count_messages(messages) ``` generate_report() [#generate_report] Generate HTML/Markdown reports from stored metrics: ```python from headroom import generate_report report = generate_report( store_url="sqlite:///headroom.db", format="html", period="day", ) ``` *** TypeScript Message Types [#typescript-message-types] The TypeScript SDK uses the standard OpenAI message format with `SystemMessage`, `UserMessage`, `AssistantMessage`, and `ToolMessage` variants. # Architecture (/docs/architecture) Headroom sits between your application and the LLM provider. 
It intercepts messages, compresses them intelligently, and forwards the optimized request. The response comes back unchanged. High-Level Flow [#high-level-flow] ``` +---------------------------------------------------------------+ | YOUR APPLICATION | +---------------------------------------------------------------+ | v +---------------------------------------------------------------+ | HEADROOM CLIENT | | +-----------+ +------------+ +---------+ | | | ANALYZE | > | TRANSFORM | > | CALL | | | | (Parser) | | (Pipeline)| | (API) | | | +-----------+ +------------+ +---------+ | | | | | | | v v v | | Count tokens Apply compressions Send to LLM provider | | Detect waste Preserve meaning Log metrics | +---------------------------------------------------------------+ | v +---------------------------------------------------------------+ | OPENAI / ANTHROPIC / GOOGLE | +---------------------------------------------------------------+ ``` Entry Points [#entry-points] Headroom can be used in three ways, all feeding into the same pipeline: | Entry Point | How It Works | Code Changes | | ---------------- | ------------------------------------------------ | ---------------------------------- | | **SDK Mode** | Wrap your LLM client with `HeadroomClient` | Minimal -- swap client constructor | | **Proxy Mode** | Run `headroom proxy` and point your client at it | Zero -- just change the base URL | | **Integrations** | LangChain, Vercel AI SDK, Agno adapters | Framework-specific setup | The Transform Pipeline [#the-transform-pipeline] Messages flow through a sequence of transforms. Each transform is independent, safe to skip, and fails gracefully (returns original content unchanged). Stage 1: Cache Aligner [#stage-1-cache-aligner] Extracts dynamic content (dates, UUIDs, session tokens) from your system prompt and moves it to the end. This stabilizes the prefix so provider caches (Anthropic `cache_control`, OpenAI prefix caching) can hit on repeated calls. ``` Before: "You are helpful. Current Date: 2024-12-15" ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Changes daily = cache miss every day After: "You are helpful." [stable prefix] "[Context: Current Date: 2024-12-15]" [dynamic tail] ``` Overhead: sub-millisecond. Stage 2: Smart Crusher [#stage-2-smart-crusher] Analyzes tool output content and compresses it using statistical methods. This is where the bulk of token savings come from. **What it does:** 1. Parses JSON arrays in tool outputs 2. Runs field-level statistical analysis (variance, uniqueness, change points) 3. Selects a representative subset using the Kneedle algorithm on bigram coverage 4. Preserves errors, anomalies, and distribution boundaries unconditionally 5. Factors out constant fields shared by all items **Strategies by content type:** | Content | Strategy | Typical Savings | | ---------------------- | ------------------------------------------- | --------------- | | JSON arrays of dicts | Statistical sampling + anomaly preservation | 83--95% | | JSON arrays of strings | Dedup + adaptive sampling | 60--90% | | JSON arrays of numbers | Statistical summary + outlier preservation | 70--85% | | Build/test logs | Pattern clustering | 85--94% | | HTML | Article extraction (trafilatura-based) | \~95% | **Item retention split:** 30% from array start (schema), 15% from end (recency), 55% by importance score. Error items are always kept regardless of budget. Overhead: 1--50ms for typical payloads. Scales linearly with input size. 
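To see Stage 2 in isolation, SmartCrusher can also be invoked directly with the same call shown in the [API Reference](/docs/api-reference). A minimal sketch (the sample payload is illustrative; the comments describe the behavior documented above, and the exact shape of the returned result may vary by version):

```python
from headroom import SmartCrusher

# A typical tool output: many near-identical rows plus one anomaly
data = {
    "results": [
        {"id": i, "status": "ok", "latency_ms": 120 + (i % 7)}
        for i in range(500)
    ] + [
        {"id": 500, "status": "error", "message": "connection reset"}
    ]
}

crusher = SmartCrusher()
result = crusher.crush(data=data, query="which requests failed?")

# The compressed form keeps a schema sample from the start, a recency sample
# from the end, and the error item unconditionally; the bulk of the "ok" rows
# is factored out statistically.
print(result)
```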
Stage 3: Context Manager [#stage-3-context-manager] Ensures the final message array fits within the model's context window. **Rolling Window** (default): Drops oldest messages first, preserving system prompt and recent turns. Tool calls and their responses are dropped as atomic units. **Intelligent Context** (advanced): Scores every message on six dimensions (recency, semantic similarity, TOIN importance, error indicators, forward references, token density) and drops the lowest-scored messages first. Dropped messages are stored in CCR for potential retrieval. Overhead: sub-millisecond for Rolling Window; depends on scoring config for Intelligent Context. Provider Cache Optimization [#provider-cache-optimization] After the pipeline, Headroom applies provider-specific cache hints: | Provider | Mechanism | Savings | | --------- | --------------------------------------- | -------------------------- | | Anthropic | `cache_control` blocks on stable prefix | Up to 90% on cached tokens | | OpenAI | Prefix alignment for automatic caching | Up to 50% on cached tokens | | Google | `CachedContent` API | Up to 75% on cached tokens | CCR: Compress-Cache-Retrieve [#ccr-compress-cache-retrieve] When SmartCrusher compresses a tool output or Intelligent Context drops messages, the original content is stored in a local compression cache. If the LLM needs the full data, it can request retrieval via a `ccr_retrieve` tool call. This makes compression reversible. ``` Compress: 1000 items -> 15 items (stored original in CCR) Cache: Hash-indexed local store (SQLite) Retrieve: LLM calls ccr_retrieve("abc123") -> original 1000 items ``` TOIN: Tool Output Intelligence Network [#toin-tool-output-intelligence-network] TOIN learns compression patterns across sessions and users. When a tool is used repeatedly, TOIN builds up statistics about which fields matter, which items get retrieved, and what compression strategies work best. These learned patterns feed back into SmartCrusher and Intelligent Context scoring. Cold start: For new tool types, TOIN falls back to statistical heuristics. Patterns build up over time as tools are used. What Headroom Does NOT Touch [#what-headroom-does-not-touch] * **User messages**: Never compressed (the user's intent must be preserved exactly) * **System prompts**: Content preserved; only dynamic parts are relocated for caching * **Code**: Passes through unchanged unless tree-sitter AST compression is explicitly enabled * **Model responses**: Returned unchanged from the provider * **Short content**: Tool outputs under 200 tokens pass through (overhead exceeds savings) # Benchmarks (/docs/benchmarks) Headroom's core promise: compress context without losing accuracy. This page covers compression benchmarks, accuracy evaluations, latency overhead, and production telemetry. Compression Performance [#compression-performance] Tested on Apple M-series (CPU), Headroom v0.5.18. Each test runs `compress()` on realistic tool outputs. 
| Content Type | Original | Compressed | Saved | Ratio | Latency | | --------------------------- | ---------- | ---------- | ---------- | --------- | ------- | | JSON array (100 items) | 3,163 | 297 | 2,866 | **90.6%** | 1ms | | JSON array (500 items) | 9,526 | 1,614 | 7,912 | **83.1%** | 2ms | | Shell output (200 lines) | 3,238 | 469 | 2,769 | **85.5%** | 1ms | | Build log (200 lines) | 2,412 | 148 | 2,264 | **93.9%** | 1ms | | grep results (150 hits) | 2,624 | 2,624 | 0 | 0.0% | \<1ms | | Python source (\~480 lines) | 2,958 | 2,958 | 0 | 0.0% | \<1ms | | **Total** | **23,921** | **8,110** | **15,811** | **66.1%** | **5ms** | grep results and Python source show 0% compression. These are already compact structured formats. SmartCrusher only compresses JSON arrays; code passes through to preserve correctness. Accuracy Benchmarks [#accuracy-benchmarks] HTML Extraction [#html-extraction] **Dataset**: Scrapinghub Article Extraction Benchmark (181 HTML pages with ground truth) | Metric | Value | | --------------- | ----- | | **F1 Score** | 0.919 | | **Precision** | 0.879 | | **Recall** | 0.982 | | **Compression** | 94.9% | For LLM applications, recall is critical -- 98.2% means nearly all article content is preserved. The slight precision drop (some extra content) does not hurt LLM accuracy. JSON Compression (SmartCrusher) [#json-compression-smartcrusher] **Test**: 100 production log entries with critical error at position 67. Task: find the error, error code, resolution, and affected count. | Metric | Baseline | Headroom | | --------------- | -------- | --------- | | Input tokens | 10,144 | 1,260 | | Correct answers | 4/4 | **4/4** | | Compression | -- | **87.6%** | SmartCrusher preserves first N items (schema), last N items (recency), all anomalies (errors, warnings), and statistical distribution. QA Accuracy Preservation [#qa-accuracy-preservation] | Metric | Original HTML | Extracted | Delta | | ----------- | ------------- | --------- | ----- | | F1 Score | 0.85 | 0.87 | +0.02 | | Exact Match | 60% | 62% | +2% | Removing HTML noise sometimes helps LLMs focus on relevant content, leading to slightly higher scores on extraction benchmarks. 
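To preview the same effect on your own payloads, the SDK's `simulate()` call (documented in the [API Reference](/docs/api-reference)) reports the token delta without making a provider call. A minimal sketch, assuming `messages` holds a real conversation that includes large tool outputs:

```python
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",
)

# messages: your own conversation, including tool outputs
plan = client.chat.completions.simulate(model="gpt-4o", messages=messages)
print(f"Tokens: {plan.tokens_before} -> {plan.tokens_after}")
print(f"Savings: {plan.savings_percent:.1f}%")
print(f"Transforms: {plan.transforms_applied}")
```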
Latency Overhead [#latency-overhead] SDK Compression Latency [#sdk-compression-latency] Measured per-scenario on Apple M-series (CPU): | Scenario | Tokens In | Tokens Out | Saved | p50 (ms) | p95 (ms) | | -------------------------------- | --------- | ---------- | ----- | -------- | -------- | | JSON: Search Results (100 items) | 10.2K | 1.5K | 8.7K | 189 | 231 | | JSON: Search Results (500 items) | 50.2K | 1.5K | 48.7K | 943 | 955 | | JSON: Search Results (1K items) | 100.5K | 1.5K | 99.0K | 2,012 | 2,198 | | JSON: API Responses (500 items) | 38.9K | 1.1K | 37.8K | 743 | 776 | | JSON: Database Rows (1K rows) | 43.7K | 605 | 43.1K | 961 | 1,104 | | JSON: String Array (100 strings) | 1.1K | 231 | 820 | 15 | 15 | | JSON: String Array (500 strings) | 4.9K | 233 | 4.6K | 72 | 80 | | JSON: Number Array (200 numbers) | 1.2K | 192 | 1.1K | 31 | 62 | | JSON: Mixed Array (250 items) | 2.3K | 368 | 1.9K | 38 | 40 | Cost-Benefit Analysis [#cost-benefit-analysis] Net latency benefit = LLM time saved from fewer tokens minus compression overhead (at Claude Sonnet pricing, $3.0/MTok): | Scenario | Compress (ms) | LLM Saved (ms) | Net Benefit | Savings per 1K Requests | | -------------------------------- | ------------- | -------------- | ----------- | ----------------------- | | JSON: Search Results (100 items) | 189 | 261 | **+72ms** | $26 | | JSON: Search Results (500 items) | 943 | 1,461 | **+518ms** | $146 | | JSON: Search Results (1K items) | 2,012 | 2,969 | **+957ms** | $297 | | JSON: API Responses (500 items) | 743 | 1,134 | **+391ms** | $113 | | JSON: Database Rows (1K rows) | 961 | 1,292 | **+331ms** | $129 | Compression pays for itself in latency for 11 of 12 tested scenarios against Claude Sonnet. Slower and more expensive models (Opus) benefit even more. Pipeline Step Timing [#pipeline-step-timing] | Step | Median | P90 | Description | | --------------------- | ------ | ----- | -------------------------------- | | `pipeline_total` | 16.9ms | 289ms | Full compression pipeline | | `content_router` | 11.7ms | 259ms | Content detection + routing | | `smart_crusher` | 50.1ms | 50ms | JSON array compression | | `text_compressor` | 32.0ms | 576ms | Text compression (Kompress ONNX) | | `initial_token_count` | 2.9ms | 16ms | Token counting (tiktoken) | ContentRouter accounts for 91--98% of pipeline cost on average. CacheAligner and RollingWindow are sub-millisecond. Production Telemetry [#production-telemetry] Real-world data from **50,000+ proxy sessions** across 250+ unique instances (March--April 2026). Collected via anonymous telemetry (opt-out: `HEADROOM_TELEMETRY=off`). Proxy Overhead [#proxy-overhead] | Percentile | Latency | | ---------------- | -------- | | **Median (P50)** | **52ms** | | P90 | 309ms | | P99 | 4,172ms | | Mean | 161ms | The median 52ms overhead is negligible compared to LLM inference time (typically 2--10 seconds). Compression Rate [#compression-rate] | Percentile | Compression | | ---------- | ----------- | | P25 | 4.8% | | **Median** | **4.8%** | | P75 | 6.9% | | Mean | 11.3% | Median compression is modest because many requests are short conversational turns. Heavy tool-use sessions (file reads, shell output) see 40--80% compression. 
Fleet Summary [#fleet-summary] | Metric | Value | | ------------------ | -------------------------------- | | Clean instances | 249 | | Total tokens saved | 1.4 billion | | Total savings | \~$4,000 | | OS distribution | Linux 57%, macOS 38%, Windows 5% | Reproducing Results [#reproducing-results] ```bash git clone https://github.com/chopratejas/headroom.git cd headroom pip install -e ".[evals,html]" pytest tests/test_evals/ -v -s ``` # Cache Optimization (/docs/cache-optimization) LLM providers cache prompt prefixes to avoid reprocessing identical input on repeated calls. Headroom's **CacheAligner** stabilizes your message prefixes so these caches actually hit, and then applies provider-specific strategies to maximize savings. How CacheAligner works [#how-cachealigner-works] System prompts often contain dynamic content -- today's date, session IDs, timestamps -- that changes between requests. Even a single character difference at the start of a prompt invalidates the entire provider cache. CacheAligner solves this by extracting dynamic content and moving it to the end of the message, keeping the prefix stable: ``` Before: "You are helpful. Current Date: 2025-04-06" <- changes daily, no cache hit After: "You are helpful." <- stable prefix, cache hit "[Context: Current Date: 2025-04-06]" <- dynamic part moved to tail ``` The prefix stays byte-identical across requests, so the provider's KV cache can reuse previously computed attention states. Provider-specific strategies [#provider-specific-strategies] Each LLM provider implements caching differently. Headroom applies the optimal strategy for each. Anthropic [#anthropic] Anthropic supports explicit `cache_control` blocks that mark content as cacheable. Cached input tokens cost **90% less** than regular input tokens. Headroom automatically inserts `cache_control` breakpoints at the right positions in your messages so that stable prefixes (system prompts, early conversation turns) are cached across requests. | Metric | Value | | ------------------- | --------------------------- | | Cache read discount | 90% off input price | | Cache write cost | 25% premium on first write | | Cache TTL | 5 minutes (extended on hit) | OpenAI [#openai] OpenAI uses automatic **prefix caching** -- if consecutive requests share the same message prefix, the provider reuses cached KV states. No explicit API markers are needed, but the prefix must be byte-identical. CacheAligner ensures your prefixes remain stable by extracting dynamic content, which is the key requirement for OpenAI prefix caching to work. | Metric | Value | | ------------------- | ------------------------ | | Cache read discount | 50% off input price | | Activation | Automatic (prefix match) | | Min prefix length | 1024 tokens | Google [#google] Google provides the **CachedContent API**, which lets you explicitly cache large context (system instructions, documents, tools) and reference it across requests. Cached tokens cost **75% less**. Headroom can manage CachedContent lifecycle automatically, creating and refreshing cached content objects as needed. 
| Metric | Value | | ------------------- | ---------------------------------- | | Cache read discount | 75% off input price | | Mechanism | Explicit CachedContent API objects | | Min cache size | 32,768 tokens | Configuration [#configuration] ```ts twoslash import { compress } from "headroom-ai"; import type { CacheAlignerConfig, CacheOptimizerConfig, HeadroomConfig, } from "headroom-ai"; // CacheAligner: stabilize prefixes for cache hits const cacheAligner: CacheAlignerConfig = { enabled: true, datePatterns: [ "Today is \\w+ \\d+, \\d{4}", "Current time: .*", ], normalizeWhitespace: true, collapseBlankLines: true, }; // CacheOptimizer: provider-level caching const cacheOptimizer: CacheOptimizerConfig = { enabled: true, autoDetectProvider: true, // Detect Anthropic/OpenAI/Google automatically minCacheableTokens: 1024, }; // Full configuration const config: HeadroomConfig = { cacheAligner, cacheOptimizer, }; // Compress with cache optimization const result = await compress(messages, { model: "claude-sonnet-4-20250514", config, }); ``` ```python from headroom import HeadroomClient, OpenAIProvider, AnthropicProvider, GoogleProvider from headroom.transforms import CacheAlignerConfig from openai import OpenAI # CacheAligner configuration aligner_config = CacheAlignerConfig( enabled=True, dynamic_patterns=[ r"Today is \w+ \d+, \d{4}", r"Current time: .*", r"Session ID: [a-f0-9-]+", ], ) # Provider-specific cache settings # OpenAI: prefix caching (automatic, just keep prefixes stable) client = HeadroomClient( original_client=OpenAI(), provider=OpenAIProvider(enable_prefix_caching=True), enable_cache_optimizer=True, ) # Anthropic: cache_control blocks (90% read discount) from anthropic import Anthropic client = HeadroomClient( original_client=Anthropic(), provider=AnthropicProvider(enable_cache_control=True), enable_cache_optimizer=True, ) # Google: CachedContent API (75% read discount) client = HeadroomClient( original_client=google_client, provider=GoogleProvider(enable_context_caching=True), enable_cache_optimizer=True, ) ``` How savings compound [#how-savings-compound] CacheAligner and provider caching work together with Headroom's compression transforms: 1. **SmartCrusher** reduces token count by 70-90% 2. **CacheAligner** stabilizes prefixes so provider caches hit 3. **Provider caching** discounts the remaining input tokens by 50-90% For example, with Anthropic: * 100K input tokens compressed to 20K (80% savings from SmartCrusher) * 18K of those 20K hit the cache (90% cache read discount) * Effective cost: 2K full-price tokens + 18K at 10% = 3.8K equivalent tokens * **Total savings: 96.2%** compared to the original 100K tokens # Reversible Compression (CCR) (/docs/ccr) Headroom's CCR (Compress-Cache-Retrieve) architecture makes compression **reversible**. When content is compressed, the original data is cached locally. If the LLM needs the full data, it retrieves it instantly. Unlike traditional lossy compression, CCR guarantees that every piece of original data remains accessible. You get 70-90% token savings with zero risk of permanent data loss. The problem with traditional compression [#the-problem-with-traditional-compression] Traditional compression forces a difficult tradeoff: * **Aggressive compression** risks losing data the LLM needs * **Conservative compression** misses out on token savings CCR eliminates this tradeoff entirely. Compress aggressively, retrieve on demand. 
Architecture [#architecture] CCR flows through four phases: ``` TOOL OUTPUT (1000 items) -> SmartCrusher compresses to 20 items -> Original cached with hash=abc123 -> Retrieval tool injected into context LLM PROCESSING Option A: LLM solves task with 20 items -> Done (90% savings) Option B: LLM calls headroom_retrieve(hash=abc123) -> Response Handler returns full data automatically ``` Phase 1: Compression Store [#phase-1-compression-store] When SmartCrusher compresses tool output: 1. The original content is stored in an LRU cache 2. A hash key is generated for retrieval 3. A marker is added to the compressed output: ``` [1000 items compressed to 20. Retrieve more: hash=abc123] ``` Phase 2: Tool Injection [#phase-2-tool-injection] Headroom injects a `headroom_retrieve` tool into the LLM's available tools: ```json { "name": "headroom_retrieve", "description": "Retrieve original uncompressed data from Headroom cache", "parameters": { "hash": "The hash key from the compression marker", "query": "Optional: search within the cached data" } } ``` The LLM sees this tool alongside your application's tools and can call it whenever the compressed data is insufficient. Phase 3: Response Handler [#phase-3-response-handler] When the LLM calls `headroom_retrieve`: 1. The Response Handler intercepts the tool call 2. Data is retrieved from the local cache (around 1ms) 3. The result is added to the conversation 4. The API call continues automatically The client never sees CCR tool calls -- they are handled transparently by Headroom. Phase 4: Context Tracker [#phase-4-context-tracker] Across multiple turns, the Context Tracker maintains awareness of all compressed content: 1. Remembers what was compressed in earlier turns 2. Analyzes new queries for relevance to compressed content 3. Proactively expands relevant data before the LLM asks ``` Turn 1: User searches for files -> 500 files compressed to 15, cached (hash=abc123) -> LLM answers with 15 files Turn 5: User asks "What about the auth middleware?" -> Context Tracker detects "auth" may match cached content -> Proactively expands compressed data -> LLM finds auth_middleware.py in the full list ``` BM25 search within compressed data [#bm25-search-within-compressed-data] The LLM does not have to retrieve everything. It can search within compressed data using the optional `query` parameter: ```json { "name": "headroom_retrieve", "parameters": { "hash": "abc123", "query": "authentication errors" } } ``` This runs a BM25 search over the cached items, returning only the relevant subset instead of the full original payload. Retrieving originals [#retrieving-originals] CCR works automatically through the proxy, but you can also retrieve cached data programmatically: ```ts twoslash import { compress } from "headroom-ai"; import type { CCRConfig } from "headroom-ai"; // CCR is enabled by default when compressing through the proxy. 
const result = await compress(messages, { model: "gpt-4o", }); // Access compressed messages — CCR markers are embedded automatically console.log(result.messages); // CCR configuration options const ccrConfig: CCRConfig = { enabled: true, injectTool: true, // Inject headroom_retrieve tool injectRetrievalMarker: true, // Add retrieval markers to compressed output feedbackEnabled: true, // Learn from retrieval patterns storeMaxEntries: 1000, // Max cached items storeTtlSeconds: 3600, // Cache TTL }; ``` ```python from headroom import HeadroomClient, OpenAIProvider from openai import OpenAI client = HeadroomClient( original_client=OpenAI(), provider=OpenAIProvider(), default_mode="optimize", ) # CCR happens automatically during chat completions. # The LLM calls headroom_retrieve when it needs more data. response = client.chat.completions.create( model="gpt-4o", messages=messages, ) # CCR is enabled by default. To disable: # headroom proxy --no-ccr-responses # To disable proactive expansion: # headroom proxy --no-ccr-expansion ``` Message-level CCR [#message-level-ccr] CCR is not limited to tool outputs. When IntelligentContext drops low-importance messages to fit the context budget, those messages are also stored in CCR: ``` 100-message conversation (50K tokens) -> IntelligentContext scores messages by importance -> Drops 60 low-scoring messages -> Dropped messages cached with hash=def456 -> Marker inserted: "60 messages dropped, retrieve: def456" ``` The marker includes the CCR reference so the LLM can recover earlier context: ``` [Earlier context compressed: 60 message(s) dropped by importance scoring. Full content available via ccr_retrieve tool with reference 'def456'.] ``` When users retrieve dropped messages via CCR, TOIN learns those message patterns are important and scores them higher in future sessions -- improving drop decisions across all users. CCR-enabled components [#ccr-enabled-components] | Component | What it compresses | CCR integration | | ---------------------- | -------------------------------- | --------------------------------------------- | | **SmartCrusher** | JSON arrays (tool outputs) | Stores original array, marker includes hash | | **ContentRouter** | Code, logs, search results, text | Stores original content by strategy | | **IntelligentContext** | Messages (conversation turns) | Stores dropped messages, marker includes hash | Why CCR matters [#why-ccr-matters] | Approach | Risk | Savings | | ----------------------- | ----------------- | ------- | | No compression | None | 0% | | Traditional compression | Data loss | 70-90% | | CCR compression | None (reversible) | 70-90% | CCR gives you the savings of aggressive compression with zero risk. The LLM can always retrieve the original data if needed. # Code Compression (/docs/code-compression) Headroom's CodeAwareCompressor uses tree-sitter to parse source code into an AST, then selectively compresses function bodies while preserving the structural elements that LLMs need -- imports, signatures, type annotations, and error handlers. Why AST-Aware Compression? [#why-ast-aware-compression] Naive truncation breaks code. Cutting a function in half leaves invalid syntax that confuses the LLM. 
CodeAwareCompressor guarantees: * **Syntax validity** -- output always parses correctly * **Structural preservation** -- imports, signatures, types, decorators are kept intact * **Lightweight** -- \~50MB (tree-sitter) vs \~1GB for LLMLingua Supported Languages [#supported-languages] | Tier | Languages | Support Level | | ------ | ------------------------------ | ------------------------- | | Tier 1 | Python, JavaScript, TypeScript | Full AST analysis | | Tier 2 | Go, Rust, Java, C, C++ | Function body compression | What Gets Preserved vs Compressed [#what-gets-preserved-vs-compressed] **Always preserved:** * Import statements * Function and method signatures * Class definitions * Type annotations * Decorators * Error handlers (`try`/`except`, `try`/`catch`) **Compressed:** * Function bodies (implementations) * Comments (unless configured to preserve) * Verbose docstrings (configurable: full, first line, or removed) Example [#example] ```python from headroom.transforms import CodeAwareCompressor compressor = CodeAwareCompressor() code = ''' import os from typing import List def process_items(items: List[str]) -> List[str]: """Process a list of items.""" results = [] for item in items: if not item: continue processed = item.strip().lower() results.append(processed) return results ''' result = compressor.compress(code, language="python") print(result.compressed) # import os # from typing import List # # def process_items(items: List[str]) -> List[str]: # """Process a list of items.""" # results = [] # for item in items: # # ... (5 lines compressed) # pass print(f"Compression: {result.compression_ratio:.0%}") # ~55% print(f"Syntax valid: {result.syntax_valid}") # True ``` Configuration [#configuration] ```python from headroom.transforms import CodeAwareCompressor, CodeCompressorConfig, DocstringMode config = CodeCompressorConfig( preserve_imports=True, # Always keep imports preserve_signatures=True, # Always keep function signatures preserve_type_annotations=True, # Keep type hints preserve_error_handlers=True, # Keep try/except blocks preserve_decorators=True, # Keep decorators docstring_mode=DocstringMode.FIRST_LINE, # FULL, FIRST_LINE, REMOVE target_compression_rate=0.2, # Keep 20% of tokens max_body_lines=5, # Lines to keep per function body min_tokens_for_compression=100, # Skip small content language_hint=None, # Auto-detect if None fallback_to_llmlingua=True, # Use LLMLingua for unknown langs ) compressor = CodeAwareCompressor(config) result = compressor.compress(code) ``` Configuration Options [#configuration-options] | Option | Default | Description | | ---------------------------- | ------------ | -------------------------------------------------------- | | `preserve_imports` | `True` | Keep all import statements | | `preserve_signatures` | `True` | Keep function/method signatures | | `preserve_type_annotations` | `True` | Keep type hints | | `preserve_error_handlers` | `True` | Keep try/except blocks | | `preserve_decorators` | `True` | Keep decorators | | `docstring_mode` | `FIRST_LINE` | How to handle docstrings: `FULL`, `FIRST_LINE`, `REMOVE` | | `target_compression_rate` | `0.2` | Fraction of tokens to keep (0.2 = keep 20%) | | `max_body_lines` | `5` | Max lines to keep per function body | | `min_tokens_for_compression` | `100` | Skip files smaller than this | | `language_hint` | `None` | Override language detection | | `fallback_to_llmlingua` | `True` | Use LLMLingua for unsupported languages | Before and After [#before-and-after] ```python # Before (full source file) def 
process_data(items: List[str]) -> Dict[str, int]: """Process items and count occurrences.""" result = {} for item in items: item = item.strip().lower() if item in result: result[item] += 1 else: result[item] = 1 return result # After (signature preserved, body compressed) def process_data(items: List[str]) -> Dict[str, int]: """Process items and count occurrences.""" result = {} for item in items: # ... (5 lines compressed) pass ``` The LLM sees the function's purpose, its input/output types, and the general approach -- enough to reason about the code without needing every implementation line. Installation [#installation] ```bash # Install tree-sitter language pack pip install "headroom-ai[code]" ``` Memory Management [#memory-management] Tree-sitter parsers are lazy-loaded and cached. You can free memory when done: ```python from headroom.transforms import is_tree_sitter_available, unload_tree_sitter # Check if tree-sitter is installed print(is_tree_sitter_available()) # True # Free memory when done unload_tree_sitter() ``` Performance [#performance] | Metric | Value | | --------------- | ---------------------------- | | Compression | 40-70% token reduction | | Speed | \~10-50ms per file | | Memory | \~50MB (tree-sitter parsers) | | Syntax validity | Guaranteed | When you use the Headroom proxy or call `compress()`, source code is automatically detected and routed to CodeAwareCompressor. Direct usage gives you control over compression settings per language. # Community Savings (/docs/community-savings) Real-time aggregate metrics from Headroom proxy instances worldwide. All data is anonymous — only token counts, compression ratios, and cost estimates are collected. [Opt out anytime](https://github.com/chopratejas/headroom/blob/main/headroom/telemetry/beacon.py) with `HEADROOM_TELEMETRY=off`. Overview [#overview] Savings Over Time [#savings-over-time] Top Savings by Instance [#top-savings-by-instance] Instance Details [#instance-details] # Configuration (/docs/configuration) Headroom can be configured via the SDK constructor, proxy command line, environment variables, or per-request overrides. 
Modes [#modes] | Mode | Behavior | Use Case | | ---------- | -------------------------------------- | ------------------------------------------- | | `audit` | Observes and logs, no modifications | Production monitoring, baseline measurement | | `optimize` | Applies safe, deterministic transforms | Production optimization | | `simulate` | Returns plan without API call | Testing, cost estimation | SDK Configuration [#sdk-configuration] ```ts twoslash import { HeadroomClient } from 'headroom-ai'; // Reads from HEADROOM_BASE_URL and HEADROOM_API_KEY automatically const client = new HeadroomClient(); // Or configure explicitly const explicit = new HeadroomClient({ baseUrl: 'http://localhost:8787', apiKey: 'your-api-key', timeout: 30_000, fallback: true, retries: 2, }); ``` ```python from headroom import HeadroomClient, OpenAIProvider from openai import OpenAI client = HeadroomClient( original_client=OpenAI(), provider=OpenAIProvider(), # Mode: "audit" (observe only) or "optimize" (apply transforms) default_mode="optimize", # Enable provider-specific cache optimization enable_cache_optimizer=True, # Enable query-level semantic caching enable_semantic_cache=False, # Override default context limits per model model_context_limits={ "gpt-4o": 128000, "gpt-4o-mini": 128000, }, # Database location (defaults to temp directory) # store_url="sqlite:////absolute/path/to/headroom.db", ) ``` Per-Request Overrides [#per-request-overrides] Override configuration for individual requests: ```ts twoslash import { compress } from 'headroom-ai'; const result = await compress(messages, { model: 'gpt-4o', tokenBudget: 100_000, timeout: 15_000, }); ``` ```python response = client.chat.completions.create( model="gpt-4o", messages=[...], # Override mode for this request headroom_mode="audit", # Reserve more tokens for output headroom_output_buffer_tokens=8000, # Keep last N turns (don't compress) headroom_keep_turns=5, # Skip compression for specific tools headroom_tool_profiles={ "important_tool": {"skip_compression": True} }, ) ``` SmartCrusher Configuration [#smartcrusher-configuration] Fine-tune JSON compression behavior: ```python from headroom.transforms import SmartCrusherConfig config = SmartCrusherConfig( # Maximum items to keep after compression max_items_after_crush=15, # Minimum tokens before applying compression min_tokens_to_crush=200, # Relevance scoring tier: "bm25" (fast) or "embedding" (accurate) relevance_tier="bm25", # Always keep items with these field values preserve_fields=["error", "warning", "failure"], ) ``` CacheAligner Configuration [#cachealigner-configuration] Control prefix stabilization for provider cache hit rates: ```python from headroom.transforms import CacheAlignerConfig config = CacheAlignerConfig( # Enable/disable cache alignment enabled=True, # Patterns to extract from system prompt dynamic_patterns=[ r"Today is \w+ \d+, \d{4}", r"Current time: .*", ], ) ``` RollingWindow Configuration [#rollingwindow-configuration] Control context window management when messages exceed model limits: ```python from headroom.transforms import RollingWindowConfig config = RollingWindowConfig( # Minimum turns to always keep min_keep_turns=3, # Reserve tokens for output output_buffer_tokens=4000, # Drop oldest tool outputs first prefer_drop_tool_outputs=True, ) ``` IntelligentContext Configuration [#intelligentcontext-configuration] Semantic-aware context management with importance scoring: ```python from headroom.config import IntelligentContextConfig, ScoringWeights # Customize scoring weights 
(must sum to 1.0, or will be normalized) weights = ScoringWeights( recency=0.20, # Newer messages score higher semantic_similarity=0.20, # Similarity to recent context toin_importance=0.25, # TOIN-learned retrieval patterns error_indicator=0.15, # TOIN-learned error field types forward_reference=0.15, # Messages referenced by later messages token_density=0.05, # Information density ) config = IntelligentContextConfig( enabled=True, keep_system=True, # Never drop system messages keep_last_turns=2, # Protect last N user turns output_buffer_tokens=4000, # Reserve for model output use_importance_scoring=True, scoring_weights=weights, toin_integration=True, # Use TOIN patterns if available recency_decay_rate=0.1, # Exponential decay lambda compress_threshold=0.1, # Try compression first if <10% over budget ) ``` Scoring Weights [#scoring-weights] Weights are automatically normalized to sum to 1.0: ```python weights = ScoringWeights(recency=1.0, toin_importance=1.0) normalized = weights.normalized() # recency=0.5, toin_importance=0.5, others=0.0 ``` Proxy Configuration [#proxy-configuration] Command Line Options [#command-line-options] ```bash headroom proxy \ --port 8787 \ # Port to listen on --host 0.0.0.0 \ # Host to bind to --budget 10.00 \ # Daily budget limit in USD --log-file headroom.jsonl # Log file path ``` Feature Flags [#feature-flags] ```bash # Disable optimization (passthrough mode) headroom proxy --no-optimize # Disable semantic caching headroom proxy --no-cache # Enable LLMLingua ML compression headroom proxy --llmlingua headroom proxy --llmlingua --llmlingua-device cuda --llmlingua-rate 0.4 ``` Environment Variables [#environment-variables] | Variable | Description | Default | | ----------------------- | ----------------------------------------------- | -------------------------------- | | `HEADROOM_LOG_LEVEL` | Logging level | `INFO` | | `HEADROOM_STORE_URL` | Database URL | temp directory | | `HEADROOM_DEFAULT_MODE` | Default mode | `optimize` | | `HEADROOM_MODEL_LIMITS` | Custom model config (JSON string or file path) | -- | | `HEADROOM_BASE_URL` | Base URL of the Headroom proxy (TypeScript SDK) | `http://localhost:8787` | | `HEADROOM_API_KEY` | API key for Headroom Cloud authentication | -- | | `HEADROOM_SAVINGS_PATH` | Override persistent savings file location | `~/.headroom/proxy_savings.json` | | `HEADROOM_TELEMETRY` | Set to `off` to disable anonymous telemetry | `on` | Custom Model Configuration [#custom-model-configuration] Configure context limits and pricing for new or custom models: ```json { "anthropic": { "context_limits": { "claude-4-opus-20250301": 200000, "claude-custom-finetune": 128000 }, "pricing": { "claude-4-opus-20250301": { "input": 15.00, "output": 75.00, "cached_input": 1.50 } } }, "openai": { "context_limits": { "gpt-5": 256000, "ft:gpt-4o:my-org": 128000 } } } ``` Save as `~/.headroom/models.json`, or set `HEADROOM_MODEL_LIMITS` to a JSON string or file path. Settings are resolved in this order (later overrides earlier): 1. Built-in defaults 2. `~/.headroom/models.json` config file 3. `HEADROOM_MODEL_LIMITS` environment variable 4. 
SDK constructor arguments Pattern-Based Inference [#pattern-based-inference] Unknown models are automatically inferred from naming patterns: | Pattern | Inferred Settings | | ------------ | ------------------------------------- | | `*opus*` | 200K context, Opus-tier pricing | | `*sonnet*` | 200K context, Sonnet-tier pricing | | `*haiku*` | 200K context, Haiku-tier pricing | | `gpt-4o*` | 128K context, GPT-4o pricing | | `o1*`, `o3*` | 200K context, reasoning model pricing | Provider-Specific Settings [#provider-specific-settings] ```python from headroom import OpenAIProvider provider = OpenAIProvider( enable_prefix_caching=True, ) ``` ```python from headroom import AnthropicProvider provider = AnthropicProvider( enable_cache_control=True, ) ``` ```python from headroom import GoogleProvider provider = GoogleProvider( enable_context_caching=True, ) ``` Tool Profiles [#tool-profiles] Skip or customize compression for specific tools: ```python response = client.chat.completions.create( model="gpt-4o", messages=messages, headroom_tool_profiles={ "important_tool": {"skip_compression": True}, "search_tool": {"max_items_after_crush": 25}, }, ) ``` Configuration Precedence [#configuration-precedence] Settings are applied in this order (later overrides earlier): 1. Default values 2. Environment variables 3. SDK constructor arguments 4. Per-request overrides Validation [#validation] Validate your configuration at startup: ```python result = client.validate_setup() if not result["valid"]: print("Configuration issues:") for issue in result["issues"]: print(f" - {issue}") ``` # Context Management (/docs/context-management) When conversations grow beyond a model's context window, Headroom decides which messages to keep and which to drop. Instead of naively removing the oldest messages, **IntelligentContext** scores every message by learned importance and drops the least valuable ones first. IntelligentContext [#intelligentcontext] IntelligentContext is a message-level compressor. It analyzes your conversation, assigns an importance score to each message, and removes low-scoring messages until the conversation fits within the token budget. Dropped messages are not lost -- they are stored in [CCR](/docs/ccr) for on-demand retrieval by the LLM. ``` 100-message conversation (50K tokens) with a 32K budget -> Score each message by importance -> Drop 60 lowest-scoring messages -> Cache dropped messages in CCR (hash=def456) -> Insert marker: "60 messages dropped, retrieve: def456" -> Final context: 40 messages within budget ``` Scoring weights [#scoring-weights] Each message receives a weighted score from six factors: | Weight | Default | Description | | --------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------- | | `recency` | 0.20 | Exponential decay from the end of the conversation. Recent messages score higher. | | `semantic_similarity` | 0.20 | Embedding cosine similarity to recent context. Messages related to the current topic score higher. | | `toin_importance` | 0.25 | TOIN retrieval rate -- messages matching patterns that users frequently retrieve via CCR are scored higher. Learned across all users. | | `error_indicator` | 0.15 | TOIN field semantics error detection. Messages containing error patterns (learned, not hardcoded) are preserved. | | `forward_reference` | 0.15 | Count of later messages that reference this one. Messages that other messages depend on are kept. 
| | `token_density` | 0.05 | Unique tokens divided by total tokens. Dense, information-rich messages score higher than repetitive ones. | Error detection does not rely on keyword matching like "error" or "fail". Instead, it uses TOIN's learned `field_semantics.inferred_type` to identify error-bearing messages -- this adapts to your specific data patterns across sessions and users. Weights are automatically normalized to sum to 1.0, so you can set relative values without worrying about exact proportions. Rolling window fallback [#rolling-window-fallback] If IntelligentContext is disabled or scoring data is unavailable, Headroom falls back to a **rolling window** strategy: * Drop the oldest messages first * Always keep the system prompt * Always keep the last N user/assistant turns * Drop tool calls and their responses as atomic pairs (no orphaned tool data) This provides a safe baseline that works without any learned data. Protection rules [#protection-rules] Headroom enforces several protections to ensure model output quality: Output buffer reservation [#output-buffer-reservation] A configurable number of tokens is reserved for the model's response. The context budget is calculated as: ``` context_budget = model_context_limit - output_buffer_tokens ``` This prevents the input from consuming the entire context window and leaving no room for the model to respond. System message protection [#system-message-protection] System messages are never dropped. They contain critical instructions, persona definitions, and tool descriptions that the model needs throughout the conversation. Turn protection [#turn-protection] The last N user/assistant turns are always preserved, ensuring the model has immediate conversational context. By default, the last 2 turns are protected. 
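To make the fallback concrete, here is a minimal sketch of the rolling-window strategy, assuming OpenAI-style message dicts and a crude character-based token estimate (the real pipeline uses model-specific tokenizers):

```python
# Illustrative sketch of the rolling-window fallback, not Headroom's implementation.
def estimate_tokens(message: dict) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(str(message.get("content", ""))) // 4)

def rolling_window(messages: list[dict], context_limit: int,
                   output_buffer_tokens: int = 4000, keep_last_turns: int = 2) -> list[dict]:
    budget = context_limit - output_buffer_tokens            # reserve room for the response
    system = [m for m in messages if m["role"] == "system"]  # never dropped
    rest = [m for m in messages if m["role"] != "system"]

    # Protect the last N user/assistant turns (~2 messages per turn).
    protected = rest[-keep_last_turns * 2:] if keep_last_turns else []
    droppable = rest[: len(rest) - len(protected)]

    def total(msgs: list[dict]) -> int:
        return sum(estimate_tokens(m) for m in msgs)

    # Drop oldest first until the conversation fits the budget.
    while droppable and total(system + droppable + protected) > budget:
        dropped = droppable.pop(0)
        # Drop tool calls and their responses as a pair -- no orphaned tool data.
        if dropped.get("tool_calls"):
            while droppable and droppable[0].get("role") == "tool":
                droppable.pop(0)

    return system + droppable + protected
```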
Configuration [#configuration] ```ts twoslash import { compress } from "headroom-ai"; import type { IntelligentContextConfig, ScoringWeights, RollingWindowConfig, HeadroomConfig, } from "headroom-ai"; // Scoring weights (normalized automatically) const scoringWeights: ScoringWeights = { recency: 0.20, semanticSimilarity: 0.20, toinImportance: 0.25, errorIndicator: 0.15, forwardReference: 0.15, tokenDensity: 0.05, }; // IntelligentContext configuration const intelligentContext: IntelligentContextConfig = { enabled: true, keepSystem: true, keepLastTurns: 2, outputBufferTokens: 4000, useImportanceScoring: true, scoringWeights, toinIntegration: true, recencyDecayRate: 0.1, compressThreshold: 0.1, }; // Rolling window fallback const rollingWindow: RollingWindowConfig = { enabled: true, keepSystem: true, keepLastTurns: 3, outputBufferTokens: 4000, }; // Full configuration const config: HeadroomConfig = { intelligentContext, rollingWindow, }; const result = await compress(messages, { model: "gpt-4o", config, }); console.log(`Compressed: ${result.tokensBefore} -> ${result.tokensAfter}`); ``` ```python from headroom import HeadroomClient, OpenAIProvider from headroom.config import IntelligentContextConfig, ScoringWeights from openai import OpenAI # Customize scoring weights weights = ScoringWeights( recency=0.20, semantic_similarity=0.20, toin_importance=0.25, error_indicator=0.15, forward_reference=0.15, token_density=0.05, ) context_config = IntelligentContextConfig( enabled=True, keep_system=True, # Never drop system messages keep_last_turns=2, # Protect last 2 user turns output_buffer_tokens=4000, # Reserve for model output use_importance_scoring=True, scoring_weights=weights, toin_integration=True, # Use TOIN patterns recency_decay_rate=0.1, # Exponential decay lambda compress_threshold=0.1, # Try compression first if <10% over budget ) client = HeadroomClient( original_client=OpenAI(), provider=OpenAIProvider(), default_mode="optimize", ) # Per-request overrides response = client.chat.completions.create( model="gpt-4o", messages=messages, headroom_output_buffer_tokens=8000, # More room for long responses headroom_keep_turns=5, # Protect last 5 turns ) ``` How scoring improves over time [#how-scoring-improves-over-time] IntelligentContext integrates with TOIN (Tool-Output Intelligence Network) to learn from real usage: 1. Messages are dropped based on current scores 2. Dropped messages are stored in CCR 3. If the LLM retrieves a dropped message, TOIN records that pattern 4. Future conversations score similar message patterns higher 5. Drop accuracy improves across all users, not just within one session This feedback loop means the system gets smarter the more it is used. Error messages that users frequently need are automatically preserved, while verbose success messages that nobody retrieves are dropped more aggressively. # Error Handling (/docs/errors) Headroom provides explicit exceptions for debugging, with a core safety guarantee: **compression failures never break your LLM calls**. If compression fails, the original content passes through unchanged. 
Error Hierarchy [#error-hierarchy] ``` HeadroomError (base class) +-- HeadroomConnectionError # Cannot reach proxy +-- HeadroomAuthError # 401 from proxy +-- HeadroomCompressError # Compression failed (with statusCode) +-- ConfigurationError # Invalid configuration +-- ProviderError # Provider issues +-- StorageError # Storage failures +-- TokenizationError # Token counting failed +-- CacheError # Cache operations failed +-- ValidationError # Validation failures +-- TransformError # Transform execution failed ``` ```ts twoslash import { HeadroomError, HeadroomConnectionError, HeadroomAuthError, HeadroomCompressError, ConfigurationError, ProviderError, mapProxyError, } from 'headroom-ai'; ``` ``` HeadroomError (base class) +-- ConfigurationError # Invalid configuration +-- ProviderError # Provider issues (unknown model, etc.) +-- StorageError # Database/storage failures +-- CompressionError # Compression failures (rare) +-- ValidationError # Setup validation failures ``` ```python from headroom import ( HeadroomError, ConfigurationError, ProviderError, StorageError, CompressionError, ValidationError, ) ``` Catching Errors [#catching-errors] ```ts twoslash import { compress, HeadroomConnectionError, HeadroomAuthError, HeadroomCompressError, HeadroomError } from 'headroom-ai'; try { const result = await compress(messages, { model: 'gpt-4o' }); } catch (e) { if (e instanceof HeadroomConnectionError) { console.error('Cannot reach proxy:', e.message); } else if (e instanceof HeadroomAuthError) { console.error('Auth failed:', e.message); } else if (e instanceof HeadroomCompressError) { console.error(`Compress failed (${e.statusCode}):`, e.message); } else if (e instanceof HeadroomError) { console.error('Headroom error:', e.message, e.details); } } ``` ```python from headroom import ( HeadroomClient, HeadroomError, ConfigurationError, StorageError, ) try: client = HeadroomClient(...) response = client.chat.completions.create(...) except ConfigurationError as e: print(f"Config issue: {e}") print(f"Details: {e.details}") except StorageError as e: print(f"Storage issue: {e}") # Headroom continues to work, just without metrics persistence except HeadroomError as e: print(f"Headroom error: {e}") ``` Error Types in Detail [#error-types-in-detail] ConfigurationError [#configurationerror] Raised when configuration is invalid. ```ts twoslash import { ConfigurationError } from 'headroom-ai'; // ConfigurationError is thrown when the proxy returns // a configuration_error type in its error response ``` ```python try: client = HeadroomClient( original_client=OpenAI(), provider=OpenAIProvider(), default_mode="invalid_mode", # Will raise ConfigurationError ) except ConfigurationError as e: print(f"Config error: {e}") print(f"Field: {e.details.get('field')}") ``` ProviderError [#providererror] Raised for provider-specific issues (unknown model, API error, token counting failure). ```python try: response = client.chat.completions.create( model="unknown-model-xyz", messages=[...], ) except ProviderError as e: print(f"Provider error: {e}") print(f"Provider: {e.details.get('provider')}") ``` StorageError [#storageerror] Raised when database operations fail. Storage errors do not affect core functionality -- the application can continue without historical metrics. ```python try: metrics = client.get_metrics() except StorageError as e: metrics = [] # Continue without historical metrics ``` CompressionError [#compressionerror] Raised when compression fails (rare). 
In practice, compression errors are caught internally and the original content passes through unchanged. This exception is only raised in strict mode. HeadroomConnectionError (TypeScript) [#headroomconnectionerror-typescript] Raised when the TypeScript SDK cannot connect to the Headroom proxy. ```ts twoslash import { compress, HeadroomConnectionError } from 'headroom-ai'; try { await compress(messages, { model: 'gpt-4o' }); } catch (e) { if (e instanceof HeadroomConnectionError) { console.error('Is the proxy running? Start with: headroom proxy'); } } ``` Proxy Error Mapping [#proxy-error-mapping] The TypeScript SDK automatically maps proxy error responses to the correct error class: | HTTP Status | Proxy Error Type | TypeScript Class | | ----------- | --------------------- | ----------------------- | | 401 | -- | `HeadroomAuthError` | | 4xx/5xx | `configuration_error` | `ConfigurationError` | | 4xx/5xx | `provider_error` | `ProviderError` | | 4xx/5xx | `storage_error` | `StorageError` | | 4xx/5xx | `tokenization_error` | `TokenizationError` | | 4xx/5xx | `validation_error` | `ValidationError` | | 4xx/5xx | `transform_error` | `TransformError` | | 4xx/5xx | (other) | `HeadroomCompressError` | The `mapProxyError()` function handles this mapping: ```ts twoslash import { mapProxyError } from 'headroom-ai'; const error = mapProxyError(400, 'configuration_error', 'Invalid mode'); // Returns a ConfigurationError instance ``` Error Details [#error-details] All Headroom exceptions include a `details` dict/object with additional context: ```ts twoslash import { HeadroomError } from 'headroom-ai'; // HeadroomError.details is Record | undefined // HeadroomCompressError also has .statusCode and .errorType ``` ```python try: client = HeadroomClient(...) except HeadroomError as e: print(f"Error: {e}") print(f"Type: {type(e).__name__}") print(f"Details: {e.details}") # Details might include: # - field: which config field caused the error # - provider: which provider was involved # - model: which model was requested # - original_error: underlying exception ``` Safety Guarantee [#safety-guarantee] If compression fails, the original content passes through unchanged. Your LLM calls never fail due to Headroom: ```python messages = [ {"role": "tool", "content": "malformed json {{{"} ] # This will NOT raise an exception # The malformed content passes through unchanged response = client.chat.completions.create( model="gpt-4o", messages=messages, ) ``` Best Practices [#best-practices] 1. **Catch specific exceptions** rather than broad `Exception` to avoid hiding real bugs 2. **Let StorageError pass** -- storage errors do not affect core compression functionality 3. **Validate on startup** with `client.validate_setup()` to catch configuration issues early 4. **Enable logging** at WARNING level to see when compression is skipped ```python import logging logging.basicConfig(level=logging.WARNING) # WARNING:headroom.transforms.smart_crusher:Skipping compression: invalid JSON ``` # Failure Learning (/docs/failure-learning) `headroom learn` analyzes past coding agent sessions, finds what went wrong, correlates each failure with what eventually worked, and writes specific project-level learnings that prevent the same mistakes next session. 
Quick Start [#quick-start] ```bash # See recommendations for current project (dry-run, no changes) headroom learn # Write recommendations to CLAUDE.md and MEMORY.md headroom learn --apply # Analyze a specific project headroom learn --project ~/my-project --apply # Analyze all projects headroom learn --all --apply ``` Success Correlation [#success-correlation] The core innovation. Instead of cataloging failures ("Read failed 5 times"), Headroom finds what the model did to **fix** each failure: * **Failed**: `Read axion-formats/src/main/java/.../FirstClassEntity.java` * **Then succeeded**: `Read axion-scala-common/src/main/scala/.../FirstClassEntity.scala` * **Learning**: "`FirstClassEntity` is at `axion-scala-common/`, not `axion-formats/`" This produces specific, actionable corrections -- not generic advice. What It Learns [#what-it-learns] Environment Facts [#environment-facts] Which runtime commands work vs fail. ```markdown ### Environment - **Python**: use `uv run python` (not `python3` -- modules not available outside venv) ``` File Path Corrections [#file-path-corrections] Wrong paths the model keeps guessing, with the correct locations. ```markdown ### File Path Corrections - `axion-common/src/.../AxionSparkConstants.scala` -> actually at `axion-spark-common/src/.../AxionSparkConstants.scala` ``` Search Scope [#search-scope] Which directories to search in (narrow paths fail, broader ones work). ```markdown ### Search Scope - Don't search `axion-model/` -> use `axion/` (the repo root) ``` Command Patterns [#command-patterns] How commands should (and should not) be run. ```markdown ### Command Patterns - **user_prefers_manual**: User rejected gradle 18 times -- show the command, don't execute - **python_runtime**: Use `uv run python` not `python3` (ModuleNotFoundError) ``` Known Large Files [#known-large-files] Files that need `offset`/`limit` with Read. ```markdown ### Known Large Files - `proxy/server.py` (~8000 lines) -- always use offset/limit ``` Where Learnings Go [#where-learnings-go] | Pattern | Destination | Why | | ------------------------------------------------------- | ------------- | ------------------------------------------ | | Environment, paths, search scope, commands, large files | **CLAUDE.md** | Stable project facts, version-controllable | | Missing paths, retry patterns, permissions | **MEMORY.md** | May change, agent-specific | CLAUDE.md lives in your project directory. MEMORY.md lives in `~/.claude/projects/*/memory/`. Marker-Based Updates [#marker-based-updates] Headroom manages a clearly-delimited section in each file: ```markdown ## Headroom Learned Patterns *Auto-generated by `headroom learn` -- do not edit manually* ... ``` On re-run, only the content between markers is replaced. Your existing file content is preserved. Architecture [#architecture] The system is built with an adapter pattern so it can support multiple agent systems: * **Scanners** read tool-specific log formats (e.g., `~/.claude/projects/*.jsonl`) and produce normalized `ToolCall` sequences * **Analyzers** work on `ToolCall` data -- same analysis logic for any agent system * **Writers** output to tool-specific context injection mechanisms (e.g., CLAUDE.md) To add support for a new agent (e.g., Cursor), you write a Scanner that reads its log format and a Writer that outputs to `.cursorrules`. The analyzers stay the same. 
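As a rough sketch of that adapter split (the `ToolCall` concept and the three roles come from the description above; class names, methods, and fields below are hypothetical):

```python
# Hypothetical sketch of the Scanner / Analyzer / Writer roles, for illustration only.
from dataclasses import dataclass
from typing import Iterable, Protocol

@dataclass
class ToolCall:
    tool: str        # e.g. "Read", "Bash"
    target: str      # path or command the tool was invoked with
    succeeded: bool

class Scanner(Protocol):
    def scan(self, project_dir: str) -> Iterable[ToolCall]:
        """Parse an agent's log format into normalized ToolCall records."""
        ...

class Writer(Protocol):
    def write(self, project_dir: str, learnings: list[str]) -> None:
        """Emit learnings to the agent's context file (CLAUDE.md, .cursorrules, ...)."""
        ...

def correlate_failures(calls: list[ToolCall]) -> list[str]:
    """Agent-agnostic analyzer: pair each failure with the next success of the same tool."""
    learnings = []
    for i, call in enumerate(calls):
        if call.succeeded:
            continue
        fix = next((c for c in calls[i + 1:] if c.tool == call.tool and c.succeeded), None)
        if fix:
            learnings.append(f"{call.tool}: `{call.target}` failed -- use `{fix.target}` instead")
    return learnings

def learn(scanner: Scanner, writer: Writer, project_dir: str) -> None:
    writer.write(project_dir, correlate_failures(list(scanner.scan(project_dir))))
```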
CLI Reference [#cli-reference] ```bash headroom learn [OPTIONS] Options: --project PATH Project directory to analyze (default: current directory) --all Analyze all discovered projects --apply Write recommendations (default: dry-run) --claude-dir PATH Path to .claude directory (default: ~/.claude) ``` Real-World Results [#real-world-results] Tested on 67,583 tool calls across 23 projects: | Metric | Value | | ------------------------ | --------------------- | | Failure rate | 7.5% (5,066 failures) | | Corrections extracted | 164 per project (avg) | | Path corrections | 22 (axion project) | | Search scope corrections | 24 (axion project) | | Command patterns learned | 5 (axion project) | # How Compression Works (/docs/how-compression-works) Headroom automatically detects what kind of content you're sending and routes it to the right compressor. You don't need to configure anything -- just call `compress()` and the pipeline handles the rest. The Three-Stage Pipeline [#the-three-stage-pipeline] Every request flows through three stages: ``` ┌──────────────┐ ┌────────────────┐ ┌─────────────────────┐ │ CacheAligner │────>│ ContentRouter │────>│ IntelligentContext │ │ │ │ │ │ │ │ Stabilize │ │ Detect type & │ │ Score messages & │ │ prefix for │ │ route to best │ │ fit within token │ │ cache hits │ │ compressor │ │ budget │ └──────────────┘ └────────────────┘ └─────────────────────┘ ``` 1. **CacheAligner** extracts dynamic content (dates, user context) from your system prompt so the static prefix stays cacheable across requests. 2. **ContentRouter** inspects each tool output and routes it to the optimal compressor -- SmartCrusher for JSON arrays, CodeAwareCompressor for source code, LogCompressor for build output, and so on. 3. **IntelligentContext** scores every message by importance (recency, semantic relevance, error indicators) and drops the lowest-value messages to fit within the model's context window. Content Type Detection [#content-type-detection] The router auto-detects content type by analyzing structure and patterns. No manual hints required. | Content Type | Detection Signal | Compressor | Typical Savings | | --------------- | ------------------------------------------ | ------------------- | --------------- | | JSON arrays | Valid JSON with array elements | SmartCrusher | 70-90% | | Source code | Syntax patterns, indentation, keywords | CodeAwareCompressor | 40-70% | | Search results | `file:line:content` format | SearchCompressor | 80-95% | | Build/test logs | Timestamps, log levels, pytest/npm markers | LogCompressor | 85-95% | | Diffs | Unified diff format | DiffCompressor | 60-80% | | HTML | Tag structure | HTMLCompressor | 50-70% | | Plain text | Fallback | TextCompressor | 60-80% | Quick Start [#quick-start] ```ts twoslash import { compress } from "headroom-ai"; const messages = [ { role: "system" as const, content: "You are a helpful assistant." 
}, { role: "user" as const, content: "Summarize this data" }, { role: "tool" as const, content: '{"results": [...]}', tool_call_id: "call_1" }, ]; const result = await compress(messages); console.log(`Tokens saved: ${result.tokensSaved}`); console.log(`Compression ratio: ${result.compressionRatio}`); ``` ```python from headroom.compression import compress result = compress(content) print(result.compressed) print(f"Saved {result.savings_percentage:.0f}% tokens") ``` Configuring the Compressor [#configuring-the-compressor] ```ts twoslash import { compress } from "headroom-ai"; const result = await compress(messages, { model: "gpt-4o", tokenBudget: 50000, }); console.log(`Before: ${result.tokensBefore} tokens`); console.log(`After: ${result.tokensAfter} tokens`); console.log(`Transforms: ${result.transformsApplied.join(", ")}`); ``` ```python from headroom.compression import UniversalCompressor, UniversalCompressorConfig config = UniversalCompressorConfig( compression_ratio_target=0.5, # Keep 50% of content use_entropy_preservation=True, # Preserve UUIDs, hashes use_magika=True, # ML-based content detection ccr_enabled=True, # Store originals for retrieval ) compressor = UniversalCompressor(config=config) result = compressor.compress(content) print(f"Type: {result.content_type}") print(f"Handler: {result.handler_used}") print(f"Saved: {result.savings_percentage:.0f}%") ``` Structure Preservation [#structure-preservation] Headroom doesn't blindly truncate. It identifies what matters in each content type and preserves it: | Content Type | What's Preserved | What's Compressed | | ------------ | ------------------------------------------------------ | ---------------------------------- | | **JSON** | Keys, brackets, booleans, nulls, short values, UUIDs | Long string values, whitespace | | **Code** | Imports, function signatures, class definitions, types | Function bodies, comments | | **Logs** | Timestamps, log levels, error messages, stack traces | Repeated patterns, verbose details | | **Text** | High-entropy tokens (IDs, hashes), headers | Low-information content | Real Compression Ratios [#real-compression-ratios] | Content Type | Compression | Speed | What's Preserved | | -------------------- | ----------- | ------ | -------------------- | | JSON (large arrays) | 70-90% | \~1ms | All keys, structure | | Source code (Python) | 50-70% | \~10ms | Signatures, imports | | Search results | 80-95% | \~2ms | Relevant matches | | Build logs | 85-95% | \~3ms | Errors, stack traces | | Plain text | 60-80% | \~5ms | High-entropy tokens | Batch Compression [#batch-compression] For multiple contents, batch compression is more efficient: ```python from headroom.compression import UniversalCompressor compressor = UniversalCompressor() contents = [ '{"users": [...]}', 'def hello(): pass', 'Plain text content', ] results = compressor.compress_batch(contents) for result in results: print(f"{result.content_type}: {result.savings_percentage:.0f}% saved") ``` What Happens Under the Hood [#what-happens-under-the-hood] When you call `compress()`, here is the full sequence: 1. **Content detection** -- Magika (ML-based) or pattern matching identifies the content type 2. **Structure extraction** -- A handler extracts a structure mask marking what to preserve 3. **Compression** -- Non-structural content is compressed (SmartCrusher, LLMLingua, or text utilities) 4. **CCR storage** -- If enabled, the original is stored for retrieval when the LLM needs full context The pipeline works out of the box with no configuration. 
All detection, routing, and compression happens automatically. Configuration is available when you need fine-grained control. # Image Compression (/docs/image-compression) Vision models charge by the token, and images are expensive. A single 1024x1024 image costs \~765 tokens on OpenAI. Headroom's image compression uses a trained ML router to analyze your query and automatically select the optimal compression technique, saving 40-90% of image tokens. How It Works [#how-it-works] ``` User uploads image + asks question | [Query Analysis] TrainedRouter (MiniLM from HuggingFace) Classifies: "What animal is this?" -> full_low | [Image Analysis] SigLIP analyzes image properties (has text? complex? fine details?) | [Apply Compression] OpenAI: detail="low" Anthropic: Resize to 512px Google: Resize to 768px | Compressed request to LLM ``` The router is a fine-tuned MiniLM classifier (`chopratejas/technique-router` on HuggingFace) with 93.7% accuracy across 1,157 training examples. Compression Techniques [#compression-techniques] | Technique | Savings | When Used | Example Query | | ----------- | ------- | ----------------------- | -------------------------------------------------- | | `full_low` | \~87% | General understanding | "What is this?", "Describe the scene" | | `preserve` | 0% | Fine details needed | "Count the whiskers", "Read the serial number" | | `crop` | 50-90% | Region-specific queries | "What's in the corner?", "Focus on the background" | | `transcode` | \~99% | Text extraction | "Read the sign", "Transcribe the document" | Quick Start [#quick-start] With Headroom Proxy (Zero Code Changes) [#with-headroom-proxy-zero-code-changes] ```bash # Start the proxy headroom proxy --port 8787 # Connect your client -- images are compressed automatically ANTHROPIC_BASE_URL=http://localhost:8787 claude ``` With HeadroomClient [#with-headroomclient] ```python from headroom import HeadroomClient client = HeadroomClient(provider="openai") response = client.chat.completions.create( model="gpt-4o", messages=[{ "role": "user", "content": [ {"type": "text", "text": "What animal is this?"}, {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}} ] }] ) # Image automatically compressed with detail="low" (87% savings) ``` Direct API [#direct-api] ```python from headroom.image import ImageCompressor compressor = ImageCompressor() # Compress images in messages compressed_messages = compressor.compress(messages, provider="openai") # Check savings print(f"Saved {compressor.last_savings:.0f}% tokens") print(f"Technique: {compressor.last_result.technique.value}") ``` Provider Support [#provider-support] The compressor adapts its strategy per provider: | Provider | Compression Method | Details | | ----------------- | ------------------- | ------------------------------------------ | | **OpenAI** | Sets `detail="low"` | Native detail parameter | | **Anthropic** | Resizes to 512px | PIL-based resize | | **Google Gemini** | Resizes to 768px | Optimized for Gemini's 768x768 tile system | Token Savings by Provider [#token-savings-by-provider] **OpenAI** (1024x1024 image): | Technique | Before | After | Savings | | ---------- | ---------- | ---------- | ------- | | `full_low` | 765 tokens | 85 tokens | 89% | | `preserve` | 765 tokens | 765 tokens | 0% | **Anthropic** (1024x1024 image): | Before | After | Savings | | -------------- | ------------ | ------- | | \~1,398 tokens | \~349 tokens | 75% | **Google Gemini** (1536x1536 image): | Before | After | Savings | | ---------------------- | 
------------------- | ------- | | 1,032 tokens (4 tiles) | 258 tokens (1 tile) | 75% | Configuration [#configuration] ```python from headroom.image import ImageCompressor compressor = ImageCompressor( model_id="chopratejas/technique-router", # HuggingFace model use_siglip=True, # Enable image analysis device="cuda", # Use GPU if available (auto, cuda, cpu, mps) ) ``` Proxy Configuration [#proxy-configuration] ```bash # Enable image compression (default) headroom proxy --image-optimize # Disable image compression headroom proxy --no-image-optimize ``` Performance [#performance] | Metric | Value | | ------------------- | ------------------------------------ | | Router inference | \~10ms (CPU), \~2ms (GPU) | | Image resize | \~5-20ms | | First request | +2-3s (model download, cached after) | | Router accuracy | 93.7% | | Model size | \~128MB | | GPU memory (SigLIP) | \~400MB | When using the Headroom proxy, image compression happens automatically on every request that contains images. No code changes needed. # Introduction (/docs) Headroom compresses everything your AI agent reads -- tool outputs, database results, file reads, RAG retrievals, API responses -- before it reaches the LLM. The model sees less noise, responds faster, and costs less. Quick preview [#quick-preview] ```ts twoslash import { compress } from 'headroom-ai'; const messages = [ { role: 'user' as const, content: 'Analyze these results' }, ]; const result = await compress(messages, { model: 'gpt-4o' }); console.log(`Saved ${result.tokensSaved} tokens (${(result.compressionRatio * 100).toFixed(0)}%)`); ``` ```python from headroom import compress result = compress(messages, model="gpt-4o") response = client.messages.create( model="gpt-4o", messages=result.messages, ) print(f"Saved {result.tokens_saved} tokens ({result.compression_ratio:.0%})") ``` Community stats [#community-stats] What gets compressed [#what-gets-compressed] | Content type | What happens | Typical savings | | -------------------------- | ------------------------------------------------------------ | --------------- | | JSON arrays (tool outputs) | Statistical analysis keeps errors, anomalies, boundaries | 70--90% | | Source code | AST-aware compression preserves signatures, collapses bodies | 40--70% | | Build/test logs | Keeps failures and errors, drops passing noise | 80--95% | | Search results | Ranks by relevance, keeps top matches | 60--80% | | Plain text | ModernBERT token classification removes redundancy | 30--50% | | Git diffs | Preserves change hunks, drops unchanged context | 40--60% | | Images | ML router selects optimal resize/quality tradeoff | 40--90% | Where Headroom fits [#where-headroom-fits] ``` Your Agent / App | | tool outputs, logs, DB reads, RAG results, file reads, API responses v Headroom <-- proxy, Python library, TS SDK, or framework integration | v LLM Provider (OpenAI, Anthropic, Google, Bedrock, 100+ via LiteLLM) ``` Headroom works as a **transparent proxy** (zero code changes), a **Python function** (`compress()`), a **TypeScript function** (`compress()`), or a **framework integration** (LangChain, Agno, Strands, LiteLLM, Vercel AI SDK, MCP). Real-world results [#real-world-results] **100 production log entries. One critical error buried at position 67.** | Metric | Baseline | Headroom | | --------------- | -------- | -------- | | Input tokens | 10,144 | 1,260 | | Correct answers | **4/4** | **4/4** | 87.6% fewer tokens. Same answer. 
The FATAL error was automatically preserved -- not by keyword matching, but by statistical analysis of field variance. | Scenario | Before | After | Savings | | ------------------------- | ------ | ------ | ------- | | Code search (100 results) | 17,765 | 1,408 | **92%** | | SRE incident debugging | 65,694 | 5,118 | **92%** | | Codebase exploration | 78,502 | 41,254 | **47%** | | GitHub issue triage | 54,174 | 14,761 | **73%** | Key Features [#key-features] Framework Integrations [#framework-integrations] Nothing is lost [#nothing-is-lost] Compressed content goes into the CCR store (Compress-Cache-Retrieve). The LLM gets a `headroom_retrieve` tool and can fetch full originals when it needs more detail. Compression is aggressive but reversible. Next steps [#next-steps] # Installation (/docs/installation) Python [#python] Headroom requires **Python 3.10+** and is published as `headroom-ai` on PyPI. Core package [#core-package] ```bash pip install headroom-ai ``` The core package includes the `compress()` function, SmartCrusher, CacheAligner, and IntelligentContext. No heavy dependencies. Extras [#extras] Install only what you need, or grab everything with `[all]`: ```bash pip install "headroom-ai[all]" ``` | Extra | What it adds | Install command | | ----------- | ----------------------------------------------------------------------------- | -------------------------------------- | | `proxy` | Proxy server, MCP tools, HTTP API | `pip install "headroom-ai[proxy]"` | | `ml` | Kompress (ModernBERT text compression, requires PyTorch) | `pip install "headroom-ai[ml]"` | | `code` | CodeCompressor (tree-sitter AST parsing) | `pip install "headroom-ai[code]"` | | `mcp` | MCP server tools (`headroom_compress`, `headroom_retrieve`, `headroom_stats`) | `pip install "headroom-ai[mcp]"` | | `langchain` | LangChain `HeadroomChatModel` wrapper | `pip install "headroom-ai[langchain]"` | | `agno` | Agno `HeadroomAgnoModel` wrapper | `pip install "headroom-ai[agno]"` | | `evals` | Evaluation framework (GSM8K, SQuAD, BFCL benchmarks) | `pip install "headroom-ai[evals]"` | | `all` | Everything above | `pip install "headroom-ai[all]"` | You can combine extras: ```bash pip install "headroom-ai[proxy,langchain,ml]" ``` Verify the install [#verify-the-install] ```bash python -c "import headroom; print(headroom.__version__)" ``` TypeScript / Node.js [#typescript--nodejs] The TypeScript SDK is published as `headroom-ai` on npm. It requires **Node.js 18+**. ```bash npm install headroom-ai ``` Or with other package managers: ```bash pnpm add headroom-ai yarn add headroom-ai ``` The TypeScript SDK sends messages to the Headroom proxy over HTTP for compression. The proxy runs the full compression pipeline (Python). Start it before using the SDK: ```bash pip install "headroom-ai[proxy]" headroom proxy --port 8787 ``` Then point the SDK at it: ```ts import { compress } from 'headroom-ai'; const result = await compress(messages, { baseUrl: 'http://localhost:8787', }); ``` Verify the install [#verify-the-install-1] ```bash node -e "const h = require('headroom-ai'); console.log('headroom-ai loaded')" ``` Docker [#docker] Pre-built images are published to GitHub Container Registry on every release. 
```bash docker pull ghcr.io/chopratejas/headroom:latest docker run -p 8787:8787 ghcr.io/chopratejas/headroom:latest ``` Image tags [#image-tags] | Tag | Extras | Base image | Description | | ------------------- | ------------ | ----------- | ----------------------------------------- | | `latest` | `proxy` | Debian slim | Default image, runs the proxy | | `` | `proxy` | Debian slim | Pinned version | | `nonroot` | `proxy` | Debian slim | Runs as non-root user | | `code` | `proxy,code` | Debian slim | Includes tree-sitter for code compression | | `code-nonroot` | `proxy,code` | Debian slim | Code compression, non-root | | `slim` | `proxy` | Distroless | Minimal image, no shell | | `slim-nonroot` | `proxy` | Distroless | Minimal, non-root | | `code-slim` | `proxy,code` | Distroless | Code compression, minimal | | `code-slim-nonroot` | `proxy,code` | Distroless | Code compression, minimal, non-root | Build from source [#build-from-source] Use Docker Bake for multi-variant builds: ```bash # List all targets docker buildx bake --list targets # Build the default runtime image docker buildx bake runtime-default # Build a specific variant with custom registry docker buildx bake runtime-code-slim-nonroot \ --set '*.tags=my-registry/headroom:code-slim-nonroot' ``` Environment variables [#environment-variables] These variables configure Headroom at runtime. Set them in your shell, `.env` file, or container environment. LLM provider keys [#llm-provider-keys] | Variable | Description | | --------------------------------------------- | --------------------------------------------------- | | `OPENAI_API_KEY` | OpenAI API key (used when proxying to OpenAI) | | `ANTHROPIC_API_KEY` | Anthropic API key (used when proxying to Anthropic) | | `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` | AWS credentials for Bedrock backend | | `GOOGLE_APPLICATION_CREDENTIALS` | Google Cloud credentials for Vertex AI backend | Proxy configuration [#proxy-configuration] | Variable | Default | Description | | -------------------- | ---------- | --------------------------------------------------- | | `HEADROOM_PORT` | `8787` | Port the proxy listens on | | `HEADROOM_HOST` | `0.0.0.0` | Host the proxy binds to | | `HEADROOM_MODE` | `optimize` | Default mode: `optimize`, `audit`, or `passthrough` | | `HEADROOM_LOG_LEVEL` | `INFO` | Logging level | TypeScript SDK [#typescript-sdk] | Variable | Default | Description | | ------------------- | ----------------------- | ---------------------------------- | | `HEADROOM_BASE_URL` | `http://localhost:8787` | Proxy URL for the TypeScript SDK | | `HEADROOM_API_KEY` | *(none)* | API key if the proxy requires auth | Next steps [#next-steps] # LangChain (/docs/langchain) Headroom integrates with LangChain to compress context across all LangChain patterns: chat models, memory, retrievers, agents, and streaming. 
Installation [#installation] ```bash pip install "headroom-ai[langchain]" ``` Quick start [#quick-start] Wrap any chat model in one line: ```python from langchain_openai import ChatOpenAI from headroom.integrations import HeadroomChatModel llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o")) # Use exactly like before response = llm.invoke("Hello!") # Check savings print(llm.get_metrics()) # {'tokens_saved': 12500, 'savings_percent': 45.2, 'requests': 50} ``` Works with any provider: ```python from langchain_anthropic import ChatAnthropic llm = HeadroomChatModel(ChatAnthropic(model="claude-sonnet-4-20250514")) ``` Memory integration [#memory-integration] `HeadroomChatMessageHistory` wraps any chat history with automatic compression. Long conversations stay under your token budget: ```python from langchain.memory import ConversationBufferMemory from langchain_community.chat_message_histories import ChatMessageHistory from headroom.integrations import HeadroomChatMessageHistory base_history = ChatMessageHistory() compressed_history = HeadroomChatMessageHistory( base_history, compress_threshold_tokens=4000, # Compress when over 4K tokens keep_recent_turns=5, # Always keep last 5 turns ) memory = ConversationBufferMemory(chat_memory=compressed_history) ``` After usage: ```python print(compressed_history.get_compression_stats()) # {'compression_count': 12, 'total_tokens_saved': 28000} ``` Retriever integration [#retriever-integration] `HeadroomDocumentCompressor` filters retrieved documents by relevance. Retrieve many for recall, keep the best for precision: ```python from langchain.retrievers import ContextualCompressionRetriever from langchain_community.vectorstores import FAISS from headroom.integrations import HeadroomDocumentCompressor base_retriever = vectorstore.as_retriever(search_kwargs={"k": 50}) compressor = HeadroomDocumentCompressor( max_documents=10, min_relevance=0.3, prefer_diverse=True, # MMR-style diversity ) retriever = ContextualCompressionRetriever( base_compressor=compressor, base_retriever=base_retriever, ) # Retrieves 50 docs, returns best 10 docs = retriever.invoke("What is Python?") ``` Agent tool wrapping [#agent-tool-wrapping] `wrap_tools_with_headroom` compresses tool outputs before they re-enter the agent's context: ```python from langchain_core.tools import tool from headroom.integrations import wrap_tools_with_headroom @tool def search_database(query: str) -> str: """Search the database.""" return json.dumps({"results": [...], "total": 1000}) wrapped_tools = wrap_tools_with_headroom( [search_database], min_chars_to_compress=1000, ) agent = create_openai_tools_agent(llm, wrapped_tools, prompt) executor = AgentExecutor(agent=agent, tools=wrapped_tools) ``` Per-tool metrics: ```python from headroom.integrations import get_tool_metrics metrics = get_tool_metrics() print(metrics.get_summary()) # {'total_invocations': 25, 'total_compressions': 18, 'total_chars_saved': 450000} ``` LangGraph ReAct agent [#langgraph-react-agent] ```python from langchain_openai import ChatOpenAI from langgraph.prebuilt import create_react_agent from headroom.integrations import HeadroomChatModel, wrap_tools_with_headroom llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o")) tools = wrap_tools_with_headroom([search_web, query_database]) agent = create_react_agent(llm, tools) result = agent.invoke({ "messages": [("user", "Find users who signed up last week")] }) ``` LangGraph custom graph [#langgraph-custom-graph] Insert a compression node between tools and the agent in a custom `StateGraph`: 
```python from langgraph.graph import StateGraph, MessagesState, START, END from headroom.integrations.langchain import create_compress_tool_messages_node graph = StateGraph(MessagesState) graph.add_node("agent", agent_node) graph.add_node("tools", tools_node) graph.add_node("compress", create_compress_tool_messages_node( min_tokens_to_compress=100, )) # Wire: tools -> compress -> agent graph.add_edge(START, "agent") graph.add_edge("tools", "compress") graph.add_edge("compress", "agent") ``` Streaming [#streaming] Full async support: ```python # Async invoke response = await llm.ainvoke("Hello!") # Async streaming async for chunk in llm.astream("Tell me a story"): print(chunk.content, end="", flush=True) ``` Custom configuration [#custom-configuration] ```python from headroom import HeadroomConfig, HeadroomMode config = HeadroomConfig( default_mode=HeadroomMode.OPTIMIZE, smart_crusher_target_ratio=0.3, ) llm = HeadroomChatModel( ChatOpenAI(model="gpt-4o"), headroom_config=config, ) ``` # Limitations (/docs/limitations) Headroom is designed to compress LLM context without losing accuracy. This page documents when it helps, when it does not, and the safety gates that prevent harmful compression. When Headroom Helps vs. Does Not [#when-headroom-helps-vs-does-not] | Content Type | Compression | Latency Impact | Best For | | ------------------------------------------------------------------ | ----------- | -------------------------------- | --------------------------- | | **JSON: Arrays of dicts** (search results, API responses, DB rows) | 86--100% | Net latency win on Sonnet/Opus | Primary use case | | **JSON: Arrays of strings** (file paths, log lines, tags) | 60--90% | Net latency win | String dedup + sampling | | **JSON: Arrays of numbers** (metrics, time series) | 70--85% | Net latency win | Statistical summary | | **JSON: Mixed-type arrays** | 50--70% | Net latency win | Group-by-type compression | | **Structured logs** (as JSON) | 82--95% | Net latency win | Log entries in tool outputs | | **Agentic conversations** (25--50 turns) | 56--81% | Break-even to net win | Multi-tool agent sessions | | **Plain text** (documentation, articles) | 43--46% | Adds latency (cost savings only) | Cost optimization | | **Code** | Passthrough | Minimal overhead | See below | | **RAG document contexts** | Passthrough | Minimal overhead | Not compressed | Where Headroom Adds the Most Value [#where-headroom-adds-the-most-value] * Long agent sessions with accumulated tool outputs (40--80% compression) * JSON-heavy workflows -- API responses, database queries (83--94% compression) * Build and test output (85--94% compression) * Multi-tool agents (60--76% compression across tool results) Where Headroom Adds Little Value [#where-headroom-adds-little-value] * Short conversational exchanges (median 4.8% compression) * Code-only sessions (reading/writing files) -- code passes through * Single-turn requests with no accumulated context What Headroom Does NOT Compress [#what-headroom-does-not-compress] * **Short messages** (\< 300 tokens) -- overhead exceeds savings * **Source code** -- passes through unchanged to preserve correctness * **grep/search results** -- compact structured format, already minimal * **Images** -- counted at fixed token cost (\~1,600 tokens), not compressed * **System prompts** -- preserved for prefix cache compatibility Code Compression [#code-compression] Headroom includes an AST-aware CodeCompressor (tree-sitter, 8 languages) but it is gated behind safety protections that prevent it from 
firing in most real-world scenarios. This is intentional. **Why code mostly passes through:** 1. **Word count gate**: Content under 50 words is silently skipped 2. **Recent code protection** (`protect_recent_code=4`): Code in the last 4 messages is never compressed 3. **Analysis intent protection** (`protect_analysis_context=True`): If the most recent user message contains keywords like "analyze", "review", "explain", "fix", "debug" -- ALL code in the conversation is protected **Why this is the right default**: Code is almost always fetched because the user wants to work with it. Compressing function bodies would remove exactly what they need. **Where code savings come from**: The IntelligentContextManager drops old code messages that are no longer relevant (scoring-based), which is a better strategy than stripping function bodies. **Override**: Set `protect_analysis_context=False` in `ContentRouterConfig` for aggressive code compression. Requires `headroom-ai[code]` for tree-sitter. JSON Compression Constraints [#json-compression-constraints] What Gets Compressed [#what-gets-compressed] * Arrays of **dicts**: Full statistical analysis with adaptive K (Kneedle algorithm) * Arrays of **strings**: Dedup + adaptive sampling + error preservation * Arrays of **numbers**: Statistical summary + outlier/change-point preservation * **Mixed-type** arrays: Grouped by type, each group compressed independently * **Nested** objects: Recursed into, arrays within are compressed (up to depth 5) What Passes Through [#what-passes-through] * Arrays below 5 items (`min_items_to_analyze`) * Content below 200 tokens (`min_tokens_to_crush`) * Bool-only arrays * JSON objects without array values * Malformed JSON (silently passes through, no error) Edge Cases [#edge-cases] * **NaN/Infinity** in numeric fields: Filtered out before statistics are computed * **Nesting depth > 5**: Inner arrays not examined for compression * **Mixed-type arrays with small groups**: Groups below `min_items_to_analyze` are kept as-is Safety Gates [#safety-gates] All compressors follow the same principle: **fail gracefully, return original content unchanged**. * Invalid JSON passes through (no error raised) * AST parse failure falls back to original or LLMLingua * Compression that makes output larger returns the original * Missing optional dependencies (tree-sitter, LLMLingua) cause a passthrough with warning log * Errors are logged at WARNING level and never propagated to callers LLMLingua out-of-memory during model loading raises a `RuntimeError`. All other failures are silently handled. Adaptive K: How Item Retention Works [#adaptive-k-how-item-retention-works] SmartCrusher does not use fixed K values. It uses information-theoretic sizing: 1. **Kneedle algorithm** on bigram coverage curves finds the point where adding more items stops providing new information 2. **SimHash** fingerprinting detects near-duplicate items 3. **zlib validation** ensures the subset captures the full set's diversity The resulting K is split: 30% from array start, 15% from end, 55% for importance-scored items. **Safety guarantees (additive, never dropped):** * Error items (containing "error", "exception", "failed", "critical") -- across ALL array types * Numeric anomalies (> 2 standard deviations from mean) * String length anomalies (> 2 standard deviations from mean length) * Change points (sudden shifts in running values) These are kept even if they exceed the K budget. 
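To make the split and the additive guarantees concrete, here is a toy selection sketch; the real SmartCrusher replaces the fixed K and the naive "importance" slice below with Kneedle, SimHash, and zlib-based sizing:

```python
# Toy illustration of K-splitting and additive safety preservation.
# Not the real SmartCrusher: K is fixed and "importance" is a simple stand-in.
import statistics

def select_items(items: list[str], k: int = 15) -> list[str]:
    if len(items) <= k:
        return items

    head = items[: round(k * 0.30)]                     # ~30% from the array start
    tail = items[len(items) - round(k * 0.15):]         # ~15% from the end
    middle_budget = k - len(head) - len(tail)           # remaining ~55%
    middle = items[len(head): len(items) - len(tail)][:middle_budget]  # importance-scored in reality
    kept = head + middle + tail

    # Additive safety guarantees: error items and length anomalies are appended
    # even if the K budget is already spent.
    lengths = [len(s) for s in items]
    mean = statistics.mean(lengths)
    stdev = statistics.pstdev(lengths) or 1.0
    for item in items:
        is_error = any(w in item.lower() for w in ("error", "exception", "failed", "critical"))
        is_anomaly = abs(len(item) - mean) > 2 * stdev
        if (is_error or is_anomaly) and item not in kept:
            kept.append(item)
    return kept
```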
Configuration Tuning [#configuration-tuning] | Parameter | Default | Effect | | --------------------------- | ------- | ------------------------------------------------------- | | `min_items_to_analyze` | 5 | Arrays below this pass through | | `min_tokens_to_crush` | 200 | Content below this passes through | | `max_items_after_crush` | 15 | Upper bound on retained items | | `variance_threshold` | 2.0 | Std devs for anomaly detection (lower = more preserved) | | `protect_analysis_context` | True | Protect code when user asks about it | | `protect_recent_code` | 4 | Messages from end to protect code in | | `skip_user_messages` | True | Never compress user messages | | `toin_confidence_threshold` | 0.3 | Minimum TOIN confidence to apply hints | Provider Interactions [#provider-interactions] * CacheAligner maximizes Anthropic/OpenAI prefix cache hit rates * Token counting uses model-specific tokenizers (tiktoken for OpenAI, calibrated estimation for Anthropic) * Compression works with all providers -- no provider-specific limitations * Compressed content is valid JSON -- downstream tools and parsers work unchanged TOIN Cold Start [#toin-cold-start] The Tool Output Intelligence Network (TOIN) learns compression patterns from usage. For new tool types: * No learned patterns exist -- falls back to statistical heuristics * Confidence below `toin_confidence_threshold` (default 0.3) -- TOIN hints ignored * Patterns build up over time as tools are used repeatedly * Cross-session learning requires persistence (`TelemetryConfig.storage_path`) # LiteLLM (/docs/litellm) Headroom integrates with [LiteLLM](https://github.com/BerriAI/litellm) as a callback that compresses messages before they reach any provider. One line to enable, works with all 100+ LiteLLM-supported providers. Installation [#installation] ```bash pip install headroom-ai litellm ``` Quick start [#quick-start] ```python import litellm from headroom.integrations.litellm_callback import HeadroomCallback litellm.callbacks = [HeadroomCallback()] # All calls now compressed automatically response = litellm.completion(model="gpt-4o", messages=[...]) response = litellm.completion(model="bedrock/claude-sonnet", messages=[...]) response = litellm.completion(model="azure/gpt-4o", messages=[...]) ``` The callback compresses messages in LiteLLM's `pre_call_hook` before they reach the provider. How it works [#how-it-works] 1. You call `litellm.completion()` with your messages 2. `HeadroomCallback.pre_call_hook` compresses the messages 3. LiteLLM sends the compressed messages to the provider 4. The response comes back unchanged This works with every provider LiteLLM supports: OpenAI, Anthropic, Bedrock, Azure, Vertex AI, Cohere, Groq, Mistral, Together, Ollama, and more. 
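Putting the pieces together, a minimal end-to-end sketch (the oversized tool output below is fabricated for illustration):

```python
# Register the callback once, then send an oversized JSON payload through
# litellm.completion(). The payload here is made up for illustration.
import json
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback

litellm.callbacks = [HeadroomCallback()]

big_result = json.dumps({
    "checks": [{"name": f"check-{i}", "status": "passed"} for i in range(500)]
})

messages = [
    {"role": "user", "content": "Which checks need attention?\n\n" + big_result},
]

# The pre_call_hook compresses the messages before they reach the provider.
response = litellm.completion(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```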
With LiteLLM Proxy [#with-litellm-proxy] If you run LiteLLM as a proxy server, use the ASGI middleware: ```python from litellm.proxy.proxy_server import app from headroom.integrations.asgi import CompressionMiddleware app.add_middleware(CompressionMiddleware) ``` Or configure via YAML: ```yaml # litellm_config.yaml litellm_settings: callbacks: ["headroom.integrations.litellm_callback.HeadroomCallback"] ``` Direct compress() with LiteLLM [#direct-compress-with-litellm] You can also use `compress()` directly instead of the callback: ```python import litellm from headroom import compress messages = [{"role": "user", "content": large_content}] compressed = compress(messages, model="bedrock/claude-sonnet") response = litellm.completion( model="bedrock/claude-sonnet", messages=compressed.messages, ) print(f"Saved {compressed.tokens_saved} tokens") ``` ASGI middleware [#asgi-middleware] Drop-in middleware for any ASGI application. Intercepts `/v1/messages`, `/v1/chat/completions`, `/v1/responses`, and `/chat/completions`: ```python from fastapi import FastAPI from headroom.integrations.asgi import CompressionMiddleware app = FastAPI() app.add_middleware(CompressionMiddleware) ``` Response headers include `x-headroom-compressed: true` and `x-headroom-tokens-saved: 1234`. # MCP Tools (/docs/mcp) Headroom's MCP server exposes compression, retrieval, and observability as tools that any MCP-compatible AI coding tool can call -- Claude Code, Cursor, Codex, and more. No proxy required. Installation [#installation] ```bash # MCP tools only (lightweight) pip install "headroom-ai[mcp]" # Or with the proxy pip install "headroom-ai[proxy]" ``` Setup for Claude Code [#setup-for-claude-code] ```bash # Register with Claude Code (one-time) headroom mcp install # Start Claude Code — it now has headroom tools claude ``` Claude Code can now compress content on demand, retrieve originals, and check session stats. For automatic compression of **all** traffic, also run the proxy: ```bash # Terminal 1 headroom proxy # Terminal 2 ANTHROPIC_BASE_URL=http://127.0.0.1:8787 claude ``` Tools [#tools] headroom_compress [#headroom_compress] Compress content on demand. The LLM calls this when it wants to shrink large content before reasoning over it. **Parameters:** * `content` (required) -- text to compress (files, JSON, logs, search results) **Returns:** * `compressed` -- compressed text * `hash` -- key for retrieving the original later * `original_tokens` / `compressed_tokens` / `savings_percent` * `transforms` -- which compression algorithms were applied Example flow: ``` Claude: Let me compress this large output to save context space. -> headroom_compress(content="[5000 lines of grep results...]") <- { "compressed": "[key matches with context...]", "hash": "a1b2c3d4e5f6...", "original_tokens": 12000, "compressed_tokens": 3200, "savings_percent": 73.3, "transforms": ["router:search:0.27"] } ``` The original is stored locally for 1 hour. If the LLM needs the full content later, it calls `headroom_retrieve`. headroom_retrieve [#headroom_retrieve] Retrieve original uncompressed content by hash. **Parameters:** * `hash` (required) -- hash key from a previous compression * `query` (optional) -- search within the original to return only matching items **Returns:** * `original_content` (full retrieval) or `results` (filtered search) * `source` -- `"local"` or `"proxy"` Retrieval checks the local store first, then falls back to the proxy's store. Hashes from either source work transparently. 
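An illustrative retrieval flow, mirroring the `headroom_compress` example above (the wording is invented; the fields follow the Returns list):

```
Claude: I need the full matches for that earlier search.
  -> headroom_retrieve(hash="a1b2c3d4e5f6...", query="timeout")
  <- {
       "results": "[items from the original that match 'timeout'...]",
       "source": "local"
     }
```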
headroom_stats [#headroom_stats] Session compression statistics. **Returns:** * `compressions`, `retrievals`, `tokens_saved`, `savings_percent` * `estimated_cost_saved_usd` * `recent_events` -- last 10 compression/retrieval events * `sub_agents` -- stats from sub-agent MCP instances * `combined` -- main + sub-agent totals * `proxy` -- request count, cache hits, cost saved (if proxy is running) Sub-agent stats are aggregated via a shared stats file at `~/.headroom/session_stats.jsonl`. Streamable HTTP Transport (Remote / Docker) [#streamable-http-transport-remote--docker] For agents running on a different machine than the Headroom proxy (e.g., Docker, cloud), MCP tools are available over HTTP using the MCP Streamable HTTP protocol. Proxy auto-exposes /mcp [#proxy-auto-exposes-mcp] When you run `headroom proxy`, MCP tools are automatically available at `/mcp`: ```bash headroom proxy # → http://host:8787/mcp ``` Remote agents connect with: ```json { "mcpServers": { "headroom": { "url": "http://proxy-host:8787/mcp" } } } ``` Standalone HTTP server [#standalone-http-server] Run MCP tools without the full proxy: ```bash headroom mcp serve --transport http --port 8080 ``` Remote install [#remote-install] Configure Claude Code to use remote MCP over HTTP: ```bash headroom mcp install --remote http://proxy-host:8787/mcp ``` This writes URL-based config instead of the default command-based config. Protocol [#protocol] The Streamable HTTP transport implements the MCP specification: * `POST /mcp` -- Send JSON-RPC requests (tool calls, list tools) * `GET /mcp` -- Server-sent events stream (server-initiated messages) * `DELETE /mcp` -- Terminate session Stateless mode by default -- each request is independent, no session tracking needed. CLI commands [#cli-commands] ```bash # Install — local (stdio, default) headroom mcp install # Install — remote (HTTP, for Docker/network) headroom mcp install --remote http://proxy-host:8787/mcp # Install — custom proxy URL headroom mcp install --proxy-url http://host:9000 # Overwrite existing config headroom mcp install --force # Serve — stdio (default, called by Claude Code) headroom mcp serve # Serve — HTTP (for remote agents) headroom mcp serve --transport http --port 8080 # Serve — debug mode headroom mcp serve --debug # Check status headroom mcp status # Uninstall headroom mcp uninstall ``` Cross-tool compatibility [#cross-tool-compatibility] | Tool | Transport | Setup | | -------------------- | ------------ | ---------------------------------------------------- | | Claude Code (local) | stdio | `headroom mcp install` | | Claude Code (remote) | HTTP | `headroom mcp install --remote http://host:8787/mcp` | | Cursor | stdio / HTTP | Add to MCP settings | | Docker agents | HTTP | Point to `http://proxy:8787/mcp` | | Any MCP host | stdio / HTTP | `headroom mcp serve` or `--transport http` | Architecture [#architecture] MCP only (no proxy) [#mcp-only-no-proxy] The LLM calls `headroom_compress` on demand. Compression happens locally in the MCP process. Originals are stored in a local `CompressionStore` with 1-hour TTL. MCP + Proxy (full setup) [#mcp--proxy-full-setup] The proxy compresses all traffic at the HTTP level (before the LLM sees content). MCP tools operate after the LLM receives content. They handle different data and do not double-compress. `headroom_retrieve` checks the local store first, then falls back to the proxy's store. 
Remote (HTTP transport) [#remote-http-transport]

```
Remote Agent (any machine)
    |
    | POST /mcp (JSON-RPC)
    v
Headroom Proxy :8787/mcp (Streamable HTTP)
    |
    | in-process access
    v
Compression Pipeline + CompressionStore
```

The proxy's `/mcp` endpoint shares the same compression store and pipeline as the proxy itself -- no HTTP round-trips to self.

Troubleshooting [#troubleshooting]

**"MCP SDK not installed"** -- Run `pip install "headroom-ai[mcp]"`.

**"Proxy not running"** -- Start the proxy with `headroom proxy` in another terminal. Only needed for proxy-backed retrieval.

**"Entry not found or expired"** -- Local content expires after 1 hour, proxy content after 5 minutes.

**Claude doesn't see headroom tools** -- Run `headroom mcp status`, restart Claude Code, and verify with `/mcp` inside Claude Code.

# Persistent Memory (/docs/memory)

LLMs have two fundamental limitations: context windows overflow with too much history, and every conversation starts from zero. Persistent Memory solves both by extracting key facts, persisting them, and injecting them when relevant.

This is **temporal compression** -- instead of carrying 10,000 tokens of conversation history, carry 100 tokens of extracted memories.

Quick Start [#quick-start]

```python
from openai import OpenAI
from headroom import with_memory

# One line -- that's it
client = with_memory(OpenAI(), user_id="alice")

# Use exactly like normal
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I prefer Python for backend work"}]
)
# Memory extracted INLINE -- zero extra latency

# Later, in a new conversation...
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What language should I use?"}]
)
# Response uses the Python preference from memory
```

How It Works [#how-it-works]

The `with_memory()` wrapper intercepts every chat completion call:

1. **Inject** -- Semantic search finds relevant memories and prepends them to the user message
2. **Instruct** -- Adds a memory extraction instruction to the system prompt
3. **Call** -- Forwards the request to the LLM
4. **Parse** -- Extracts the memory block from the response
5. **Store** -- Saves with embeddings, vector index, and full-text search index
6. **Return** -- Cleans the response (strips the memory block before returning)

Memory extraction happens **inline** as part of the LLM response. No extra API calls, no extra latency.
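A simplified sketch of that loop -- the helper names (`search_memories`, `store_memory`) and the `MEMORY:` marker are placeholders, not the wrapper's actual protocol; use `with_memory()` for the real behavior:

```python
# Simplified inject -> instruct -> call -> parse -> store -> return loop.
# search_memories/store_memory and the MEMORY: marker are placeholders; the
# extraction format used by with_memory() is internal to Headroom.
import re

def chat_with_memory(client, user_id, messages, search_memories, store_memory):
    # 1. Inject: prepend relevant memories to the latest user message.
    relevant = search_memories(user_id, messages[-1]["content"], top_k=5)
    if relevant:
        preamble = "Relevant memories:\n" + "\n".join(f"- {m}" for m in relevant)
        messages = messages[:-1] + [
            {**messages[-1], "content": preamble + "\n\n" + messages[-1]["content"]}
        ]

    # 2. Instruct: ask the model to emit new durable facts on MEMORY: lines.
    system = {"role": "system",
              "content": "After answering, add a MEMORY: line for any new durable fact."}

    # 3. Call the LLM as normal.
    response = client.chat.completions.create(model="gpt-4o", messages=[system] + messages)
    text = response.choices[0].message.content

    # 4-5. Parse and store extracted memories.
    for fact in re.findall(r"^MEMORY:\s*(.+)$", text, flags=re.MULTILINE):
        store_memory(user_id, fact)

    # 6. Return the cleaned response (memory lines stripped).
    return re.sub(r"^MEMORY:.*$", "", text, flags=re.MULTILINE).strip()
```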
Hierarchical Scoping [#hierarchical-scoping]

Memories exist at four scope levels, from broadest to narrowest:

| Scope | Persists Across | Use Case |
| ----------- | ------------------------ | ------------------------------- |
| **User** | All sessions, all time | Long-term preferences, identity |
| **Session** | Current session only | Current task context |
| **Agent** | Current agent in session | Agent-specific context |
| **Turn** | Single turn only | Ephemeral working memory |

```python
from openai import OpenAI
from headroom import with_memory

# Session 1: Morning
client1 = with_memory(
    OpenAI(),
    user_id="bob",
    session_id="morning-session",
)
response = client1.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I prefer Go for performance-critical code"}]
)
# Memory stored at USER level (persists across sessions)

# Session 2: Afternoon (different session, same user)
client2 = with_memory(
    OpenAI(),
    user_id="bob",
    session_id="afternoon-session",
)
response = client2.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What language for my new microservice?"}]
)
# Recalls Go preference from morning session
```

Memory Categories [#memory-categories]

Memories are categorized for better organization and retrieval:

| Category | Description | Examples |
| ------------ | ------------------------------------- | ------------------------------------------------- |
| `PREFERENCE` | Likes, dislikes, preferred approaches | "Prefers Python", "Likes dark mode" |
| `FACT` | Identity, role, constraints | "Works at fintech startup", "Senior engineer" |
| `CONTEXT` | Current goals, ongoing tasks | "Migrating to microservices", "Working on auth" |
| `ENTITY` | Information about entities | "Project Apollo uses React", "Team lead is Sarah" |
| `DECISION` | Decisions made | "Chose PostgreSQL over MySQL" |
| `INSIGHT` | Derived insights | "User tends to prefer typed languages" |

Memory API [#memory-api]

The `with_memory()` wrapper exposes a `.memory` attribute for direct access:

```python
client = with_memory(OpenAI(), user_id="alice")

# Search memories (semantic)
results = client.memory.search("python preferences", top_k=5)
for memory in results:
    print(f"{memory.content}")

# Add a memory manually
client.memory.add(
    "User is a senior engineer",
    category="fact",
    importance=0.9,
)

# Get all memories for this user
all_memories = client.memory.get_all()

# Clear all memories
client.memory.clear()

# Get stats
stats = client.memory.stats()
print(f"Total memories: {stats['total']}")
print(f"By category: {stats['categories']}")
```

Temporal Versioning [#temporal-versioning]

When facts change, Headroom creates a **supersession chain** that preserves history:

```python
from headroom.memory import HierarchicalMemory, MemoryCategory, MemoryFilter

memory = await HierarchicalMemory.create()

# Original fact
orig = await memory.add(
    content="User works at Google",
    user_id="alice",
    category=MemoryCategory.FACT,
)

# User changes jobs -- supersede the old memory
new = await memory.supersede(
    old_memory_id=orig.id,
    new_content="User now works at Anthropic",
)

# Query current state (excludes superseded by default)
current = await memory.query(MemoryFilter(
    user_id="alice",
    include_superseded=False,
))
# Returns only "User now works at Anthropic"

# Get the full chain
chain = await memory.get_history(new.id)
# [
#   Memory(content="User works at Google", is_current=False),
#   Memory(content="User now works at Anthropic", is_current=True),
# ]
```

This gives you an audit trail, the ability to
debug why the LLM made certain decisions, and rollback if needed. Backends [#backends] Embedder Backends [#embedder-backends] ```python from headroom.memory import MemoryConfig, EmbedderBackend # Local embeddings (recommended -- fast, free, private) config = MemoryConfig( embedder_backend=EmbedderBackend.LOCAL, embedder_model="all-MiniLM-L6-v2", ) # OpenAI embeddings (higher quality, costs money) config = MemoryConfig( embedder_backend=EmbedderBackend.OPENAI, openai_api_key="sk-...", embedder_model="text-embedding-3-small", ) # Ollama embeddings (local server, many models) config = MemoryConfig( embedder_backend=EmbedderBackend.OLLAMA, ollama_base_url="http://localhost:11434", embedder_model="nomic-embed-text", ) ``` Storage [#storage] Storage uses **SQLite** for CRUD and filtering, **HNSW** for vector similarity search, and **FTS5** for full-text keyword search. All embedded -- no external services required. ```python config = MemoryConfig( db_path="memory.db", vector_dimension=384, hnsw_ef_construction=200, hnsw_m=16, hnsw_ef_search=50, cache_enabled=True, cache_max_size=1000, ) ``` Provider Compatibility [#provider-compatibility] Memory works with any OpenAI-compatible client: ```python from openai import OpenAI from headroom import with_memory # OpenAI client = with_memory(OpenAI(), user_id="alice") # Azure OpenAI client = with_memory( OpenAI(base_url="https://your-resource.openai.azure.com/..."), user_id="alice", ) # Groq from groq import Groq client = with_memory(Groq(), user_id="alice") ``` Performance [#performance] | Operation | Latency | Notes | | ----------------- | -------------- | ------------------------------ | | Memory injection | \<50ms | Local embeddings + HNSW search | | Memory extraction | +50-100 tokens | Part of LLM response (inline) | | Memory storage | \<10ms | SQLite + HNSW + FTS5 indexing | | Cache hit | \<1ms | LRU cache lookup | # Metrics & Monitoring (/docs/metrics) Headroom provides comprehensive metrics for monitoring compression performance, cost savings, and system health through both the proxy server and the SDK. Proxy Endpoints [#proxy-endpoints] Stats Endpoint [#stats-endpoint] ```bash curl http://localhost:8787/stats ``` ```json { "persistent_savings": { "lifetime": { "tokens_saved": 12500, "compression_savings_usd": 0.04 } }, "requests": { "total": 42, "cached": 5, "rate_limited": 0, "failed": 0 }, "tokens": { "input": 50000, "output": 8000, "saved": 12500, "savings_percent": 25.0 }, "cost": { "total_cost_usd": 0.15, "total_savings_usd": 0.04 }, "cache": { "entries": 10, "total_hits": 5 } } ``` Persistent savings are stored at `~/.headroom/proxy_savings.json` and survive proxy restarts. Override the path with `HEADROOM_SAVINGS_PATH`. Historical Savings [#historical-savings] ```bash curl http://localhost:8787/stats-history ``` Returns durable compression history with hourly, daily, weekly, and monthly rollups. 
Supports CSV export: ```bash curl "http://localhost:8787/stats-history?format=csv&series=daily" curl "http://localhost:8787/stats-history?format=csv&series=monthly" ``` Prometheus Metrics [#prometheus-metrics] ```bash curl http://localhost:8787/metrics ``` ``` # HELP headroom_requests_total Total requests processed headroom_requests_total{mode="optimize"} 1234 # HELP headroom_tokens_saved_total Total tokens saved headroom_tokens_saved_total 5678900 # HELP headroom_compression_ratio Compression ratio histogram headroom_compression_ratio_bucket{le="0.5"} 890 headroom_compression_ratio_bucket{le="0.7"} 1100 headroom_compression_ratio_bucket{le="0.9"} 1200 # HELP headroom_latency_seconds Request latency histogram headroom_latency_seconds_bucket{le="0.01"} 800 headroom_latency_seconds_bucket{le="0.1"} 1150 # HELP headroom_cache_hits_total Cache hit counter headroom_cache_hits_total 456 ``` Health Check [#health-check] ```bash curl http://localhost:8787/health ``` ```json { "status": "healthy", "version": "0.1.0", "uptime_seconds": 3600, "llmlingua_enabled": false } ``` SDK Metrics [#sdk-metrics] Proxy Stats [#proxy-stats] The TypeScript SDK queries the proxy for stats: ```ts twoslash import { HeadroomClient } from 'headroom-ai'; const client = new HeadroomClient(); // Get proxy stats const stats = await client.proxyStats(); console.log(`Tokens saved: ${stats.tokens.saved}`); console.log(`Savings: ${stats.tokens.savings_percent}%`); ``` Compression Result Metrics [#compression-result-metrics] Every `compress()` call returns metrics: ```ts twoslash import { compress } from 'headroom-ai'; const result = await compress(messages, { model: 'gpt-4o' }); console.log(`Tokens: ${result.tokensBefore} -> ${result.tokensAfter}`); console.log(`Saved: ${result.tokensSaved} (${(result.compressionRatio * 100).toFixed(1)}%)`); console.log(`Transforms: ${result.transformsApplied.join(', ')}`); ``` Session Stats [#session-stats] Quick stats for the current session (no database query): ```python stats = client.get_stats() print(f"Mode: {stats['config']['mode']}") print(f"Tokens saved: {stats['session']['tokens_saved_total']}") print(f"Avg compression: {stats['session']['compression_ratio_avg']:.1%}") ``` Returns: ```python { "session": { "requests_total": 10, "tokens_input_before": 50000, "tokens_input_after": 35000, "tokens_saved_total": 15000, "tokens_output_total": 8000, "cache_hits": 3, "compression_ratio_avg": 0.70, }, "config": { "mode": "optimize", "provider": "openai", "cache_optimizer_enabled": True, "semantic_cache_enabled": False, }, "transforms": { "smart_crusher_enabled": True, "cache_aligner_enabled": True, "rolling_window_enabled": True, }, } ``` Historical Metrics [#historical-metrics] Query stored metrics from the database: ```python from datetime import datetime, timedelta metrics = client.get_metrics( start_time=datetime.utcnow() - timedelta(hours=1), limit=100, ) for m in metrics: print(f"{m.timestamp}: {m.tokens_input_before} -> {m.tokens_input_after}") ``` Summary Statistics [#summary-statistics] Aggregate statistics across all stored metrics: ```python summary = client.get_summary() print(f"Total requests: {summary['total_requests']}") print(f"Total tokens saved: {summary['total_tokens_saved']}") print(f"Average compression: {summary['avg_compression_ratio']:.1%}") print(f"Total cost savings: ${summary['total_cost_saved_usd']:.2f}") ``` Logging [#logging] ```python import logging # INFO level shows compression summaries logging.basicConfig(level=logging.INFO) # DEBUG level shows detailed 
transform decisions logging.basicConfig(level=logging.DEBUG) ``` Example output: ``` INFO:headroom.transforms.pipeline:Pipeline complete: 45000 -> 4500 tokens (saved 40500, 90.0% reduction) INFO:headroom.transforms.smart_crusher:SmartCrusher applied top_n strategy: kept 15 of 1000 items DEBUG:headroom.transforms.smart_crusher:Kept items: [0,1,2,42,77,97,98,99] (errors at 42, warnings at 77) ``` ```bash # Log to file headroom proxy --log-file headroom.jsonl # Increase verbosity headroom proxy --log-level debug ``` Cost Tracking [#cost-tracking] Budget Alerts [#budget-alerts] Set a budget limit in the proxy: ```bash headroom proxy --budget 10.00 ``` When the budget is exceeded, requests return a budget exceeded error, the `/stats` endpoint shows budget status, and logs indicate the budget state. Key Metrics to Monitor [#key-metrics-to-monitor] | Metric | What It Tells You | Target | | ----------------------- | ------------------- | ---------------- | | `tokens_saved_total` | Total cost savings | Higher is better | | `compression_ratio_avg` | Efficiency | 0.7--0.9 typical | | `cache_hit_rate` | Cache effectiveness | >20% is good | | `latency_p99` | Performance impact | \<10ms | | `failed_requests` | Reliability | 0 | Grafana Dashboard [#grafana-dashboard] Example Prometheus queries for a Grafana dashboard: | Panel | PromQL | | -------------------------- | --------------------------------------------------------------------------------------- | | Tokens Saved | `headroom_tokens_saved_total` | | Compression Ratio (median) | `histogram_quantile(0.5, headroom_compression_ratio_bucket)` | | Request Latency (p99) | `histogram_quantile(0.99, headroom_latency_seconds_bucket)` | | Cache Hit Rate | `headroom_cache_hits_total / (headroom_cache_hits_total + headroom_cache_misses_total)` | # OpenAI SDK (/docs/openai-sdk) Headroom wraps the OpenAI Node.js SDK to automatically compress messages before every `chat.completions.create()` call. All other methods (embeddings, images, audio) pass through unchanged. Installation [#installation] ```bash npm install headroom-ai openai ``` The TypeScript SDK sends messages to a local Headroom proxy for compression. Start the proxy before using the SDK: ```bash pip install "headroom-ai[proxy]" headroom proxy ``` Quick start [#quick-start] ```ts twoslash import { withHeadroom } from 'headroom-ai/openai'; import OpenAI from 'openai'; const client = withHeadroom(new OpenAI()); // Messages are compressed automatically before sending const response = await client.chat.completions.create({ model: 'gpt-4o', messages: longConversation, }); ``` That's it. Every call to `client.chat.completions.create()` compresses the messages first. The response format is identical to the unwrapped client. How it works [#how-it-works] `withHeadroom()` returns a proxy around your OpenAI client that intercepts `chat.completions.create()`: 1. Extracts `messages` from the request params 2. Sends them to the Headroom proxy's `/v1/compress` endpoint 3. Replaces the original messages with the compressed result 4. 
Forwards the request to OpenAI as normal All other client methods are untouched: ```ts twoslash import { withHeadroom } from 'headroom-ai/openai'; import OpenAI from 'openai'; const client = withHeadroom(new OpenAI()); // These pass through unchanged const embedding = await client.embeddings.create({ model: 'text-embedding-3-small', input: 'Hello world', }); ``` Options [#options] Pass compression options as the second argument: ```ts twoslash import { withHeadroom } from 'headroom-ai/openai'; import OpenAI from 'openai'; const client = withHeadroom(new OpenAI(), { model: 'gpt-4o', baseUrl: 'http://localhost:8787', }); ``` Streaming [#streaming] Streaming works normally. Compression happens before the request is sent: ```ts twoslash import { withHeadroom } from 'headroom-ai/openai'; import OpenAI from 'openai'; const client = withHeadroom(new OpenAI()); const stream = await client.chat.completions.create({ model: 'gpt-4o', messages: longConversation, stream: true, }); for await (const chunk of stream) { process.stdout.write(chunk.choices[0]?.delta?.content ?? ''); } ``` Tool calling [#tool-calling] Tool call messages and tool results are compressed like any other message content. Large tool outputs (JSON arrays, logs) see the biggest savings: ```ts twoslash import { withHeadroom } from 'headroom-ai/openai'; import OpenAI from 'openai'; const client = withHeadroom(new OpenAI()); const response = await client.chat.completions.create({ model: 'gpt-4o', messages: [ { role: 'user', content: 'Search for recent errors' }, { role: 'assistant', content: null, tool_calls: [{ id: 'call_1', type: 'function', function: { name: 'search', arguments: '{"q":"errors"}' } }], }, { role: 'tool', tool_call_id: 'call_1', content: hugeJsonResult, // Compressed automatically }, ], tools: [{ type: 'function', function: { name: 'search', parameters: {} } }], }); ``` # Proxy Server (/docs/proxy) The Headroom proxy is a standalone HTTP server that compresses all LLM traffic passing through it. Point any client at the proxy and get automatic context optimization. Starting the proxy [#starting-the-proxy] ```bash # Basic usage headroom proxy # Custom host and port headroom proxy --host 0.0.0.0 --port 8080 # With logging and budget headroom proxy \ --log-file /var/log/headroom.jsonl \ --budget 100.0 ``` Telemetry is enabled by default. Opt out with `HEADROOM_TELEMETRY=off` or `--no-telemetry`. 
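Once the proxy is up, clients can also be pointed at it in code rather than via environment variables. A minimal sketch using the OpenAI Python SDK (the `/v1` prefix targets the proxy's OpenAI-compatible endpoint):

```python
from openai import OpenAI

# Route requests through the local Headroom proxy instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8787/v1")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello from behind the proxy"}],
)
print(response.choices[0].message.content)
```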
CLI options [#cli-options] Core [#core] | Option | Default | Description | | ------------------ | ------------------------ | --------------------------------------- | | `--host` | `127.0.0.1` | Host to bind to | | `--port` | `8787` | Port to bind to | | `--no-optimize` | `false` | Disable optimization (passthrough mode) | | `--no-cache` | `false` | Disable semantic caching | | `--no-rate-limit` | `false` | Disable rate limiting | | `--log-file` | None | Path to JSONL log file | | `--budget` | None | Daily budget limit in USD | | `--openai-api-url` | `https://api.openai.com` | Custom OpenAI API URL | Context management [#context-management] | Option | Default | Description | | -------------------------- | ------- | ------------------------------------------------- | | `--no-intelligent-context` | `false` | Fall back to RollingWindow (oldest-first drops) | | `--no-intelligent-scoring` | `false` | Disable multi-factor importance scoring | | `--no-compress-first` | `false` | Disable trying deeper compression before dropping | By default, the proxy uses **IntelligentContextManager** which scores messages by recency, semantic similarity, TOIN-learned patterns, error indicators, and forward references. Dropped messages are stored in CCR for retrieval. ```bash # Use legacy RollingWindow headroom proxy --no-intelligent-context # Faster but less intelligent scoring headroom proxy --no-intelligent-scoring ``` LLMLingua (ML compression) [#llmlingua-ml-compression] | Option | Default | Description | | -------------------- | ------- | ---------------------------------------- | | `--llmlingua` | `false` | Enable LLMLingua-2 ML-based compression | | `--llmlingua-device` | `auto` | Device: `auto`, `cuda`, `cpu`, `mps` | | `--llmlingua-rate` | `0.3` | Target compression rate (0.3 = keep 30%) | ```bash pip install "headroom-ai[llmlingua]" headroom proxy --llmlingua --llmlingua-device cuda headroom proxy --llmlingua --llmlingua-rate 0.2 ``` LLMLingua adds \~2 GB of dependencies (torch, transformers), 10-30s cold start, and \~1 GB RAM. Enable when maximum compression justifies the cost. API endpoints [#api-endpoints] `GET /health` [#get-health] ```bash curl http://localhost:8787/health ``` ```json { "status": "healthy", "optimize": true, "stats": { "total_requests": 42, "tokens_saved": 15000, "savings_percent": 45.2 } } ``` `GET /stats` [#get-stats] Live session statistics plus durable `persistent_savings` totals. Stored at `~/.headroom/proxy_savings.json` (override with `HEADROOM_SAVINGS_PATH`). ```bash curl http://localhost:8787/stats ``` `GET /stats-history` [#get-stats-history] Durable history with hourly, daily, weekly, and monthly rollups. Powers the `/dashboard` view. ```bash curl http://localhost:8787/stats-history curl "http://localhost:8787/stats-history?format=csv&series=weekly" ``` `GET /metrics` [#get-metrics] Prometheus-format metrics for monitoring. ```bash curl http://localhost:8787/metrics ``` ``` headroom_requests_total{mode="optimize"} 1234 headroom_tokens_saved_total 5678900 headroom_compression_ratio_bucket{le="0.5"} 890 headroom_latency_seconds_bucket{le="0.01"} 800 headroom_cache_hits_total 456 ``` `POST /v1/messages` [#post-v1messages] Anthropic API format. The proxy compresses messages, forwards to Anthropic, and returns the response. `POST /v1/chat/completions` [#post-v1chatcompletions] OpenAI API format. The proxy compresses messages, forwards to OpenAI, and returns the response. `POST /v1/compress` [#post-v1compress] Compression-only endpoint. 
Compresses messages without calling any LLM. Used by the TypeScript SDK. **Request:** ```json { "messages": [{ "role": "user", "content": "..." }], "model": "gpt-4o" } ``` **Response:** ```json { "messages": [{ "role": "user", "content": "..." }], "tokens_before": 15000, "tokens_after": 3500, "tokens_saved": 11500, "compression_ratio": 0.23, "transforms_applied": ["router:smart_crusher:0.35"], "ccr_hashes": ["a1b2c3"] } ``` Set `x-headroom-bypass: true` to skip compression. Agent wrapping [#agent-wrapping] Use `headroom wrap` to transparently proxy any CLI tool: ```bash # Claude Code headroom wrap claude # OpenAI Codex headroom wrap codex # Aider headroom wrap aider # Cursor headroom wrap cursor ``` Or set the base URL manually: ```bash # Claude Code ANTHROPIC_BASE_URL=http://localhost:8787 claude # Cursor / any OpenAI-compatible client OPENAI_BASE_URL=http://localhost:8787/v1 cursor ``` Cloud providers [#cloud-providers] ```bash # AWS Bedrock headroom proxy --backend bedrock --region us-east-1 # Google Vertex AI headroom proxy --backend vertex_ai --region us-central1 # Azure OpenAI headroom proxy --backend azure # OpenRouter (400+ models) OPENROUTER_API_KEY=sk-or-... headroom proxy --backend openrouter ``` Environment variables [#environment-variables] ```bash export HEADROOM_HOST=0.0.0.0 export HEADROOM_PORT=8787 export HEADROOM_BUDGET=100.0 export OPENAI_TARGET_API_URL=https://custom.openai.endpoint.com headroom proxy ``` Production deployment [#production-deployment] gunicorn [#gunicorn] ```bash pip install gunicorn gunicorn headroom.proxy.server:app \ --workers 4 \ --bind 0.0.0.0:8787 \ --worker-class uvicorn.workers.UvicornWorker ``` Docker [#docker] ```dockerfile FROM python:3.11-slim RUN apt-get update && apt-get install -y --no-install-recommends build-essential \ && pip install "headroom-ai[proxy]" \ && apt-get purge -y build-essential && apt-get autoremove -y \ && rm -rf /var/lib/apt/lists/* EXPOSE 8787 CMD ["headroom", "proxy", "--host", "0.0.0.0"] ``` `build-essential` is required at install time because `headroom-ai` includes `hnswlib`, a C++ extension compiled from source. It is removed after installation to keep the image slim. # Quickstart (/docs/quickstart) This guide gets you from zero to compressed LLM calls in under 5 minutes. 1\. Install [#1-install] ```bash npm install headroom-ai ``` ```bash pip install "headroom-ai[all]" ``` The TypeScript SDK sends messages to a local Headroom proxy for compression. Start the proxy before using the TS SDK: ```bash pip install "headroom-ai[proxy]" headroom proxy --port 8787 ``` The proxy runs the compression pipeline (Python) and exposes an HTTP API that the TS SDK calls. 2\. Compress messages [#2-compress-messages] ```ts twoslash import { compress } from 'headroom-ai'; const messages = [ { role: 'system' as const, content: 'You analyze search results.' }, { role: 'user' as const, content: 'Search for Python tutorials.' }, { role: 'assistant' as const, content: null, tool_calls: [{ id: 'call_1', type: 'function' as const, function: { name: 'search', arguments: '{"q": "python"}' }, }], }, { role: 'tool' as const, tool_call_id: 'call_1', content: JSON.stringify({ results: Array.from({ length: 500 }, (_, i) => ({ title: `Result ${i}`, snippet: `Description ${i}`, score: 100 - i, })), }), }, { role: 'user' as const, content: 'What are the top 3 results?' 
}, ]; const result = await compress(messages, { model: 'gpt-4o', baseUrl: 'http://localhost:8787', }); ``` ```python from headroom import compress import json messages = [ {"role": "system", "content": "You analyze search results."}, {"role": "user", "content": "Search for Python tutorials."}, { "role": "assistant", "content": None, "tool_calls": [{ "id": "call_1", "type": "function", "function": {"name": "search", "arguments": '{"q": "python"}'}, }], }, { "role": "tool", "tool_call_id": "call_1", "content": json.dumps({ "results": [ {"title": f"Result {i}", "snippet": f"Description {i}", "score": 100 - i} for i in range(500) ] }), }, {"role": "user", "content": "What are the top 3 results?"}, ] result = compress(messages, model="gpt-4o") ``` 3\. Send to your LLM [#3-send-to-your-llm] Use the compressed messages exactly like the originals: ```ts twoslash import OpenAI from 'openai'; const client = new OpenAI(); // result.messages from the previous step const messages: any[] = []; const response = await client.chat.completions.create({ model: 'gpt-4o', messages, }); console.log(response.choices[0].message.content); ``` ```python from openai import OpenAI client = OpenAI() response = client.chat.completions.create( model="gpt-4o", messages=result.messages, ) print(response.choices[0].message.content) ``` 4\. Check your savings [#4-check-your-savings] ```ts twoslash const result = { tokensBefore: 45000, tokensAfter: 4500, tokensSaved: 40500, compressionRatio: 0.9, transformsApplied: ['smart_crusher', 'cache_aligner'], messages: [], ccrHashes: [], compressed: true, }; // ---cut--- console.log(`Tokens before: ${result.tokensBefore}`); console.log(`Tokens after: ${result.tokensAfter}`); console.log(`Tokens saved: ${result.tokensSaved}`); console.log(`Compression: ${(result.compressionRatio * 100).toFixed(0)}%`); console.log(`Transforms: ${result.transformsApplied.join(', ')}`); ``` Example output: ``` Tokens before: 45000 Tokens after: 4500 Tokens saved: 40500 Compression: 90% Transforms: smart_crusher, cache_aligner ``` ```python print(f"Tokens before: {result.tokens_before}") print(f"Tokens after: {result.tokens_after}") print(f"Tokens saved: {result.tokens_saved}") print(f"Compression: {result.compression_ratio:.0%}") print(f"Transforms: {result.transforms_applied}") ``` Example output: ``` Tokens before: 45000 Tokens after: 4500 Tokens saved: 40500 Compression: 90% Transforms: ['smart_crusher', 'cache_aligner'] ``` Alternative: proxy mode (zero code changes) [#alternative-proxy-mode-zero-code-changes] If you do not want to change any code, run Headroom as a proxy and point your existing client at it: ```bash # Start the proxy headroom proxy --port 8787 # Point Claude Code at it ANTHROPIC_BASE_URL=http://localhost:8787 claude # Or any OpenAI-compatible client OPENAI_BASE_URL=http://localhost:8787/v1 your-app ``` All requests flow through Headroom automatically. Check savings at any time: ```bash curl http://localhost:8787/stats # {"requests_total": 42, "tokens_saved_total": 125000, ...} ``` What gets compressed [#what-gets-compressed] The biggest savings come from tool outputs -- search results, database rows, log files, API responses. Headroom auto-detects the content type and routes it to the best compressor. No configuration needed. 
| Content type | Compressor | Typical savings | | --------------- | ---------------- | --------------- | | JSON arrays | SmartCrusher | 70--90% | | Source code | CodeCompressor | 40--70% | | Build/test logs | LogCompressor | 80--95% | | Search results | SearchCompressor | 60--80% | | Plain text | Kompress | 30--50% | Next steps [#next-steps] # SharedContext (/docs/shared-context) When agents hand off to each other, context gets replayed in full. SharedContext compresses what moves between agents using Headroom's compression pipeline, typically saving **\~80% of tokens** on agent handoffs. Quick Start [#quick-start] ```ts twoslash import { SharedContext } from "headroom"; const ctx = new SharedContext(); // Agent A stores large output const entry = await ctx.put("research", bigResearchOutput, { agent: "researcher", }); // Agent B gets compressed version (~80% smaller) const summary = ctx.get("research"); // Agent B needs full details on demand const full = ctx.get("research", { full: true }); ``` ```python from headroom import SharedContext ctx = SharedContext() # Agent A stores large output ctx.put("research", big_research_output, agent="researcher") # Agent B gets compressed version (~80% smaller) summary = ctx.get("research") # Agent B needs full details on demand full = ctx.get("research", full=True) ``` API [#api] `put(key, content, agent?)` [#putkey-content-agent] Store content under a key. Compresses automatically using Headroom's full pipeline (SmartCrusher for JSON, CodeCompressor for code, Kompress for text). ```ts twoslash import { SharedContext } from "headroom"; const ctx = new SharedContext(); // ---cut--- const entry = await ctx.put("findings", bigJsonOutput, { agent: "researcher", }); entry.originalTokens; // 20000 entry.compressedTokens; // 4000 entry.savingsPercent; // 80.0 entry.transforms; // ["router:json:0.20"] ``` ```python entry = ctx.put("findings", big_json_output, agent="researcher") entry.original_tokens # 20,000 entry.compressed_tokens # 4,000 entry.savings_percent # 80.0 entry.transforms # ["router:json:0.20"] ``` `get(key, full?)` [#getkey-full] Retrieve content. Returns the compressed version by default, or the original with `full=True`. ```ts twoslash import { SharedContext } from "headroom"; const ctx = new SharedContext(); // ---cut--- const compressed = ctx.get("findings"); // 4K tokens const original = ctx.get("findings", { full: true }); // 20K tokens const missing = ctx.get("nonexistent"); // null ``` ```python compressed = ctx.get("findings") # 4K tokens original = ctx.get("findings", full=True) # 20K tokens missing = ctx.get("nonexistent") # None ``` `stats()` [#stats] Aggregated statistics across all entries. ```ts twoslash import { SharedContext } from "headroom"; const ctx = new SharedContext(); // ---cut--- const stats = ctx.stats(); stats.entries; // 3 stats.totalOriginalTokens; // 60000 stats.totalCompressedTokens; // 12000 stats.totalTokensSaved; // 48000 stats.savingsPercent; // 80.0 ``` ```python stats = ctx.stats() stats.entries # 3 stats.total_original_tokens # 60000 stats.total_compressed_tokens # 12000 stats.total_tokens_saved # 48000 stats.savings_percent # 80.0 ``` `keys()` and `clear()` [#keys-and-clear] `keys()` lists all non-expired keys. `clear()` removes all entries. 
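A quick sketch of both in the Python API, assuming the `ctx` instance from the examples above:

```python
# List everything currently stored, then wipe the context between pipeline runs
for key in ctx.keys():
    print(key)

ctx.clear()
print(list(ctx.keys()))  # [] -- assuming keys() returns an empty result after clear()
```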
Configuration [#configuration] ```ts twoslash import { SharedContext } from "headroom"; // ---cut--- const ctx = new SharedContext({ model: "claude-sonnet-4-5-20250929", // For token counting ttl: 3600, // 1 hour (default) maxEntries: 100, // Evicts oldest when full }); ``` ```python ctx = SharedContext( model="claude-sonnet-4-5-20250929", # For token counting ttl=3600, # 1 hour (default) max_entries=100, # Evicts oldest when full ) ``` Entries expire after `ttl` seconds. When `maxEntries` is reached, the oldest entry is evicted. Framework Examples [#framework-examples] SharedContext is framework-agnostic. It works anywhere context moves between agents. CrewAI [#crewai] ```python from headroom import SharedContext ctx = SharedContext() # After researcher task completes ctx.put("findings", researcher_task.output.raw) # Coder task gets compressed context coder_context = ctx.get("findings") ``` LangGraph [#langgraph] ```python from headroom import SharedContext ctx = SharedContext() def researcher_node(state): result = do_research() ctx.put("research", result) return {"research_summary": ctx.get("research")} def coder_node(state): # Compressed summary in state, full details on demand full = ctx.get("research", full=True) return {"code": write_code(full)} ``` OpenAI Agents SDK [#openai-agents-sdk] ```python from headroom import SharedContext ctx = SharedContext() def compress_handoff(messages): for msg in messages: if len(msg.content) > 1000: ctx.put(msg.id, msg.content) msg.content = ctx.get(msg.id) return messages handoff(agent=coder, input_filter=compress_handoff) ``` How It Works [#how-it-works] Under the hood, `put()` calls `headroom.compress()` -- the same pipeline used by the Headroom proxy -- and stores the original in memory. `get()` returns the compressed version. `get(full=True)` returns the original. The compression pipeline routes content to the best compressor: * **JSON arrays** -- SmartCrusher (70-95% compression) * **Code** -- CodeCompressor (AST-aware) * **Text** -- Kompress (ModernBERT-based) or passthrough # Simulation (/docs/simulation) Simulation mode lets you preview what Headroom would do to your messages without sending them to an LLM. This is useful for cost estimation, debugging compression behavior, and understanding where token waste comes from. 
Basic Usage [#basic-usage] ```ts twoslash import { compress } from 'headroom-ai'; // compress() returns the same result structure — // use it without sending to your LLM to simulate const result = await compress(messages, { model: 'gpt-4o' }); console.log(`Would save: ${result.tokensSaved} tokens`); console.log(`Compression ratio: ${(result.compressionRatio * 100).toFixed(1)}%`); console.log(`Transforms: ${result.transformsApplied.join(', ')}`); ``` ```python plan = client.chat.completions.simulate( model="gpt-4o", messages=large_conversation, ) print(f"Tokens before: {plan.tokens_before}") print(f"Tokens after: {plan.tokens_after}") print(f"Would save: {plan.tokens_saved} tokens ({plan.savings_percent:.1f}%)") print(f"Transforms: {plan.transforms_applied}") ``` Waste Signals [#waste-signals] Simulation reports where token waste comes from in your messages: ```python plan = client.chat.completions.simulate( model="gpt-4o", messages=messages, ) waste = plan.waste_signals print(f"JSON bloat: {waste.json_bloat_tokens} tokens") print(f"HTML noise: {waste.html_noise_tokens} tokens") print(f"Whitespace: {waste.whitespace_tokens} tokens") print(f"Dynamic dates: {waste.dynamic_date_tokens} tokens") print(f"Repetition: {waste.repetition_tokens} tokens") ``` Waste signals help you understand which parts of your input are contributing the most unnecessary tokens. Block Breakdown [#block-breakdown] The parser breaks your conversation into blocks so you can see where tokens are concentrated: ```python # Block types: system, user, assistant, tool_call, tool_result, rag # The breakdown shows token counts per block type ``` | Block Kind | Description | | ------------- | ------------------------------------- | | `system` | System prompt instructions | | `user` | User messages | | `assistant` | Model responses | | `tool_call` | Function call requests | | `tool_result` | Tool output (largest source of waste) | | `rag` | Retrieved document context | Use Cases [#use-cases] Cost Estimation [#cost-estimation] Run simulation on a representative sample of your workload to estimate savings before enabling `optimize` mode: ```python import json total_before = 0 total_after = 0 for messages in sample_conversations: plan = client.chat.completions.simulate( model="gpt-4o", messages=messages, ) total_before += plan.tokens_before total_after += plan.tokens_after savings_pct = (1 - total_after / total_before) * 100 print(f"Estimated savings: {savings_pct:.1f}%") print(f"Tokens saved: {total_before - total_after:,}") ``` Debugging Compression [#debugging-compression] Use simulation to understand why a particular conversation is or is not being compressed: ```python plan = client.chat.completions.simulate( model="gpt-4o", messages=messages, ) if plan.tokens_saved == 0: print("No compression applied. 
Possible reasons:") print("- Messages are too short (< 200 tokens per tool output)") print("- No tool outputs with compressible JSON arrays") print("- Content is already compact (code, grep results)") else: print(f"Transforms applied: {plan.transforms_applied}") # See the optimized messages print(json.dumps(plan.messages_optimized, indent=2)) ``` Comparing Configurations [#comparing-configurations] Test different configurations to find the best settings for your workload: ```python from headroom import HeadroomClient, OpenAIProvider from headroom.transforms import SmartCrusherConfig configs = [ SmartCrusherConfig(max_items_after_crush=10), SmartCrusherConfig(max_items_after_crush=25), SmartCrusherConfig(max_items_after_crush=50), ] for config in configs: client = HeadroomClient( original_client=OpenAI(), provider=OpenAIProvider(), smart_crusher_config=config, ) plan = client.chat.completions.simulate(model="gpt-4o", messages=messages) print(f"max_items={config.max_items_after_crush}: " f"{plan.tokens_saved} tokens saved ({plan.savings_percent:.1f}%)") ``` Simulation never calls the LLM API. It runs the full transform pipeline locally and returns the results, so there is no cost and no latency from the provider. # SmartCrusher (/docs/smart-crusher) SmartCrusher is Headroom's compressor for JSON tool outputs. It analyzes arrays statistically, keeps the important items (errors, anomalies, relevant matches), and drops the rest. This is the compressor that fires automatically when ContentRouter detects JSON arrays. How It Works [#how-it-works] SmartCrusher doesn't blindly truncate arrays. It scores each item across five dimensions: 1. **First/Last items** -- Context for pagination and recency 2. **Error items** -- 100% preservation of error states (never dropped) 3. **Anomalies** -- Statistical outliers (> 2 standard deviations from the mean) 4. **Relevant items** -- Matches to the user's query via BM25/embeddings 5. **Change points** -- Significant transitions in data The result: a 1,000-item array becomes \~50 items with all the information the LLM actually needs. What Gets Preserved [#what-gets-preserved] | Category | Preserved | Why | | --------- | --------- | -------------------------- | | Errors | 100% | Critical for debugging | | First N | 100% | Context and pagination | | Last N | 100% | Recency | | Anomalies | All | Unusual values matter | | Relevant | Top K | Match user's query | | Others | Sampled | Statistical representation | Quick Start [#quick-start] ```ts twoslash import { compress } from "headroom-ai"; // SmartCrusher fires automatically for JSON tool outputs const messages = [ { role: "system" as const, content: "You are a helpful assistant." 
}, { role: "user" as const, content: "Find errors in the last 24 hours" }, { role: "tool" as const, content: JSON.stringify({ results: new Array(1000).fill({ status: "ok" }) }), tool_call_id: "call_1", }, ]; const result = await compress(messages); console.log(`Tokens saved: ${result.tokensSaved}`); // SmartCrusher keeps errors, anomalies, and relevant items ``` ```python from headroom import SmartCrusher crusher = SmartCrusher() # Before: 1000 search results (45,000 tokens) tool_output = {"results": ["...1000 items..."]} # After: ~50 important items (4,500 tokens) -- 90% reduction compressed = crusher.crush(tool_output, query="user's question") ``` Configuration [#configuration] ```ts twoslash import { compress } from "headroom-ai"; // Configure via the Headroom proxy or HeadroomClient const result = await compress(messages, { model: "gpt-4o", tokenBudget: 10000, // SmartCrusher will reduce JSON to fit }); console.log(`Transforms: ${result.transformsApplied}`); // ["smart_crusher", "cache_aligner"] ``` ```python from headroom import SmartCrusher, SmartCrusherConfig config = SmartCrusherConfig( min_tokens_to_crush=200, # Only compress if > 200 tokens max_items_after_crush=50, # Keep at most 50 items keep_first=3, # Always keep first 3 items keep_last=2, # Always keep last 2 items relevance_threshold=0.3, # Keep items with relevance > 0.3 anomaly_std_threshold=2.0, # Keep items > 2 std dev from mean preserve_errors=True, # Always keep error items ) crusher = SmartCrusher(config) compressed = crusher.crush(tool_output, query="find payment failures") ``` Configuration Options [#configuration-options] | Option | Default | Description | | ----------------------- | ------- | ---------------------------------------------------- | | `min_tokens_to_crush` | `200` | Only compress arrays with more than this many tokens | | `max_items_after_crush` | `50` | Maximum items to keep after compression | | `keep_first` | `3` | Always keep the first N items | | `keep_last` | `2` | Always keep the last N items | | `relevance_threshold` | `0.3` | Minimum relevance score to keep an item | | `anomaly_std_threshold` | `2.0` | Standard deviation threshold for anomaly detection | | `preserve_errors` | `True` | Always keep items containing error states | Example: Before and After [#example-before-and-after] Consider a tool that returns 1,000 search results: ```python # Before compression: 45,000 tokens { "results": [ {"id": 1, "status": "ok", "message": "Success", "timestamp": "..."}, {"id": 2, "status": "ok", "message": "Success", "timestamp": "..."}, # ... 995 more "ok" results ... {"id": 998, "status": "error", "message": "Connection timeout", "timestamp": "..."}, {"id": 999, "status": "ok", "message": "Success", "timestamp": "..."}, {"id": 1000, "status": "ok", "message": "Success", "timestamp": "..."}, ] } # After SmartCrusher: 4,500 tokens (90% reduction) # Kept: first 3, last 2, the error at id=998, statistical sample ``` The LLM sees the structure, the error, and a representative sample -- everything it needs to answer "find errors in the last 24 hours" without wading through 1,000 identical success responses. You don't need to call SmartCrusher directly. The ContentRouter detects JSON arrays and routes them to SmartCrusher automatically. Direct usage is available when you want fine-grained control over the configuration. 
# Strands (/docs/strands) Headroom integrates with [Strands Agents](https://github.com/strands-agents/sdk-python) through two patterns: wrap the model for full conversation compression, or hook into tool calls for targeted tool output compression. Installation [#installation] ```bash pip install headroom-ai strands-agents ``` Quick start [#quick-start] ```python from strands import Agent from strands.models.bedrock import BedrockModel from headroom.integrations.strands import HeadroomStrandsModel model = BedrockModel(model_id="us.anthropic.claude-sonnet-4-20250514-v1:0") optimized = HeadroomStrandsModel(wrapped_model=model) agent = Agent(model=optimized) response = agent("Investigate the production incident") print(f"Tokens saved: {optimized.total_tokens_saved}") ``` Model wrapping [#model-wrapping] Wraps the Strands `Model` interface. Every call to `stream()` compresses messages before they reach the provider: ```python from headroom import HeadroomConfig from headroom.integrations.strands import HeadroomStrandsModel optimized = HeadroomStrandsModel( wrapped_model=model, config=HeadroomConfig(), ) agent = Agent(model=optimized) response = agent("Analyze these logs") ``` Hook provider (tool output compression) [#hook-provider-tool-output-compression] Compresses tool call results via Strands' hook system. Uses SmartCrusher on JSON arrays returned by tools: ```python from strands import Agent from strands.models.bedrock import BedrockModel from headroom.integrations.strands import HeadroomHookProvider model = BedrockModel(model_id="us.anthropic.claude-sonnet-4-20250514-v1:0") hooks = HeadroomHookProvider( compress_tool_outputs=True, min_tokens_to_compress=200, preserve_errors=True, ) agent = Agent(model=model, hooks=[hooks]) response = agent("Search the database for recent failures") print(f"Tokens saved by hooks: {hooks.total_tokens_saved}") ``` The hook preserves error items, anomalous values (statistical outliers), items matching the query context, and boundary items (first/last). Both together [#both-together] Model wrapping compresses conversation history. Hooks compress individual tool results. Use both for maximum savings: ```python from headroom.integrations.strands import HeadroomStrandsModel, HeadroomHookProvider optimized = HeadroomStrandsModel(wrapped_model=model) hooks = HeadroomHookProvider(compress_tool_outputs=True) agent = Agent(model=optimized, hooks=[hooks]) ``` How it works [#how-it-works] ``` Agent decides to call tool | v Tool executes, returns result | v HeadroomHookProvider (optional) compresses tool result JSON | v Agent builds next API request | v HeadroomStrandsModel.stream() compresses full message list | v Provider API (Bedrock, etc.) ``` The model wrapper uses the full Headroom pipeline (CacheAligner, ContentRouter, IntelligentContext). The hook provider uses SmartCrusher directly for fast JSON compression. 
Structured output [#structured-output] ```python from pydantic import BaseModel class Analysis(BaseModel): severity: str root_cause: str recommendation: str result = optimized.structured_output(Analysis, messages) ``` Metrics [#metrics] ```python for m in optimized.metrics_history: print(f" {m.tokens_before} -> {m.tokens_after} ({m.tokens_saved} saved)") print(f"Total saved: {optimized.total_tokens_saved}") ``` Supported providers [#supported-providers] | Strands Model | Provider Detected | | -------------- | ------------------------ | | `BedrockModel` | Anthropic (via Bedrock) | | `OllamaModel` | OpenAI-compatible | | Custom `Model` | Falls back to estimation | # Text & Log Compression (/docs/text-and-logs) Headroom provides specialized compressors for text-based content that isn't JSON or source code. Each one understands the structure of its content type and preserves what the LLM needs while dropping the noise. | Compressor | Input Type | What It Preserves | Typical Savings | | --------------------- | -------------------------- | -------------------------------- | --------------- | | `SearchCompressor` | grep/ripgrep output | Relevant matches, file diversity | 80-95% | | `LogCompressor` | Build/test logs | Errors, stack traces, summaries | 85-95% | | `DiffCompressor` | Unified diffs | Changed lines, context | 60-80% | | `TextCompressor` | General text | Relevant paragraphs, anchors | 60-80% | | `LLMLinguaCompressor` | Any text (max compression) | Semantic meaning via ML | 80-95% | SearchCompressor [#searchcompressor] Compresses search results (grep, ripgrep, ag) while keeping the matches that matter. ```python from headroom.transforms import SearchCompressor search_results = """ src/utils.py:42:def process_data(items): src/utils.py:43: \"\"\"Process items.\"\"\" src/models.py:15:class DataProcessor: src/models.py:89: def process(self, items): ... hundreds more matches ... """ compressor = SearchCompressor() result = compressor.compress(search_results, context="find process") print(f"Compressed {result.original_match_count} matches to {result.compressed_match_count}") print(result.compressed) ``` **What gets preserved:** * Exact query matches (lines containing the search term) * High-relevance matches (scored by BM25 similarity) * File diversity (results from different files are kept) * First/last matches (context from start and end) Configuration [#configuration] ```python from headroom.transforms import SearchCompressor, SearchCompressorConfig config = SearchCompressorConfig( max_results=50, # Keep up to 50 matches preserve_file_diversity=True, # Ensure different files represented relevance_threshold=0.3, # Minimum relevance score to keep ) compressor = SearchCompressor(config) ``` LogCompressor [#logcompressor] Compresses build and test output while preserving errors, warnings, and summaries. ```python from headroom.transforms import LogCompressor build_output = """ ===== test session starts ===== collected 500 items tests/test_foo.py::test_1 PASSED ... hundreds of passed tests ... 
tests/test_bar.py::test_fail FAILED AssertionError: expected 5, got 3 ===== 1 failed, 499 passed ===== """ compressor = LogCompressor() result = compressor.compress(build_output) print(result.compressed) print(f"Compression ratio: {result.compression_ratio:.1%}") ``` **What gets preserved:** * Errors and failures (any line with ERROR, FAILED, Exception) * Warnings * Full stack traces for debugging * Test/build summary lines * Section headers (structural markers like `=====`) **What gets dropped:** * Hundreds of `PASSED` lines * Verbose success output * Repeated patterns DiffCompressor [#diffcompressor] Compresses unified diffs while keeping the actual changes and enough context to understand them. ```python from headroom.transforms import DiffCompressor diff_output = """ diff --git a/src/main.py b/src/main.py --- a/src/main.py +++ b/src/main.py @@ -42,7 +42,7 @@ def process(items): - return [x for x in items] + return [x.strip() for x in items if x] """ compressor = DiffCompressor() result = compressor.compress(diff_output) ``` TextCompressor [#textcompressor] General-purpose text compression with anchor preservation. Best for documentation, README files, and prose content. ```python from headroom.transforms import TextCompressor long_text = """ ... thousands of lines of documentation ... """ compressor = TextCompressor() result = compressor.compress(long_text, context="authentication") print(result.compressed) ``` **What gets preserved:** * Paragraphs relevant to the context query * Headers and section markers * Document structure and organization LLMLingua (Optional, Maximum Compression) [#llmlingua-optional-maximum-compression] For maximum compression on any text, Headroom integrates with Microsoft's LLMLingua-2, a BERT-based token classifier trained via GPT-4 distillation. It achieves up to 20x compression while preserving semantic meaning. ```python from headroom.transforms import LLMLinguaCompressor, LLMLinguaConfig config = LLMLinguaConfig( device="auto", # auto, cuda, cpu, mps code_compression_rate=0.4, # Conservative for code json_compression_rate=0.35, # Moderate for JSON text_compression_rate=0.25, # Aggressive for text ) compressor = LLMLinguaCompressor(config) result = compressor.compress(long_output) print(f"Before: {result.original_tokens} tokens") print(f"After: {result.compressed_tokens} tokens") print(f"Saved: {result.savings_percentage:.1f}%") ``` LLMLingua adds \~2GB of model weights and 50-200ms latency per request. 
Install it only when you need maximum compression: `pip install "headroom-ai[llmlingua]"` Memory Management [#memory-management] ```python from headroom.transforms import unload_llmlingua_model, is_llmlingua_model_loaded # Check if model is loaded print(is_llmlingua_model_loaded()) # True # Free ~1GB RAM when done unload_llmlingua_model() ``` Content Type Detection [#content-type-detection] If you're building your own routing logic, you can use the content type detector directly: ```python from headroom.transforms import detect_content_type, ContentType content = "src/main.py:42:def process():" detection = detect_content_type(content) if detection.content_type == ContentType.SEARCH_RESULTS: result = SearchCompressor().compress(content, context="process") elif detection.content_type == ContentType.BUILD_OUTPUT: result = LogCompressor().compress(content) elif detection.content_type == ContentType.PLAIN_TEXT: result = TextCompressor().compress(content, context="process") ``` When Each Compressor Is Used [#when-each-compressor-is-used] The ContentRouter selects the right compressor automatically. Here's when each fires: | Content Pattern | Compressor | Detection Signal | | -------------------------- | ------------------- | -------------------------------- | | `file:line:content` lines | SearchCompressor | grep/ripgrep output format | | pytest, npm, cargo markers | LogCompressor | Build tool output patterns | | `---/+++` and `@@` markers | DiffCompressor | Unified diff format | | Prose, documentation | TextCompressor | Fallback for non-structured text | | Any (max compression mode) | LLMLinguaCompressor | Explicitly enabled | Performance [#performance] | Compressor | Typical Input | Output | Speed | | ------------------- | ------------- | ------------------ | -------- | | SearchCompressor | 1,000 matches | 30-50 matches | \~2ms | | LogCompressor | 5,000 lines | 100-200 lines | \~3ms | | DiffCompressor | Large diff | Changed hunks only | \~2ms | | TextCompressor | 10,000 chars | 2,000 chars | \~2ms | | LLMLinguaCompressor | Any text | 5-20% of original | 50-200ms | # Troubleshooting (/docs/troubleshooting) Solutions for common Headroom issues. Proxy Server Issues [#proxy-server-issues] Proxy will not start [#proxy-will-not-start] **Symptom**: `headroom proxy` fails or hangs. ```bash # Check if port is already in use lsof -i :8787 # Try a different port headroom proxy --port 8788 # Check for missing dependencies pip install "headroom-ai[proxy]" # Run with debug logging headroom proxy --log-level debug ``` Connection refused when calling proxy [#connection-refused-when-calling-proxy] **Symptom**: `curl: (7) Failed to connect to localhost port 8787` ```bash # Verify proxy is running curl http://localhost:8787/health # Check if proxy started on a different port ps aux | grep headroom ``` Proxy returns errors for some requests [#proxy-returns-errors-for-some-requests] **Symptom**: Some requests work, others fail with 502/503. ```bash # Check proxy logs for the actual error headroom proxy --log-level debug # Verify API key is set echo $OPENAI_API_KEY # or ANTHROPIC_API_KEY # Test the underlying API directly curl https://api.openai.com/v1/models \ -H "Authorization: Bearer $OPENAI_API_KEY" ``` No Token Savings [#no-token-savings] **Symptom**: `stats['session']['tokens_saved_total']` is 0. 
**Diagnosis**: ```python stats = client.get_stats() print(f"Mode: {stats['config']['mode']}") # Should be "optimize" print(f"SmartCrusher: {stats['transforms']['smart_crusher_enabled']}") ``` **Common causes**: * Mode is `audit` (observation only, no modifications) * Messages do not contain tool outputs * Tool outputs are below the 200-token threshold * Data is not compressible (high uniqueness, code, grep results) **Solutions**: ```ts twoslash import { compress } from 'headroom-ai'; // Ensure the proxy is running in optimize mode // (default, unless --no-optimize was passed) const result = await compress(messages, { model: 'gpt-4o' }); console.log(`Saved: ${result.tokensSaved} tokens`); console.log(`Compressed: ${result.compressed}`); ``` ```python # 1. Ensure mode is "optimize" client = HeadroomClient( original_client=OpenAI(), provider=OpenAIProvider(), default_mode="optimize", # NOT "audit" ) # 2. Or override per-request response = client.chat.completions.create( model="gpt-4o", messages=messages, headroom_mode="optimize", ) # 3. Lower the compression threshold config = HeadroomConfig() config.smart_crusher.min_tokens_to_crush = 100 # Default is 200 ``` Compression Too Aggressive [#compression-too-aggressive] **Symptom**: LLM responses are missing information that was in tool outputs. ```python # 1. Keep more items config = HeadroomConfig() config.smart_crusher.max_items_after_crush = 50 # Default: 15 # 2. Skip compression for specific tools response = client.chat.completions.create( model="gpt-4o", messages=messages, headroom_tool_profiles={ "important_tool": {"skip_compression": True}, }, ) # 3. Disable SmartCrusher entirely config.smart_crusher.enabled = False ``` High Latency [#high-latency] **Symptom**: Requests take longer than expected. **Diagnosis**: ```python import time import logging logging.basicConfig(level=logging.DEBUG) start = time.time() response = client.chat.completions.create(...) print(f"Total time: {time.time() - start:.2f}s") ``` **Solutions**: ```python # 1. Use BM25 instead of embeddings (faster) config = HeadroomConfig() config.smart_crusher.relevance.tier = "bm25" # 2. Increase threshold to skip small payloads config.smart_crusher.min_tokens_to_crush = 500 # 3. 
Disable transforms you don't need config.cache_aligner.enabled = False config.rolling_window.enabled = False ``` Installation Issues [#installation-issues] pip install fails with C++ compilation error [#pip-install-fails-with-c-compilation-error] **Symptom**: `RuntimeError: Unsupported compiler -- at least C++11 support is needed!` ```bash # Linux / Debian-based (including Docker) apt-get install -y build-essential && pip install headroom-ai # macOS (Xcode command line tools) xcode-select --install && pip install headroom-ai ``` For Docker, install and remove build tools in one layer: ```dockerfile FROM python:3.11-slim RUN apt-get update && apt-get install -y --no-install-recommends build-essential \ && pip install "headroom-ai[proxy]" \ && apt-get purge -y build-essential && apt-get autoremove -y \ && rm -rf /var/lib/apt/lists/* ``` ModuleNotFoundError: No module named 'headroom' [#modulenotfounderror-no-module-named-headroom] ```bash # Check it is installed in the right environment pip show headroom-ai # If using virtual environment, ensure it is activated source venv/bin/activate # Reinstall pip install --upgrade headroom-ai ``` Missing optional dependency [#missing-optional-dependency] ```bash # For proxy server pip install "headroom-ai[proxy]" # For embedding-based relevance scoring pip install "headroom-ai[relevance]" # For code compression (tree-sitter) pip install "headroom-ai[code]" # For everything pip install "headroom-ai[all]" ``` Provider-Specific Issues [#provider-specific-issues] OpenAI: Invalid API key [#openai-invalid-api-key] ```python import os from openai import OpenAI api_key = os.environ.get("OPENAI_API_KEY") if not api_key: raise ValueError("OPENAI_API_KEY not set") client = HeadroomClient( original_client=OpenAI(api_key=api_key), provider=OpenAIProvider(), ) ``` Anthropic: Authentication error [#anthropic-authentication-error] ```python import os from anthropic import Anthropic api_key = os.environ.get("ANTHROPIC_API_KEY") client = HeadroomClient( original_client=Anthropic(api_key=api_key), provider=AnthropicProvider(), ) ``` Unknown model warnings [#unknown-model-warnings] ```python # For custom/fine-tuned models, specify context limit client = HeadroomClient( original_client=OpenAI(), provider=OpenAIProvider(), model_context_limits={ "ft:gpt-4o-2024-08-06:my-org::abc123": 128000, "my-custom-model": 32000, }, ) ``` ValidationError on Setup [#validationerror-on-setup] ```python result = client.validate_setup() print(result) # Common issues: # {"provider": {"ok": False, "error": "No API key"}} # -> Set OPENAI_API_KEY or pass api_key to OpenAI() # # {"storage": {"ok": False, "error": "unable to open database"}} # -> Check path permissions, use :memory: for testing # # {"config": {"ok": False, "error": "Invalid mode"}} # -> Use "audit" or "optimize" only ``` For testing, use in-memory storage: ```python client = HeadroomClient( original_client=OpenAI(), provider=OpenAIProvider(), store_url="sqlite:///:memory:", ) ``` Debugging Techniques [#debugging-techniques] Enable Full Logging [#enable-full-logging] ```python import logging # See everything logging.basicConfig( level=logging.DEBUG, format="%(asctime)s %(name)s %(levelname)s %(message)s", ) # Or just Headroom logs logging.getLogger("headroom").setLevel(logging.DEBUG) ``` Use Simulation to Inspect Transforms [#use-simulation-to-inspect-transforms] ```python plan = client.chat.completions.simulate( model="gpt-4o", messages=messages, ) print(f"Tokens: {plan.tokens_before} -> {plan.tokens_after}") print(f"Transforms: 
Test Transforms Directly [#test-transforms-directly]

```python
from headroom import SmartCrusher, Tokenizer
from headroom.config import SmartCrusherConfig
import json

config = SmartCrusherConfig()
crusher = SmartCrusher(config)
tokenizer = Tokenizer()

messages = [
    {
        "role": "tool",
        "content": json.dumps({"items": list(range(100))}),
        "tool_call_id": "1",
    }
]

result = crusher.apply(messages, tokenizer)
print(f"Tokens: {result.tokens_before} -> {result.tokens_after}")
```

Getting Help [#getting-help]

1. Enable debug logging and check the output
2. Use `simulate()` to see what transforms would apply
3. Run `validate_setup()` for configuration issues
4. File an issue at [github.com/headroom-sdk/headroom](https://github.com/headroom-sdk/headroom/issues) with your Headroom version, Python version, provider, debug log output, and minimal reproduction code

# Vercel AI SDK (/docs/vercel-ai-sdk)

Headroom integrates with the [Vercel AI SDK](https://sdk.vercel.ai) through three patterns: a one-liner wrapper, composable middleware, and standalone message compression.

Installation [#installation]

```bash
npm install headroom-ai ai @ai-sdk/openai
```

The TypeScript SDK sends messages to a local Headroom proxy for compression. Start the proxy before using the SDK:

```bash
pip install "headroom-ai[proxy]"
headroom proxy
```

withHeadroom() one-liner [#withheadroom-one-liner]

The simplest integration: it wraps any Vercel AI SDK language model with automatic compression:

```ts twoslash
import { withHeadroom } from 'headroom-ai/vercel-ai';
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';

const model = withHeadroom(openai('gpt-4o'));

const { text } = await generateText({
  model,
  messages: [
    { role: 'user', content: 'Summarize these results...' },
  ],
});
```

`withHeadroom()` calls `wrapLanguageModel` + `headroomMiddleware()` under the hood. It works with any provider (`@ai-sdk/openai`, `@ai-sdk/anthropic`, `@ai-sdk/google`, etc.).

headroomMiddleware() for composition [#headroommiddleware-for-composition]

Use the middleware directly when you need to compose it with other middleware:

```ts twoslash
// @noErrors
import { headroomMiddleware } from 'headroom-ai/vercel-ai';
import { wrapLanguageModel } from 'ai';
import { openai } from '@ai-sdk/openai';

const model = wrapLanguageModel({
  model: openai('gpt-4o'),
  middleware: headroomMiddleware(),
});
```

Pass options to control compression behavior:

```ts twoslash
import { headroomMiddleware } from 'headroom-ai/vercel-ai';

const middleware = headroomMiddleware({
  model: 'gpt-4o',
  baseUrl: 'http://localhost:8787',
});
```

compressVercelMessages() standalone [#compressvercelmessages-standalone]

Compress Vercel-format messages directly without wrapping a model. Useful for custom pipelines:

```ts twoslash
import { compressVercelMessages } from 'headroom-ai/vercel-ai';

const result = await compressVercelMessages(messages, {
  model: 'gpt-4o',
});

console.log(`Saved ${result.tokensSaved} tokens`);
// result.messages is in Vercel format, ready for the AI SDK
```
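For example, a custom pipeline can compress once and hand the smaller messages straight to the AI SDK without wrapping the model. This sketch assumes the proxy is running and that `longConversation` is a placeholder for a long Vercel-format message array:

```ts twoslash
// @noErrors
import { compressVercelMessages } from 'headroom-ai/vercel-ai';
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

// Compress once, then call the unwrapped model with the smaller messages.
const result = await compressVercelMessages(longConversation, { model: 'gpt-4o' });
console.log(`Saved ${result.tokensSaved} tokens`);

const { text } = await generateText({
  model: openai('gpt-4o'), // no wrapper needed; messages are already compressed
  messages: result.messages,
});
```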
Streaming with streamText [#streaming-with-streamtext]

Compression happens before the request. Streaming responses are unaffected:

```ts twoslash
import { withHeadroom } from 'headroom-ai/vercel-ai';
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';

const model = withHeadroom(openai('gpt-4o'));

const result = streamText({
  model,
  messages: longConversation,
});

for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}
```

generateObject with compressed context [#generateobject-with-compressed-context]

Structured output works as well; this example uses `generateText` with `Output.object`:

```ts twoslash
// @noErrors
import { withHeadroom } from 'headroom-ai/vercel-ai';
import { openai } from '@ai-sdk/openai';
import { generateText, Output } from 'ai';
import { z } from 'zod';

const model = withHeadroom(openai('gpt-4o'));

const { output } = await generateText({
  model,
  output: Output.object({
    schema: z.object({
      summary: z.string(),
      severity: z.enum(['low', 'medium', 'high']),
    }),
  }),
  messages: largeConversationHistory,
});
```

How it works [#how-it-works]

1. Messages are converted from Vercel format to OpenAI format
2. Headroom compresses them via the proxy's `/v1/compress` endpoint
3. Compressed messages are converted back to Vercel format
4. The original model receives the smaller prompt

All other model behavior (tool calling, structured output, streaming) is unchanged.
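Because `withHeadroom()` is just `wrapLanguageModel` + `headroomMiddleware()`, Headroom can also be stacked with your own middleware. The sketch below assumes your AI SDK version accepts an array of middleware in `wrapLanguageModel`; the timing middleware is purely illustrative:

```ts twoslash
// @noErrors
import { headroomMiddleware } from 'headroom-ai/vercel-ai';
import { wrapLanguageModel } from 'ai';
import { openai } from '@ai-sdk/openai';

// Hypothetical timing middleware, for illustration only.
const timingMiddleware = {
  wrapGenerate: async ({ doGenerate }) => {
    const start = Date.now();
    const result = await doGenerate();
    console.log(`generate took ${Date.now() - start}ms`);
    return result;
  },
};

// Headroom compression composed with the custom middleware.
const model = wrapLanguageModel({
  model: openai('gpt-4o'),
  middleware: [headroomMiddleware(), timingMiddleware],
});
```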