# Agno (/docs/agno)
Headroom integrates with [Agno](https://github.com/agno-agi/agno) (formerly Phidata) to compress context for AI agents. Wrap any Agno model for automatic optimization, and use hooks for observability.
Installation [#installation]
```bash
pip install "headroom-ai[agno]" agno
```
Quick start [#quick-start]
```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
agent = Agent(model=model)
response = agent.run("What's the capital of France?")
print(f"Tokens saved: {model.total_tokens_saved}")
print(model.get_savings_summary())
# {'total_requests': 1, 'total_tokens_saved': 245, 'average_savings_percent': 12.3}
```
Works with any Agno provider:
```python
from agno.models.anthropic import Claude
from agno.models.google import Gemini
claude_model = HeadroomAgnoModel(Claude(id="claude-sonnet-4-20250514"))
gemini_model = HeadroomAgnoModel(Gemini(id="gemini-2.0-flash"))
```
Observability hooks [#observability-hooks]
Use hooks for detailed tracking without modifying your model:
```python
from headroom.integrations.agno import (
    HeadroomAgnoModel,
    HeadroomPreHook,
    HeadroomPostHook,
)

model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
pre_hook = HeadroomPreHook()
post_hook = HeadroomPostHook(token_alert_threshold=10000)

agent = Agent(
    model=model,
    pre_hooks=[pre_hook],
    post_hooks=[post_hook],
)

response = agent.run("Analyze this large dataset...")

# Check for alerts
if post_hook.alerts:
    print(f"{len(post_hook.alerts)} requests exceeded threshold")
```
Or use the convenience factory:
```python
from headroom.integrations.agno import create_headroom_hooks
pre_hook, post_hook = create_headroom_hooks(
    token_alert_threshold=5000,
    log_level="DEBUG",
)
```
Tool-heavy agents [#tool-heavy-agents]
Tool outputs (JSON, logs, search results) see the biggest compression gains, typically 70-90% reduction:
```python
from agno.tools.duckduckgo import DuckDuckGoTools
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
agent = Agent(
    model=model,
    tools=[DuckDuckGoTools()],
    show_tool_calls=True,
)
response = agent.run("Research the latest AI developments")
print(f"Tokens saved: {model.total_tokens_saved}")
```
Async support [#async-support]
```python
import asyncio
async def process():
    model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))

    response = await model.aresponse(messages)

    async for chunk in model.aresponse_stream(messages):
        print(chunk, end="", flush=True)

asyncio.run(process())
```
Standalone message optimization [#standalone-message-optimization]
Optimize messages without wrapping a model:
```python
from headroom.integrations.agno import optimize_messages
optimized, metrics = optimize_messages(messages, model="gpt-4o")
print(f"Tokens saved: {metrics['tokens_saved']}")
```
Session management [#session-management]
Reset metrics between sessions:
```python
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
# Session 1
agent.run("First conversation...")
print(model.get_savings_summary())
# Reset for new session
model.reset()
# Session 2 starts fresh
agent.run("Second conversation...")
```
Supported providers [#supported-providers]
| Provider | Agno Model | Auto-Detected |
| --------- | -------------------------- | ------------- |
| OpenAI | `OpenAIChat`, `OpenAILike` | Yes |
| Anthropic | `Claude`, `AwsBedrock` | Yes |
| Google | `Gemini`, `VertexAI` | Yes |
| Groq | `Groq` | Yes |
| Mistral | `Mistral` | Yes |
| Ollama | `Ollama` | Yes |
# Anthropic SDK (/docs/anthropic-sdk)
Headroom wraps the Anthropic TypeScript SDK to automatically compress messages before every `messages.create()` call. All other methods pass through unchanged.
Installation [#installation]
```bash
npm install headroom-ai @anthropic-ai/sdk
```
The TypeScript SDK sends messages to a local Headroom proxy for compression. Start the proxy before using the SDK:
```bash
pip install "headroom-ai[proxy]"
headroom proxy
```
Quick start [#quick-start]
```ts twoslash
import { withHeadroom } from 'headroom-ai/anthropic';
import Anthropic from '@anthropic-ai/sdk';
const client = withHeadroom(new Anthropic());
const response = await client.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  messages: longConversation,
  max_tokens: 1024,
});
```
Every call to `client.messages.create()` compresses messages first. The response format is identical to the unwrapped client.
How it works [#how-it-works]
`withHeadroom()` returns a proxy around your Anthropic client that intercepts `messages.create()`:
1. Converts Anthropic-format messages to OpenAI format (the compression engine's native format)
2. Sends them to the Headroom proxy's `/v1/compress` endpoint
3. Converts the compressed messages back to Anthropic format
4. Forwards the request to Anthropic as normal
Message format conversion [#message-format-conversion]
The adapter handles the full Anthropic message format including content blocks:
| Anthropic format | OpenAI format |
| ----------------------------------------------- | --------------------------------------------------------- |
| `{ type: "text", text: "..." }` | `{ role: "user", content: "..." }` |
| `{ type: "tool_use", id, name, input }` | `{ tool_calls: [{ id, function: { name, arguments } }] }` |
| `{ type: "tool_result", tool_use_id, content }` | `{ role: "tool", tool_call_id, content }` |
This conversion is lossless. Your request and response behave identically to an unwrapped client.
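As an illustration of the tool-result row above, sketched in Python (a hypothetical helper, not the adapter's actual code):

```python
def tool_result_to_openai(block: dict) -> dict:
    """Convert an Anthropic tool_result content block to an OpenAI tool message."""
    return {
        "role": "tool",
        "tool_call_id": block["tool_use_id"],
        "content": block["content"],
    }

msg = tool_result_to_openai(
    {"type": "tool_result", "tool_use_id": "toolu_1", "content": "42 log lines"}
)
# msg == {"role": "tool", "tool_call_id": "toolu_1", "content": "42 log lines"}
```

The inverse mapping restores `tool_use_id` from `tool_call_id`, which is why the round trip loses nothing.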
Options [#options]
Pass compression options as the second argument:
```ts twoslash
import { withHeadroom } from 'headroom-ai/anthropic';
import Anthropic from '@anthropic-ai/sdk';
const client = withHeadroom(new Anthropic(), {
  model: 'claude-sonnet-4-5-20250929',
  baseUrl: 'http://localhost:8787',
});
```
Streaming [#streaming]
Streaming works normally. Compression happens before the request:
```ts twoslash
import { withHeadroom } from 'headroom-ai/anthropic';
import Anthropic from '@anthropic-ai/sdk';
const client = withHeadroom(new Anthropic());
const stream = await client.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  messages: longConversation,
  max_tokens: 1024,
  stream: true,
});
```
Tool use [#tool-use]
Tool results are where compression has the biggest impact. Large JSON payloads from tool calls are compressed automatically:
```ts twoslash
import { withHeadroom } from 'headroom-ai/anthropic';
import Anthropic from '@anthropic-ai/sdk';
const client = withHeadroom(new Anthropic());
const response = await client.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  messages: [
    { role: 'user', content: 'What went wrong?' },
    {
      role: 'assistant',
      content: [
        { type: 'tool_use', id: 'toolu_1', name: 'get_logs', input: { service: 'api' } },
      ],
    },
    {
      role: 'user',
      content: [
        {
          type: 'tool_result',
          tool_use_id: 'toolu_1',
          content: hugeLogOutput, // Compressed automatically
        },
      ],
    },
  ],
  tools: [{ name: 'get_logs', description: 'Get logs', input_schema: { type: 'object', properties: {} } }],
});
```
# API Reference (/docs/api-reference)
Complete API reference for the Headroom Python and TypeScript SDKs.
Core [#core]
HeadroomClient [#headroomclient]
The main entry point for the Headroom SDK.
```ts twoslash
import { HeadroomClient } from 'headroom-ai';
const client = new HeadroomClient({
  baseUrl: 'http://localhost:8787',
  apiKey: 'your-api-key',
  timeout: 30_000,
  fallback: true,
  retries: 2,
});
```
**Constructor Parameters**
```python
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI
client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",
)
```
chat.completions.create() [#chatcompletionscreate]
Create a chat completion with optional optimization.
The TypeScript SDK uses `compress()` to optimize messages before sending them to your LLM client:
```ts twoslash
import { compress } from 'headroom-ai';
const result = await compress(messages, {
  model: 'gpt-4o',
  tokenBudget: 100_000,
});
// Then pass result.messages to your LLM client
```
Accepts all standard OpenAI/Anthropic parameters plus Headroom-specific overrides:
```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    headroom_mode="optimize",
    headroom_keep_turns=5,
    headroom_tool_profiles={
        "important_tool": {"skip_compression": True},
    },
)
```
chat.completions.simulate() [#chatcompletionssimulate]
Preview optimization without making an API call.
```python
plan = client.chat.completions.simulate(
    model="gpt-4o",
    messages=[...],
)
print(f"Tokens: {plan.tokens_before} -> {plan.tokens_after}")
print(f"Savings: {plan.savings_percent:.1f}%")
print(f"Transforms: {plan.transforms_applied}")
```
**Returns:** `SimulationResult`
compress() (TypeScript) [#compress-typescript]
Top-level function to compress messages via the Headroom proxy.
```ts twoslash
import { compress } from 'headroom-ai';
const result = await compress(messages, {
  model: 'gpt-4o',
  baseUrl: 'http://localhost:8787',
  timeout: 15_000,
  fallback: true,
  retries: 2,
  tokenBudget: 100_000,
});
```
get_stats() [#get_stats]
Quick stats for the current session (no database query).
```python
stats = client.get_stats()
# Returns dict with "session", "config", and "transforms" keys
```
get_metrics() [#get_metrics]
Query stored metrics from the database.
```python
from datetime import datetime, timedelta
metrics = client.get_metrics(
    start_time=datetime.utcnow() - timedelta(hours=1),
    limit=100,
)
```
get_summary() [#get_summary]
Aggregate statistics across all stored metrics.
```python
summary = client.get_summary()
# Returns dict with total_requests, total_tokens_saved,
# avg_compression_ratio, total_cost_saved_usd
```
validate_setup() [#validate_setup]
Validate that the client is configured correctly.
```python
result = client.validate_setup()
if not result["valid"]:
    for issue in result["issues"]:
        print(f"  - {issue}")
```
***
Configuration [#configuration]
SmartCrusherConfig [#smartcrusherconfig]
```python
from headroom import SmartCrusherConfig
config = SmartCrusherConfig(
    min_tokens_to_crush=200,
    max_items_after_crush=50,
    keep_first=3,
    keep_last=2,
    relevance_threshold=0.3,
    anomaly_std_threshold=2.0,
    preserve_errors=True,
)
```
CacheAlignerConfig [#cachealignerconfig]
```python
from headroom import CacheAlignerConfig
config = CacheAlignerConfig(
    enabled=True,
    extract_dates=True,
    normalize_whitespace=True,
    stable_prefix_min_tokens=100,
)
```
RollingWindowConfig [#rollingwindowconfig]
```python
from headroom import RollingWindowConfig
config = RollingWindowConfig(
    max_tokens=100000,
    preserve_system=True,
    preserve_recent_turns=5,
    drop_oldest_first=True,
)
```
IntelligentContextConfig [#intelligentcontextconfig]
```python
from headroom.config import IntelligentContextConfig, ScoringWeights
config = IntelligentContextConfig(
    enabled=True,
    keep_system=True,
    keep_last_turns=2,
    output_buffer_tokens=4000,
    use_importance_scoring=True,
    scoring_weights=ScoringWeights(),
    toin_integration=True,
)
```
ScoringWeights [#scoringweights]
Weights are automatically normalized to sum to 1.0.
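What "normalized to sum to 1.0" means in practice, as a minimal sketch (the dimension names here are illustrative, not the actual field names):

```python
def normalize(weights: dict[str, float]) -> dict[str, float]:
    # Scale all weights so they sum to 1.0, preserving their ratios
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

raw = {"recency": 2.0, "relevance": 1.0, "importance": 1.0}
normalized = normalize(raw)
# normalized["recency"] == 0.5; values sum to 1.0
```

This means only the relative magnitudes of the weights you pass matter, not their absolute scale.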
HeadroomConfig [#headroomconfig]
The top-level config object that contains all sub-configurations:
```python
from headroom import HeadroomConfig
config = HeadroomConfig()
config.smart_crusher.min_tokens_to_crush = 100
config.cache_aligner.enabled = True
config.rolling_window.preserve_recent_turns = 3
```
RelevanceScorerConfig [#relevancescorerconfig]
***
Results [#results]
CompressResult (TypeScript) [#compressresult-typescript]
SimulationResult (Python) [#simulationresult-python]
WasteSignals (Python) [#wastesignals-python]
RequestMetrics (Python) [#requestmetrics-python]
***
Providers [#providers]
OpenAIProvider [#openaiprovider]
```python
from headroom import OpenAIProvider
provider = OpenAIProvider(
    enable_prefix_caching=True,
)
counter = provider.get_token_counter("gpt-4o")
tokens = counter.count_text("Hello, world!")
limit = provider.get_context_limit("gpt-4o") # 128000
cost = provider.estimate_cost(input_tokens=1000, output_tokens=500, model="gpt-4o")
```
AnthropicProvider [#anthropicprovider]
```python
from headroom import AnthropicProvider
from anthropic import Anthropic
provider = AnthropicProvider(
    client=Anthropic(),
    enable_cache_control=True,
)
counter = provider.get_token_counter("claude-3-5-sonnet-latest")
tokens = counter.count_messages(messages) # Accurate count via API
```
GoogleProvider [#googleprovider]
```python
from headroom import GoogleProvider
provider = GoogleProvider(
    enable_context_caching=True,
)
```
***
Relevance Scoring [#relevance-scoring]
create_scorer() [#create_scorer]
Factory function to create scorers:
```python
from headroom import create_scorer
# Auto-select best available scorer
scorer = create_scorer()
# Explicitly choose type
scorer = create_scorer(scorer_type="hybrid", alpha=0.7)
```
BM25Scorer [#bm25scorer]
Fast keyword-based scoring (zero dependencies):
```python
from headroom import BM25Scorer
scorer = BM25Scorer()
scores = scorer.score_items(items=["item 1", "item 2"], query="search query")
```
EmbeddingScorer [#embeddingscorer]
Semantic similarity scoring (requires `headroom-ai[relevance]`):
```python
from headroom import EmbeddingScorer, embedding_available
if embedding_available():
    scorer = EmbeddingScorer(model="all-MiniLM-L6-v2")
    scores = scorer.score_items(items, query)
```
HybridScorer [#hybridscorer]
Combines BM25 and embeddings:
```python
from headroom import HybridScorer
scorer = HybridScorer(alpha=0.5) # 50% BM25, 50% embedding
scores = scorer.score_items(items, query)
```
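The combination is a weighted average; conceptually (a sketch of the blend, not HybridScorer's internals):

```python
def hybrid_score(bm25_score: float, embedding_score: float, alpha: float = 0.5) -> float:
    # alpha weights the BM25 (keyword) score; 1 - alpha weights the embedding score
    return alpha * bm25_score + (1 - alpha) * embedding_score

# alpha=0.7 leans toward keyword matching
score = hybrid_score(0.9, 0.4, alpha=0.7)  # ~0.75
```

Raise `alpha` when queries contain exact identifiers (error codes, function names); lower it when queries are paraphrased natural language.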
***
Transforms (Direct Use) [#transforms-direct-use]
SmartCrusher [#smartcrusher]
```python
from headroom import SmartCrusher
crusher = SmartCrusher()
result = crusher.crush(data={"results": [...]}, query="user query")
```
CacheAligner [#cachealigner]
```python
from headroom import CacheAligner
aligner = CacheAligner()
result = aligner.align(messages)
```
RollingWindow [#rollingwindow]
```python
from headroom import RollingWindow
window = RollingWindow(config)
result = window.apply(messages, max_tokens=100000)
```
IntelligentContextManager [#intelligentcontextmanager]
```python
from headroom.transforms import IntelligentContextManager
from headroom.config import IntelligentContextConfig
config = IntelligentContextConfig(
    keep_system=True,
    keep_last_turns=2,
    use_importance_scoring=True,
)
manager = IntelligentContextManager(config, toin=toin)
result = manager.apply(messages, tokenizer, model_limit=128000)
```
TransformPipeline [#transformpipeline]
```python
from headroom import TransformPipeline
pipeline = TransformPipeline([
    SmartCrusher(),
    CacheAligner(),
    RollingWindow(),
])
result = pipeline.transform(messages)
```
***
Errors [#errors]
**TypeScript**

| Exception                 | Meaning                                                  |
| ------------------------- | ------------------------------------------------------- |
| `HeadroomError` | Base class for all errors |
| `HeadroomConnectionError` | Cannot reach proxy |
| `HeadroomAuthError` | 401 from proxy |
| `HeadroomCompressError` | Compression failed (includes `statusCode`, `errorType`) |
| `ConfigurationError` | Invalid configuration |
| `ProviderError` | Provider issues |
| `StorageError` | Storage failures |
| `TokenizationError` | Token counting failed |
| `CacheError` | Cache operations failed |
| `ValidationError` | Validation failures |
| `TransformError` | Transform execution failed |
Use `mapProxyError(status, type, message)` to convert proxy error responses to the correct class.
**Python**

| Exception            | Meaning                              |
| -------------------- | ------------------------------------ |
| `HeadroomError` | Base class for all Headroom errors |
| `ConfigurationError` | Invalid config values |
| `ProviderError` | Provider issue (unknown model, etc.) |
| `StorageError` | Database issue |
| `CompressionError` | Compression failed (rare) |
| `ValidationError` | Setup validation failed |
All exceptions include a `details` dict with additional context.
***
Utilities [#utilities]
Tokenizer [#tokenizer]
```python
from headroom import Tokenizer, count_tokens_text, count_tokens_messages
# Quick counting
tokens = count_tokens_text("Hello, world!", model="gpt-4o")
# With tokenizer instance
tokenizer = Tokenizer(model="gpt-4o")
tokens = tokenizer.count_text("Hello")
tokens = tokenizer.count_messages(messages)
```
generate_report() [#generate_report]
Generate HTML/Markdown reports from stored metrics:
```python
from headroom import generate_report
report = generate_report(
    store_url="sqlite:///headroom.db",
    format="html",
    period="day",
)
```
***
TypeScript Message Types [#typescript-message-types]
The TypeScript SDK uses the standard OpenAI message format with `SystemMessage`, `UserMessage`, `AssistantMessage`, and `ToolMessage` variants.
# Architecture (/docs/architecture)
Headroom sits between your application and the LLM provider. It intercepts messages, compresses them intelligently, and forwards the optimized request. The response comes back unchanged.
High-Level Flow [#high-level-flow]
```
+---------------------------------------------------------------+
| YOUR APPLICATION |
+---------------------------------------------------------------+
|
v
+---------------------------------------------------------------+
| HEADROOM CLIENT |
| +-----------+ +------------+ +---------+ |
| | ANALYZE | > | TRANSFORM | > | CALL | |
| | (Parser) | | (Pipeline)| | (API) | |
| +-----------+ +------------+ +---------+ |
| | | | |
| v v v |
| Count tokens Apply compressions Send to LLM provider |
| Detect waste Preserve meaning Log metrics |
+---------------------------------------------------------------+
|
v
+---------------------------------------------------------------+
| OPENAI / ANTHROPIC / GOOGLE |
+---------------------------------------------------------------+
```
Entry Points [#entry-points]
Headroom can be used in three ways, all feeding into the same pipeline:
| Entry Point | How It Works | Code Changes |
| ---------------- | ------------------------------------------------ | ---------------------------------- |
| **SDK Mode** | Wrap your LLM client with `HeadroomClient` | Minimal -- swap client constructor |
| **Proxy Mode** | Run `headroom proxy` and point your client at it | Zero -- just change the base URL |
| **Integrations** | LangChain, Vercel AI SDK, Agno adapters | Framework-specific setup |
The Transform Pipeline [#the-transform-pipeline]
Messages flow through a sequence of transforms. Each transform is independent, safe to skip, and fails gracefully (returns original content unchanged).
Stage 1: Cache Aligner [#stage-1-cache-aligner]
Extracts dynamic content (dates, UUIDs, session tokens) from your system prompt and moves it to the end. This stabilizes the prefix so provider caches (Anthropic `cache_control`, OpenAI prefix caching) can hit on repeated calls.
```
Before: "You are helpful. Current Date: 2024-12-15"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Changes daily = cache miss every day
After: "You are helpful." [stable prefix]
"[Context: Current Date: 2024-12-15]" [dynamic tail]
```
Overhead: sub-millisecond.
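The relocation can be sketched as a regex extraction (a simplified illustration; the real aligner handles many more patterns than a single date format):

```python
import re

DYNAMIC = re.compile(r"Current Date: \d{4}-\d{2}-\d{2}")

def align(system_prompt: str) -> str:
    match = DYNAMIC.search(system_prompt)
    if match is None:
        return system_prompt
    # Remove the dynamic part from the prefix and append it as a tail
    stable = DYNAMIC.sub("", system_prompt).strip()
    return f"{stable}\n[Context: {match.group(0)}]"

aligned = align("You are helpful. Current Date: 2024-12-15")
# "You are helpful.\n[Context: Current Date: 2024-12-15]"
```

The stable part stays byte-identical across days, so the provider's prefix cache keeps hitting.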
Stage 2: Smart Crusher [#stage-2-smart-crusher]
Analyzes tool output content and compresses it using statistical methods. This is where the bulk of token savings come from.
**What it does:**
1. Parses JSON arrays in tool outputs
2. Runs field-level statistical analysis (variance, uniqueness, change points)
3. Selects a representative subset using the Kneedle algorithm on bigram coverage
4. Preserves errors, anomalies, and distribution boundaries unconditionally
5. Factors out constant fields shared by all items
**Strategies by content type:**
| Content | Strategy | Typical Savings |
| ---------------------- | ------------------------------------------- | --------------- |
| JSON arrays of dicts | Statistical sampling + anomaly preservation | 83--95% |
| JSON arrays of strings | Dedup + adaptive sampling | 60--90% |
| JSON arrays of numbers | Statistical summary + outlier preservation | 70--85% |
| Build/test logs | Pattern clustering | 85--94% |
| HTML | Article extraction (trafilatura-based) | \~95% |
**Item retention split:** 30% from array start (schema), 15% from end (recency), 55% by importance score. Error items are always kept regardless of budget.
Overhead: 1--50ms for typical payloads. Scales linearly with input size.
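The retention split above amounts to simple budget arithmetic (an illustration, not SmartCrusher's code):

```python
def retention_split(items_to_keep: int) -> tuple[int, int, int]:
    # 30% from the array start (schema), 15% from the end (recency),
    # and the remainder chosen by importance score
    head = round(items_to_keep * 0.30)
    tail = round(items_to_keep * 0.15)
    return head, tail, items_to_keep - head - tail

head, tail, scored = retention_split(20)
# (6, 3, 11)
```

Error items sit outside this budget entirely: they are kept in addition to the split.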
Stage 3: Context Manager [#stage-3-context-manager]
Ensures the final message array fits within the model's context window.
**Rolling Window** (default): Drops oldest messages first, preserving system prompt and recent turns. Tool calls and their responses are dropped as atomic units.
**Intelligent Context** (advanced): Scores every message on six dimensions (recency, semantic similarity, TOIN importance, error indicators, forward references, token density) and drops the lowest-scored messages first. Dropped messages are stored in CCR for potential retrieval.
Overhead: sub-millisecond for Rolling Window; depends on scoring config for Intelligent Context.
Provider Cache Optimization [#provider-cache-optimization]
After the pipeline, Headroom applies provider-specific cache hints:
| Provider | Mechanism | Savings |
| --------- | --------------------------------------- | -------------------------- |
| Anthropic | `cache_control` blocks on stable prefix | Up to 90% on cached tokens |
| OpenAI | Prefix alignment for automatic caching | Up to 50% on cached tokens |
| Google | `CachedContent` API | Up to 75% on cached tokens |
CCR: Compress-Cache-Retrieve [#ccr-compress-cache-retrieve]
When SmartCrusher compresses a tool output or Intelligent Context drops messages, the original content is stored in a local compression cache. If the LLM needs the full data, it can request retrieval via a `ccr_retrieve` tool call. This makes compression reversible.
```
Compress: 1000 items -> 15 items (stored original in CCR)
Cache: Hash-indexed local store (SQLite)
Retrieve: LLM calls ccr_retrieve("abc123") -> original 1000 items
```
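A toy version of the hash-indexed store (the real cache is SQLite-backed and the class name here is illustrative):

```python
import hashlib
import json

class ToyCCRStore:
    def __init__(self):
        self._originals: dict[str, object] = {}

    def put(self, payload) -> str:
        # Hash the original payload to derive a short retrieval key
        blob = json.dumps(payload, sort_keys=True).encode()
        key = hashlib.sha256(blob).hexdigest()[:8]
        self._originals[key] = payload
        return key

    def retrieve(self, key: str):
        # What a ccr_retrieve tool call would return
        return self._originals[key]

store = ToyCCRStore()
key = store.put([{"id": i} for i in range(1000)])
assert len(store.retrieve(key)) == 1000
```

Because the key is content-derived, storing the same payload twice is idempotent.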
TOIN: Tool Output Intelligence Network [#toin-tool-output-intelligence-network]
TOIN learns compression patterns across sessions and users. When a tool is used repeatedly, TOIN builds up statistics about which fields matter, which items get retrieved, and what compression strategies work best. These learned patterns feed back into SmartCrusher and Intelligent Context scoring.
Cold start: For new tool types, TOIN falls back to statistical heuristics. Patterns build up over time as tools are used.
What Headroom Does NOT Touch [#what-headroom-does-not-touch]
* **User messages**: Never compressed (the user's intent must be preserved exactly)
* **System prompts**: Content preserved; only dynamic parts are relocated for caching
* **Code**: Passes through unchanged unless tree-sitter AST compression is explicitly enabled
* **Model responses**: Returned unchanged from the provider
* **Short content**: Tool outputs under 200 tokens pass through (overhead exceeds savings)
# Benchmarks (/docs/benchmarks)
Headroom's core promise: compress context without losing accuracy. This page covers compression benchmarks, accuracy evaluations, latency overhead, and production telemetry.
Compression Performance [#compression-performance]
Tested on Apple M-series (CPU), Headroom v0.5.18. Each test runs `compress()` on realistic tool outputs.
| Content Type | Original | Compressed | Saved | Ratio | Latency |
| --------------------------- | ---------- | ---------- | ---------- | --------- | ------- |
| JSON array (100 items) | 3,163 | 297 | 2,866 | **90.6%** | 1ms |
| JSON array (500 items) | 9,526 | 1,614 | 7,912 | **83.1%** | 2ms |
| Shell output (200 lines) | 3,238 | 469 | 2,769 | **85.5%** | 1ms |
| Build log (200 lines) | 2,412 | 148 | 2,264 | **93.9%** | 1ms |
| grep results (150 hits) | 2,624 | 2,624 | 0 | 0.0% | \<1ms |
| Python source (\~480 lines) | 2,958 | 2,958 | 0 | 0.0% | \<1ms |
| **Total** | **23,921** | **8,110** | **15,811** | **66.1%** | **5ms** |
grep results and Python source show 0% compression by design: grep output is already compact, SmartCrusher only compresses JSON arrays, and code passes through untouched to preserve correctness.
Accuracy Benchmarks [#accuracy-benchmarks]
HTML Extraction [#html-extraction]
**Dataset**: Scrapinghub Article Extraction Benchmark (181 HTML pages with ground truth)
| Metric | Value |
| --------------- | ----- |
| **F1 Score** | 0.919 |
| **Precision** | 0.879 |
| **Recall** | 0.982 |
| **Compression** | 94.9% |
For LLM applications, recall is critical -- 98.2% means nearly all article content is preserved. The slight precision drop (some extra content) does not hurt LLM accuracy.
JSON Compression (SmartCrusher) [#json-compression-smartcrusher]
**Test**: 100 production log entries with critical error at position 67. Task: find the error, error code, resolution, and affected count.
| Metric | Baseline | Headroom |
| --------------- | -------- | --------- |
| Input tokens | 10,144 | 1,260 |
| Correct answers | 4/4 | **4/4** |
| Compression | -- | **87.6%** |
SmartCrusher preserves first N items (schema), last N items (recency), all anomalies (errors, warnings), and statistical distribution.
QA Accuracy Preservation [#qa-accuracy-preservation]
| Metric | Original HTML | Extracted | Delta |
| ----------- | ------------- | --------- | ----- |
| F1 Score | 0.85 | 0.87 | +0.02 |
| Exact Match | 60% | 62% | +2% |
Removing HTML noise sometimes helps LLMs focus on relevant content, leading to slightly higher scores on extraction benchmarks.
Latency Overhead [#latency-overhead]
SDK Compression Latency [#sdk-compression-latency]
Measured per-scenario on Apple M-series (CPU):
| Scenario | Tokens In | Tokens Out | Saved | p50 (ms) | p95 (ms) |
| -------------------------------- | --------- | ---------- | ----- | -------- | -------- |
| JSON: Search Results (100 items) | 10.2K | 1.5K | 8.7K | 189 | 231 |
| JSON: Search Results (500 items) | 50.2K | 1.5K | 48.7K | 943 | 955 |
| JSON: Search Results (1K items) | 100.5K | 1.5K | 99.0K | 2,012 | 2,198 |
| JSON: API Responses (500 items) | 38.9K | 1.1K | 37.8K | 743 | 776 |
| JSON: Database Rows (1K rows) | 43.7K | 605 | 43.1K | 961 | 1,104 |
| JSON: String Array (100 strings) | 1.1K | 231 | 820 | 15 | 15 |
| JSON: String Array (500 strings) | 4.9K | 233 | 4.6K | 72 | 80 |
| JSON: Number Array (200 numbers) | 1.2K | 192 | 1.1K | 31 | 62 |
| JSON: Mixed Array (250 items) | 2.3K | 368 | 1.9K | 38 | 40 |
Cost-Benefit Analysis [#cost-benefit-analysis]
Net latency benefit = LLM time saved from fewer tokens minus compression overhead (at Claude Sonnet pricing, $3.0/MTok):
| Scenario | Compress (ms) | LLM Saved (ms) | Net Benefit | Savings per 1K Requests |
| -------------------------------- | ------------- | -------------- | ----------- | ----------------------- |
| JSON: Search Results (100 items) | 189 | 261 | **+72ms** | $26 |
| JSON: Search Results (500 items) | 943 | 1,461 | **+518ms** | $146 |
| JSON: Search Results (1K items) | 2,012 | 2,969 | **+957ms** | $297 |
| JSON: API Responses (500 items) | 743 | 1,134 | **+391ms** | $113 |
| JSON: Database Rows (1K rows) | 961 | 1,292 | **+331ms** | $129 |
Compression pays for itself in latency for 11 of 12 tested scenarios against Claude Sonnet. Slower and more expensive models (Opus) benefit even more.
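The dollar column follows directly from the tokens saved; as a sanity check at the quoted $3/MTok input price:

```python
def savings_per_1k_requests(tokens_saved: int, usd_per_mtok: float = 3.0) -> float:
    # Input-token dollars saved across 1,000 requests
    return tokens_saved * (usd_per_mtok / 1_000_000) * 1_000

# 8.7K tokens saved per request -> about $26 per 1K requests
round(savings_per_1k_requests(8_700))  # 26
```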
Pipeline Step Timing [#pipeline-step-timing]
| Step | Median | P90 | Description |
| --------------------- | ------ | ----- | -------------------------------- |
| `pipeline_total` | 16.9ms | 289ms | Full compression pipeline |
| `content_router` | 11.7ms | 259ms | Content detection + routing |
| `smart_crusher` | 50.1ms | 50ms | JSON array compression |
| `text_compressor` | 32.0ms | 576ms | Text compression (Kompress ONNX) |
| `initial_token_count` | 2.9ms | 16ms | Token counting (tiktoken) |
ContentRouter accounts for 91--98% of pipeline cost on average. CacheAligner and RollingWindow are sub-millisecond.
Production Telemetry [#production-telemetry]
Real-world data from **50,000+ proxy sessions** across 250+ unique instances (March--April 2026). Collected via anonymous telemetry (opt-out: `HEADROOM_TELEMETRY=off`).
Proxy Overhead [#proxy-overhead]
| Percentile | Latency |
| ---------------- | -------- |
| **Median (P50)** | **52ms** |
| P90 | 309ms |
| P99 | 4,172ms |
| Mean | 161ms |
The median 52ms overhead is negligible compared to LLM inference time (typically 2--10 seconds).
Compression Rate [#compression-rate]
| Percentile | Compression |
| ---------- | ----------- |
| P25 | 4.8% |
| **Median** | **4.8%** |
| P75 | 6.9% |
| Mean | 11.3% |
Median compression is modest because many requests are short conversational turns. Heavy tool-use sessions (file reads, shell output) see 40--80% compression.
Fleet Summary [#fleet-summary]
| Metric | Value |
| ------------------ | -------------------------------- |
| Clean instances | 249 |
| Total tokens saved | 1.4 billion |
| Total savings | \~$4,000 |
| OS distribution | Linux 57%, macOS 38%, Windows 5% |
Reproducing Results [#reproducing-results]
```bash
git clone https://github.com/chopratejas/headroom.git
cd headroom
pip install -e ".[evals,html]"
pytest tests/test_evals/ -v -s
```
# Cache Optimization (/docs/cache-optimization)
LLM providers cache prompt prefixes to avoid reprocessing identical input on repeated calls. Headroom's **CacheAligner** stabilizes your message prefixes so these caches actually hit, and then applies provider-specific strategies to maximize savings.
How CacheAligner works [#how-cachealigner-works]
System prompts often contain dynamic content -- today's date, session IDs, timestamps -- that changes between requests. Even a single character difference at the start of a prompt invalidates the entire provider cache.
CacheAligner solves this by extracting dynamic content and moving it to the end of the message, keeping the prefix stable:
```
Before:
"You are helpful. Current Date: 2025-04-06" <- changes daily, no cache hit
After:
"You are helpful." <- stable prefix, cache hit
"[Context: Current Date: 2025-04-06]" <- dynamic part moved to tail
```
The prefix stays byte-identical across requests, so the provider's KV cache can reuse previously computed attention states.
Provider-specific strategies [#provider-specific-strategies]
Each LLM provider implements caching differently. Headroom applies the optimal strategy for each.
Anthropic [#anthropic]
Anthropic supports explicit `cache_control` blocks that mark content as cacheable. Cached input tokens cost **90% less** than regular input tokens.
Headroom automatically inserts `cache_control` breakpoints at the right positions in your messages so that stable prefixes (system prompts, early conversation turns) are cached across requests.
| Metric | Value |
| ------------------- | --------------------------- |
| Cache read discount | 90% off input price |
| Cache write cost | 25% premium on first write |
| Cache TTL | 5 minutes (extended on hit) |
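Given the multipliers above, a quick break-even sketch: one cache write at a 25% premium versus subsequent reads at 10% of the input price:

```python
def cached_cost_ratio(reads_after_write: int) -> float:
    # (1 write at 1.25x + n reads at 0.10x) / (n + 1 uncached requests at 1x)
    cached = 1.25 + 0.10 * reads_after_write
    uncached = 1.0 * (1 + reads_after_write)
    return cached / uncached

cached_cost_ratio(1)   # ~0.675 -- caching already wins on the first reuse
cached_cost_ratio(10)  # ~0.20
```

In other words, a single cache hit within the 5-minute TTL more than pays back the write premium.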
OpenAI [#openai]
OpenAI uses automatic **prefix caching** -- if consecutive requests share the same message prefix, the provider reuses cached KV states. No explicit API markers are needed, but the prefix must be byte-identical.
CacheAligner ensures your prefixes remain stable by extracting dynamic content, which is the key requirement for OpenAI prefix caching to work.
| Metric | Value |
| ------------------- | ------------------------ |
| Cache read discount | 50% off input price |
| Activation | Automatic (prefix match) |
| Min prefix length | 1024 tokens |
Google [#google]
Google provides the **CachedContent API**, which lets you explicitly cache large context (system instructions, documents, tools) and reference it across requests. Cached tokens cost **75% less**.
Headroom can manage CachedContent lifecycle automatically, creating and refreshing cached content objects as needed.
| Metric | Value |
| ------------------- | ---------------------------------- |
| Cache read discount | 75% off input price |
| Mechanism | Explicit CachedContent API objects |
| Min cache size | 32,768 tokens |
Configuration [#configuration]
```ts twoslash
import { compress } from "headroom-ai";
import type {
CacheAlignerConfig,
CacheOptimizerConfig,
HeadroomConfig,
} from "headroom-ai";
// CacheAligner: stabilize prefixes for cache hits
const cacheAligner: CacheAlignerConfig = {
enabled: true,
datePatterns: [
"Today is \\w+ \\d+, \\d{4}",
"Current time: .*",
],
normalizeWhitespace: true,
collapseBlankLines: true,
};
// CacheOptimizer: provider-level caching
const cacheOptimizer: CacheOptimizerConfig = {
enabled: true,
autoDetectProvider: true, // Detect Anthropic/OpenAI/Google automatically
minCacheableTokens: 1024,
};
// Full configuration
const config: HeadroomConfig = {
cacheAligner,
cacheOptimizer,
};
// Compress with cache optimization
const result = await compress(messages, {
model: "claude-sonnet-4-20250514",
config,
});
```
```python
from headroom import HeadroomClient, OpenAIProvider, AnthropicProvider, GoogleProvider
from headroom.transforms import CacheAlignerConfig
from openai import OpenAI
# CacheAligner configuration
aligner_config = CacheAlignerConfig(
enabled=True,
dynamic_patterns=[
r"Today is \w+ \d+, \d{4}",
r"Current time: .*",
r"Session ID: [a-f0-9-]+",
],
)
# Provider-specific cache settings
# OpenAI: prefix caching (automatic, just keep prefixes stable)
client = HeadroomClient(
original_client=OpenAI(),
provider=OpenAIProvider(enable_prefix_caching=True),
enable_cache_optimizer=True,
)
# Anthropic: cache_control blocks (90% read discount)
from anthropic import Anthropic
client = HeadroomClient(
original_client=Anthropic(),
provider=AnthropicProvider(enable_cache_control=True),
enable_cache_optimizer=True,
)
# Google: CachedContent API (75% read discount)
client = HeadroomClient(
original_client=google_client,
provider=GoogleProvider(enable_context_caching=True),
enable_cache_optimizer=True,
)
```
How savings compound [#how-savings-compound]
CacheAligner and provider caching work together with Headroom's compression transforms:
1. **SmartCrusher** reduces token count by 70-90%
2. **CacheAligner** stabilizes prefixes so provider caches hit
3. **Provider caching** discounts the remaining input tokens by 50-90%
For example, with Anthropic:
* 100K input tokens compressed to 20K (80% savings from SmartCrusher)
* 18K of those 20K hit the cache (90% cache read discount)
* Effective cost: 2K full-price tokens + 18K at 10% = 3.8K equivalent tokens
* **Total savings: 96.2%** compared to the original 100K tokens
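The arithmetic above can be reproduced directly:

```python
# Worked example: SmartCrusher compression plus Anthropic cache discounts.
original_tokens = 100_000
compressed = 20_000             # 80% reduction from SmartCrusher
cached = 18_000                 # portion of the compressed prefix that hits the cache
uncached = compressed - cached  # 2,000 full-price tokens

# Cached reads cost 10% of the input price (90% discount).
effective = uncached + cached * 0.10
savings = 1 - effective / original_tokens

print(f"Effective cost: {effective:,.0f} token-equivalents")  # 3,800
print(f"Total savings: {savings:.1%}")                        # 96.2%
```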
# Reversible Compression (CCR) (/docs/ccr)
Headroom's CCR (Compress-Cache-Retrieve) architecture makes compression **reversible**. When content is compressed, the original data is cached locally. If the LLM needs the full data, it retrieves it instantly.
Unlike traditional lossy compression, CCR guarantees that every piece of original data remains accessible. You get 70-90% token savings with zero risk of permanent data loss.
The problem with traditional compression [#the-problem-with-traditional-compression]
Traditional compression forces a difficult tradeoff:
* **Aggressive compression** risks losing data the LLM needs
* **Conservative compression** misses out on token savings
CCR eliminates this tradeoff entirely. Compress aggressively, retrieve on demand.
Architecture [#architecture]
CCR flows through four phases:
```
TOOL OUTPUT (1000 items)
-> SmartCrusher compresses to 20 items
-> Original cached with hash=abc123
-> Retrieval tool injected into context
LLM PROCESSING
Option A: LLM solves task with 20 items -> Done (90% savings)
Option B: LLM calls headroom_retrieve(hash=abc123)
-> Response Handler returns full data automatically
```
Phase 1: Compression Store [#phase-1-compression-store]
When SmartCrusher compresses tool output:
1. The original content is stored in an LRU cache
2. A hash key is generated for retrieval
3. A marker is added to the compressed output:
```
[1000 items compressed to 20. Retrieve more: hash=abc123]
```
Phase 2: Tool Injection [#phase-2-tool-injection]
Headroom injects a `headroom_retrieve` tool into the LLM's available tools:
```json
{
"name": "headroom_retrieve",
"description": "Retrieve original uncompressed data from Headroom cache",
"parameters": {
"hash": "The hash key from the compression marker",
"query": "Optional: search within the cached data"
}
}
```
The LLM sees this tool alongside your application's tools and can call it whenever the compressed data is insufficient.
Phase 3: Response Handler [#phase-3-response-handler]
When the LLM calls `headroom_retrieve`:
1. The Response Handler intercepts the tool call
2. Data is retrieved from the local cache (around 1ms)
3. The result is added to the conversation
4. The API call continues automatically
The client never sees CCR tool calls -- they are handled transparently by Headroom.
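The interception logic can be sketched as follows (names and message shapes are illustrative, not Headroom's internals): serve `headroom_retrieve` calls from the local store, and pass every other tool call through to the application.

```python
def handle_tool_calls(tool_calls, store, messages):
    """Intercept headroom_retrieve calls; return the rest for the application."""
    unhandled = []
    for call in tool_calls:
        if call["name"] == "headroom_retrieve":
            data = store.get(call["arguments"]["hash"])  # local lookup, ~1ms
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": str(data),
            })
        else:
            unhandled.append(call)  # the client sees only its own tools
    return unhandled

store = {"abc123": ["full", "original", "payload"]}
msgs = []
leftover = handle_tool_calls(
    [{"id": "t1", "name": "headroom_retrieve", "arguments": {"hash": "abc123"}},
     {"id": "t2", "name": "my_app_tool", "arguments": {}}],
    store, msgs,
)
assert len(leftover) == 1 and leftover[0]["name"] == "my_app_tool"
```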
Phase 4: Context Tracker [#phase-4-context-tracker]
Across multiple turns, the Context Tracker maintains awareness of all compressed content:
1. Remembers what was compressed in earlier turns
2. Analyzes new queries for relevance to compressed content
3. Proactively expands relevant data before the LLM asks
```
Turn 1: User searches for files
-> 500 files compressed to 15, cached (hash=abc123)
-> LLM answers with 15 files
Turn 5: User asks "What about the auth middleware?"
-> Context Tracker detects "auth" may match cached content
-> Proactively expands compressed data
-> LLM finds auth_middleware.py in the full list
```
BM25 search within compressed data [#bm25-search-within-compressed-data]
The LLM does not have to retrieve everything. It can search within compressed data using the optional `query` parameter:
```json
{
"name": "headroom_retrieve",
"parameters": {
"hash": "abc123",
"query": "authentication errors"
}
}
```
This runs a BM25 search over the cached items, returning only the relevant subset instead of the full original payload.
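For intuition, here is a self-contained sketch of BM25 scoring over cached items (Headroom's actual implementation, tokenizer, and parameters will differ):

```python
import math

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75):
    """Score each document against the query with standard BM25."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(docs)
    scores = [0.0] * n
    for term in query.lower().split():
        df = sum(1 for d in tokenized if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        for i, d in enumerate(tokenized):
            tf = d.count(term)
            # Term frequency saturated by k1, length-normalized by b.
            scores[i] += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
    return scores

items = [
    "authentication error: invalid token",
    "user profile updated successfully",
    "auth errors spiked at 09:00",
]
scores = bm25_scores("authentication errors", items)
best = max(range(len(items)), key=scores.__getitem__)
assert best == 0  # the error-bearing item ranks first
```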
Retrieving originals [#retrieving-originals]
CCR works automatically through the proxy, but you can also retrieve cached data programmatically:
```ts twoslash
import { compress } from "headroom-ai";
import type { CCRConfig } from "headroom-ai";
// CCR is enabled by default when compressing through the proxy.
const result = await compress(messages, {
model: "gpt-4o",
});
// Access compressed messages — CCR markers are embedded automatically
console.log(result.messages);
// CCR configuration options
const ccrConfig: CCRConfig = {
enabled: true,
injectTool: true, // Inject headroom_retrieve tool
injectRetrievalMarker: true, // Add retrieval markers to compressed output
feedbackEnabled: true, // Learn from retrieval patterns
storeMaxEntries: 1000, // Max cached items
storeTtlSeconds: 3600, // Cache TTL
};
```
```python
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI
client = HeadroomClient(
original_client=OpenAI(),
provider=OpenAIProvider(),
default_mode="optimize",
)
# CCR happens automatically during chat completions.
# The LLM calls headroom_retrieve when it needs more data.
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
)
# CCR is enabled by default. To disable:
# headroom proxy --no-ccr-responses
# To disable proactive expansion:
# headroom proxy --no-ccr-expansion
```
Message-level CCR [#message-level-ccr]
CCR is not limited to tool outputs. When IntelligentContext drops low-importance messages to fit the context budget, those messages are also stored in CCR:
```
100-message conversation (50K tokens)
-> IntelligentContext scores messages by importance
-> Drops 60 low-scoring messages
-> Dropped messages cached with hash=def456
-> Marker inserted: "60 messages dropped, retrieve: def456"
```
The marker includes the CCR reference so the LLM can recover earlier context:
```
[Earlier context compressed: 60 message(s) dropped by importance scoring.
Full content available via ccr_retrieve tool with reference 'def456'.]
```
When the LLM retrieves dropped messages via CCR, TOIN learns that those message patterns are important and scores them higher in future sessions -- improving drop decisions across all users.
CCR-enabled components [#ccr-enabled-components]
| Component | What it compresses | CCR integration |
| ---------------------- | -------------------------------- | --------------------------------------------- |
| **SmartCrusher** | JSON arrays (tool outputs) | Stores original array, marker includes hash |
| **ContentRouter** | Code, logs, search results, text | Stores original content by strategy |
| **IntelligentContext** | Messages (conversation turns) | Stores dropped messages, marker includes hash |
Why CCR matters [#why-ccr-matters]
| Approach | Risk | Savings |
| ----------------------- | ----------------- | ------- |
| No compression | None | 0% |
| Traditional compression | Data loss | 70-90% |
| CCR compression | None (reversible) | 70-90% |
CCR gives you the savings of aggressive compression with zero risk. The LLM can always retrieve the original data if needed.
# Code Compression (/docs/code-compression)
Headroom's CodeAwareCompressor uses tree-sitter to parse source code into an AST, then selectively compresses function bodies while preserving the structural elements that LLMs need -- imports, signatures, type annotations, and error handlers.
Why AST-Aware Compression? [#why-ast-aware-compression]
Naive truncation breaks code. Cutting a function in half leaves invalid syntax that confuses the LLM. CodeAwareCompressor guarantees:
* **Syntax validity** -- output always parses correctly
* **Structural preservation** -- imports, signatures, types, decorators are kept intact
* **Lightweight** -- \~50MB (tree-sitter) vs \~1GB for LLMLingua
Supported Languages [#supported-languages]
| Tier | Languages | Support Level |
| ------ | ------------------------------ | ------------------------- |
| Tier 1 | Python, JavaScript, TypeScript | Full AST analysis |
| Tier 2 | Go, Rust, Java, C, C++ | Function body compression |
What Gets Preserved vs Compressed [#what-gets-preserved-vs-compressed]
**Always preserved:**
* Import statements
* Function and method signatures
* Class definitions
* Type annotations
* Decorators
* Error handlers (`try`/`except`, `try`/`catch`)
**Compressed:**
* Function bodies (implementations)
* Comments (unless configured to preserve)
* Verbose docstrings (configurable: full, first line, or removed)
Example [#example]
```python
from headroom.transforms import CodeAwareCompressor
compressor = CodeAwareCompressor()
code = '''
import os
from typing import List
def process_items(items: List[str]) -> List[str]:
"""Process a list of items."""
results = []
for item in items:
if not item:
continue
processed = item.strip().lower()
results.append(processed)
return results
'''
result = compressor.compress(code, language="python")
print(result.compressed)
# import os
# from typing import List
#
# def process_items(items: List[str]) -> List[str]:
# """Process a list of items."""
# results = []
# for item in items:
# # ... (5 lines compressed)
# pass
print(f"Compression: {result.compression_ratio:.0%}") # ~55%
print(f"Syntax valid: {result.syntax_valid}") # True
```
Configuration [#configuration]
```python
from headroom.transforms import CodeAwareCompressor, CodeCompressorConfig, DocstringMode
config = CodeCompressorConfig(
preserve_imports=True, # Always keep imports
preserve_signatures=True, # Always keep function signatures
preserve_type_annotations=True, # Keep type hints
preserve_error_handlers=True, # Keep try/except blocks
preserve_decorators=True, # Keep decorators
docstring_mode=DocstringMode.FIRST_LINE, # FULL, FIRST_LINE, REMOVE
target_compression_rate=0.2, # Keep 20% of tokens
max_body_lines=5, # Lines to keep per function body
min_tokens_for_compression=100, # Skip small content
language_hint=None, # Auto-detect if None
fallback_to_llmlingua=True, # Use LLMLingua for unknown langs
)
compressor = CodeAwareCompressor(config)
result = compressor.compress(code)
```
Configuration Options [#configuration-options]
| Option | Default | Description |
| ---------------------------- | ------------ | -------------------------------------------------------- |
| `preserve_imports` | `True` | Keep all import statements |
| `preserve_signatures` | `True` | Keep function/method signatures |
| `preserve_type_annotations` | `True` | Keep type hints |
| `preserve_error_handlers` | `True` | Keep try/except blocks |
| `preserve_decorators` | `True` | Keep decorators |
| `docstring_mode` | `FIRST_LINE` | How to handle docstrings: `FULL`, `FIRST_LINE`, `REMOVE` |
| `target_compression_rate` | `0.2` | Fraction of tokens to keep (0.2 = keep 20%) |
| `max_body_lines` | `5` | Max lines to keep per function body |
| `min_tokens_for_compression` | `100` | Skip files smaller than this |
| `language_hint` | `None` | Override language detection |
| `fallback_to_llmlingua` | `True` | Use LLMLingua for unsupported languages |
Before and After [#before-and-after]
```python
# Before (full source file)
def process_data(items: List[str]) -> Dict[str, int]:
"""Process items and count occurrences."""
result = {}
for item in items:
item = item.strip().lower()
if item in result:
result[item] += 1
else:
result[item] = 1
return result
# After (signature preserved, body compressed)
def process_data(items: List[str]) -> Dict[str, int]:
"""Process items and count occurrences."""
result = {}
for item in items:
# ... (5 lines compressed)
pass
```
The LLM sees the function's purpose, its input/output types, and the general approach -- enough to reason about the code without needing every implementation line.
Installation [#installation]
```bash
# Install tree-sitter language pack
pip install "headroom-ai[code]"
```
Memory Management [#memory-management]
Tree-sitter parsers are lazy-loaded and cached. You can free memory when done:
```python
from headroom.transforms import is_tree_sitter_available, unload_tree_sitter
# Check if tree-sitter is installed
print(is_tree_sitter_available()) # True
# Free memory when done
unload_tree_sitter()
```
Performance [#performance]
| Metric | Value |
| --------------- | ---------------------------- |
| Compression | 40-70% token reduction |
| Speed | \~10-50ms per file |
| Memory | \~50MB (tree-sitter parsers) |
| Syntax validity | Guaranteed |
When you use the Headroom proxy or call `compress()`, source code is automatically detected and routed to CodeAwareCompressor. Direct usage gives you control over compression settings per language.
# Community Savings (/docs/community-savings)
Real-time aggregate metrics from Headroom proxy instances worldwide. All data is anonymous — only token counts, compression ratios, and cost estimates are collected. [Opt out anytime](https://github.com/chopratejas/headroom/blob/main/headroom/telemetry/beacon.py) with `HEADROOM_TELEMETRY=off`.
Overview [#overview]
Savings Over Time [#savings-over-time]
Top Savings by Instance [#top-savings-by-instance]
Instance Details [#instance-details]
# Configuration (/docs/configuration)
Headroom can be configured via the SDK constructor, proxy command line, environment variables, or per-request overrides.
Modes [#modes]
| Mode | Behavior | Use Case |
| ---------- | -------------------------------------- | ------------------------------------------- |
| `audit` | Observes and logs, no modifications | Production monitoring, baseline measurement |
| `optimize` | Applies safe, deterministic transforms | Production optimization |
| `simulate` | Returns plan without API call | Testing, cost estimation |
SDK Configuration [#sdk-configuration]
```ts twoslash
import { HeadroomClient } from 'headroom-ai';
// Reads from HEADROOM_BASE_URL and HEADROOM_API_KEY automatically
const client = new HeadroomClient();
// Or configure explicitly
const explicit = new HeadroomClient({
baseUrl: 'http://localhost:8787',
apiKey: 'your-api-key',
timeout: 30_000,
fallback: true,
retries: 2,
});
```
```python
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI
client = HeadroomClient(
original_client=OpenAI(),
provider=OpenAIProvider(),
# Mode: "audit" (observe only) or "optimize" (apply transforms)
default_mode="optimize",
# Enable provider-specific cache optimization
enable_cache_optimizer=True,
# Enable query-level semantic caching
enable_semantic_cache=False,
# Override default context limits per model
model_context_limits={
"gpt-4o": 128000,
"gpt-4o-mini": 128000,
},
# Database location (defaults to temp directory)
# store_url="sqlite:////absolute/path/to/headroom.db",
)
```
Per-Request Overrides [#per-request-overrides]
Override configuration for individual requests:
```ts twoslash
import { compress } from 'headroom-ai';
const result = await compress(messages, {
model: 'gpt-4o',
tokenBudget: 100_000,
timeout: 15_000,
});
```
```python
response = client.chat.completions.create(
model="gpt-4o",
messages=[...],
# Override mode for this request
headroom_mode="audit",
# Reserve more tokens for output
headroom_output_buffer_tokens=8000,
# Keep last N turns (don't compress)
headroom_keep_turns=5,
# Skip compression for specific tools
headroom_tool_profiles={
"important_tool": {"skip_compression": True}
},
)
```
SmartCrusher Configuration [#smartcrusher-configuration]
Fine-tune JSON compression behavior:
```python
from headroom.transforms import SmartCrusherConfig
config = SmartCrusherConfig(
# Maximum items to keep after compression
max_items_after_crush=15,
# Minimum tokens before applying compression
min_tokens_to_crush=200,
# Relevance scoring tier: "bm25" (fast) or "embedding" (accurate)
relevance_tier="bm25",
# Always keep items with these field values
preserve_fields=["error", "warning", "failure"],
)
```
CacheAligner Configuration [#cachealigner-configuration]
Control prefix stabilization for provider cache hit rates:
```python
from headroom.transforms import CacheAlignerConfig
config = CacheAlignerConfig(
# Enable/disable cache alignment
enabled=True,
# Patterns to extract from system prompt
dynamic_patterns=[
r"Today is \w+ \d+, \d{4}",
r"Current time: .*",
],
)
```
RollingWindow Configuration [#rollingwindow-configuration]
Control context window management when messages exceed model limits:
```python
from headroom.transforms import RollingWindowConfig
config = RollingWindowConfig(
# Minimum turns to always keep
min_keep_turns=3,
# Reserve tokens for output
output_buffer_tokens=4000,
# Drop oldest tool outputs first
prefer_drop_tool_outputs=True,
)
```
IntelligentContext Configuration [#intelligentcontext-configuration]
Semantic-aware context management with importance scoring:
```python
from headroom.config import IntelligentContextConfig, ScoringWeights
# Customize scoring weights (automatically normalized to sum to 1.0)
weights = ScoringWeights(
recency=0.20, # Newer messages score higher
semantic_similarity=0.20, # Similarity to recent context
toin_importance=0.25, # TOIN-learned retrieval patterns
error_indicator=0.15, # TOIN-learned error field types
forward_reference=0.15, # Messages referenced by later messages
token_density=0.05, # Information density
)
config = IntelligentContextConfig(
enabled=True,
keep_system=True, # Never drop system messages
keep_last_turns=2, # Protect last N user turns
output_buffer_tokens=4000, # Reserve for model output
use_importance_scoring=True,
scoring_weights=weights,
toin_integration=True, # Use TOIN patterns if available
recency_decay_rate=0.1, # Exponential decay lambda
compress_threshold=0.1, # Try compression first if <10% over budget
)
```
Scoring Weights [#scoring-weights]
Weights are automatically normalized to sum to 1.0:
```python
weights = ScoringWeights(recency=1.0, toin_importance=1.0)
normalized = weights.normalized()
# recency=0.5, toin_importance=0.5, others=0.0
```
Proxy Configuration [#proxy-configuration]
Command Line Options [#command-line-options]
```bash
# Options: --port (listen port), --host (bind address),
# --budget (daily budget limit in USD), --log-file (log file path)
headroom proxy \
  --port 8787 \
  --host 0.0.0.0 \
  --budget 10.00 \
  --log-file headroom.jsonl
```
Feature Flags [#feature-flags]
```bash
# Disable optimization (passthrough mode)
headroom proxy --no-optimize
# Disable semantic caching
headroom proxy --no-cache
# Enable LLMLingua ML compression
headroom proxy --llmlingua
headroom proxy --llmlingua --llmlingua-device cuda --llmlingua-rate 0.4
```
Environment Variables [#environment-variables]
| Variable | Description | Default |
| ----------------------- | ----------------------------------------------- | -------------------------------- |
| `HEADROOM_LOG_LEVEL` | Logging level | `INFO` |
| `HEADROOM_STORE_URL` | Database URL | temp directory |
| `HEADROOM_DEFAULT_MODE` | Default mode | `optimize` |
| `HEADROOM_MODEL_LIMITS` | Custom model config (JSON string or file path) | -- |
| `HEADROOM_BASE_URL` | Base URL of the Headroom proxy (TypeScript SDK) | `http://localhost:8787` |
| `HEADROOM_API_KEY` | API key for Headroom Cloud authentication | -- |
| `HEADROOM_SAVINGS_PATH` | Override persistent savings file location | `~/.headroom/proxy_savings.json` |
| `HEADROOM_TELEMETRY` | Set to `off` to disable anonymous telemetry | `on` |
Custom Model Configuration [#custom-model-configuration]
Configure context limits and pricing for new or custom models:
```json
{
"anthropic": {
"context_limits": {
"claude-4-opus-20250301": 200000,
"claude-custom-finetune": 128000
},
"pricing": {
"claude-4-opus-20250301": {
"input": 15.00,
"output": 75.00,
"cached_input": 1.50
}
}
},
"openai": {
"context_limits": {
"gpt-5": 256000,
"ft:gpt-4o:my-org": 128000
}
}
}
```
Save as `~/.headroom/models.json`, or set `HEADROOM_MODEL_LIMITS` to a JSON string or file path.
Settings are resolved in this order (later overrides earlier):
1. Built-in defaults
2. `~/.headroom/models.json` config file
3. `HEADROOM_MODEL_LIMITS` environment variable
4. SDK constructor arguments
Pattern-Based Inference [#pattern-based-inference]
Unknown models are automatically inferred from naming patterns:
| Pattern | Inferred Settings |
| ------------ | ------------------------------------- |
| `*opus*` | 200K context, Opus-tier pricing |
| `*sonnet*` | 200K context, Sonnet-tier pricing |
| `*haiku*` | 200K context, Haiku-tier pricing |
| `gpt-4o*` | 128K context, GPT-4o pricing |
| `o1*`, `o3*` | 200K context, reasoning model pricing |
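The lookup can be approximated with shell-style globbing (a sketch of the idea only -- the pattern table and match order below mirror the table above, not Headroom's internal rules):

```python
from fnmatch import fnmatch

# Illustrative pattern table; first match wins. Pricing tiers omitted.
PATTERNS = [
    ("*opus*", 200_000),
    ("*sonnet*", 200_000),
    ("*haiku*", 200_000),
    ("gpt-4o*", 128_000),
    ("o1*", 200_000),
    ("o3*", 200_000),
]

def infer_context_limit(model: str, default: int = 128_000) -> int:
    """Infer a context limit for an unknown model from its name."""
    for pattern, limit in PATTERNS:
        if fnmatch(model, pattern):
            return limit
    return default

assert infer_context_limit("claude-sonnet-4-20250514") == 200_000
assert infer_context_limit("gpt-4o-mini") == 128_000
```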
Provider-Specific Settings [#provider-specific-settings]
```python
from headroom import OpenAIProvider
provider = OpenAIProvider(
enable_prefix_caching=True,
)
```
```python
from headroom import AnthropicProvider
provider = AnthropicProvider(
enable_cache_control=True,
)
```
```python
from headroom import GoogleProvider
provider = GoogleProvider(
enable_context_caching=True,
)
```
Tool Profiles [#tool-profiles]
Skip or customize compression for specific tools:
```python
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
headroom_tool_profiles={
"important_tool": {"skip_compression": True},
"search_tool": {"max_items_after_crush": 25},
},
)
```
Configuration Precedence [#configuration-precedence]
Settings are applied in this order (later overrides earlier):
1. Default values
2. Environment variables
3. SDK constructor arguments
4. Per-request overrides
Validation [#validation]
Validate your configuration at startup:
```python
result = client.validate_setup()
if not result["valid"]:
print("Configuration issues:")
for issue in result["issues"]:
print(f" - {issue}")
```
# Context Management (/docs/context-management)
When conversations grow beyond a model's context window, Headroom decides which messages to keep and which to drop. Instead of naively removing the oldest messages, **IntelligentContext** scores every message by learned importance and drops the least valuable ones first.
IntelligentContext [#intelligentcontext]
IntelligentContext is a message-level compressor. It analyzes your conversation, assigns an importance score to each message, and removes low-scoring messages until the conversation fits within the token budget.
Dropped messages are not lost -- they are stored in [CCR](/docs/ccr) for on-demand retrieval by the LLM.
```
100-message conversation (50K tokens) with a 32K budget
-> Score each message by importance
-> Drop 60 lowest-scoring messages
-> Cache dropped messages in CCR (hash=def456)
-> Insert marker: "60 messages dropped, retrieve: def456"
-> Final context: 40 messages within budget
```
Scoring weights [#scoring-weights]
Each message receives a weighted score from six factors:
| Weight | Default | Description |
| --------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `recency` | 0.20 | Exponential decay from the end of the conversation. Recent messages score higher. |
| `semantic_similarity` | 0.20 | Embedding cosine similarity to recent context. Messages related to the current topic score higher. |
| `toin_importance` | 0.25 | TOIN retrieval rate -- messages matching patterns that users frequently retrieve via CCR are scored higher. Learned across all users. |
| `error_indicator` | 0.15 | TOIN field semantics error detection. Messages containing error patterns (learned, not hardcoded) are preserved. |
| `forward_reference` | 0.15 | Count of later messages that reference this one. Messages that other messages depend on are kept. |
| `token_density` | 0.05 | Unique tokens divided by total tokens. Dense, information-rich messages score higher than repetitive ones. |
Error detection does not rely on keyword matching like "error" or "fail". Instead, it uses TOIN's learned `field_semantics.inferred_type` to identify error-bearing messages -- this adapts to your specific data patterns across sessions and users.
Weights are automatically normalized to sum to 1.0, so you can set relative values without worrying about exact proportions.
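How the factors combine can be sketched as a normalized weighted sum with exponential recency decay (illustrative only; Headroom's scorer computes the individual factors from real conversation data):

```python
import math

def recency_score(index: int, total: int, decay: float = 0.1) -> float:
    """Exponential decay from the end of the conversation: last message -> 1.0."""
    return math.exp(-decay * (total - 1 - index))

def combined_score(factors: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum with weights normalized to sum to 1.0."""
    total_w = sum(weights.values())
    return sum(weights[k] / total_w * factors.get(k, 0.0) for k in weights)

weights = {"recency": 0.20, "toin_importance": 0.25, "error_indicator": 0.15}
# Hypothetical factor values for two messages in a 100-message conversation:
old_error = {"recency": recency_score(0, 100), "toin_importance": 0.9, "error_indicator": 1.0}
new_chatter = {"recency": recency_score(99, 100), "toin_importance": 0.1, "error_indicator": 0.0}

# An old error message can outrank a brand-new low-value message.
assert combined_score(old_error, weights) > combined_score(new_chatter, weights)
```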
Rolling window fallback [#rolling-window-fallback]
If IntelligentContext is disabled or scoring data is unavailable, Headroom falls back to a **rolling window** strategy:
* Drop the oldest messages first
* Always keep the system prompt
* Always keep the last N user/assistant turns
* Drop tool calls and their responses as atomic pairs (no orphaned tool data)
This provides a safe baseline that works without any learned data.
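The fallback can be sketched as follows (a simplified model: messages are `(role, tokens)` pairs, one user/assistant pair per turn, and the tool-pair atomicity rule is omitted for brevity):

```python
def rolling_window(messages, budget, keep_last_turns=3):
    """Keep system messages and the last N turns, then fill newest-first."""
    system = [m for m in messages if m[0] == "system"]
    rest = [m for m in messages if m[0] != "system"]
    protected = rest[-keep_last_turns * 2:]   # user + assistant per turn
    older = rest[:-keep_last_turns * 2]
    used = sum(t for _, t in system) + sum(t for _, t in protected)
    kept_older = []
    for msg in reversed(older):               # newest older messages first
        if used + msg[1] <= budget:
            kept_older.append(msg)
            used += msg[1]
    return system + list(reversed(kept_older)) + protected

msgs = [("system", 100)] + [("user", 500), ("assistant", 500)] * 5
result = rolling_window(msgs, budget=2000, keep_last_turns=1)
assert result[0][0] == "system"  # system prompt always survives
assert len(result) == 4          # system + 1 older message + last turn
```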
Protection rules [#protection-rules]
Headroom enforces several protections to ensure model output quality:
Output buffer reservation [#output-buffer-reservation]
A configurable number of tokens is reserved for the model's response. The context budget is calculated as:
```
context_budget = model_context_limit - output_buffer_tokens
```
This prevents the input from consuming the entire context window and leaving no room for the model to respond.
System message protection [#system-message-protection]
System messages are never dropped. They contain critical instructions, persona definitions, and tool descriptions that the model needs throughout the conversation.
Turn protection [#turn-protection]
The last N user/assistant turns are always preserved, ensuring the model has immediate conversational context. By default, the last 2 turns are protected.
Configuration [#configuration]
```ts twoslash
import { compress } from "headroom-ai";
import type {
IntelligentContextConfig,
ScoringWeights,
RollingWindowConfig,
HeadroomConfig,
} from "headroom-ai";
// Scoring weights (normalized automatically)
const scoringWeights: ScoringWeights = {
recency: 0.20,
semanticSimilarity: 0.20,
toinImportance: 0.25,
errorIndicator: 0.15,
forwardReference: 0.15,
tokenDensity: 0.05,
};
// IntelligentContext configuration
const intelligentContext: IntelligentContextConfig = {
enabled: true,
keepSystem: true,
keepLastTurns: 2,
outputBufferTokens: 4000,
useImportanceScoring: true,
scoringWeights,
toinIntegration: true,
recencyDecayRate: 0.1,
compressThreshold: 0.1,
};
// Rolling window fallback
const rollingWindow: RollingWindowConfig = {
enabled: true,
keepSystem: true,
keepLastTurns: 3,
outputBufferTokens: 4000,
};
// Full configuration
const config: HeadroomConfig = {
intelligentContext,
rollingWindow,
};
const result = await compress(messages, {
model: "gpt-4o",
config,
});
console.log(`Compressed: ${result.tokensBefore} -> ${result.tokensAfter}`);
```
```python
from headroom import HeadroomClient, OpenAIProvider
from headroom.config import IntelligentContextConfig, ScoringWeights
from openai import OpenAI
# Customize scoring weights
weights = ScoringWeights(
recency=0.20,
semantic_similarity=0.20,
toin_importance=0.25,
error_indicator=0.15,
forward_reference=0.15,
token_density=0.05,
)
context_config = IntelligentContextConfig(
enabled=True,
keep_system=True, # Never drop system messages
keep_last_turns=2, # Protect last 2 user turns
output_buffer_tokens=4000, # Reserve for model output
use_importance_scoring=True,
scoring_weights=weights,
toin_integration=True, # Use TOIN patterns
recency_decay_rate=0.1, # Exponential decay lambda
compress_threshold=0.1, # Try compression first if <10% over budget
)
client = HeadroomClient(
original_client=OpenAI(),
provider=OpenAIProvider(),
default_mode="optimize",
)
# Per-request overrides
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
headroom_output_buffer_tokens=8000, # More room for long responses
headroom_keep_turns=5, # Protect last 5 turns
)
```
How scoring improves over time [#how-scoring-improves-over-time]
IntelligentContext integrates with TOIN (Tool-Output Intelligence Network) to learn from real usage:
1. Messages are dropped based on current scores
2. Dropped messages are stored in CCR
3. If the LLM retrieves a dropped message, TOIN records that pattern
4. Future conversations score similar message patterns higher
5. Drop accuracy improves across all users, not just within one session
This feedback loop means the system gets smarter the more it is used. Error messages that users frequently need are automatically preserved, while verbose success messages that nobody retrieves are dropped more aggressively.
# Error Handling (/docs/errors)
Headroom provides explicit exceptions for debugging, with a core safety guarantee: **compression failures never break your LLM calls**. If compression fails, the original content passes through unchanged.
Error Hierarchy [#error-hierarchy]
```
HeadroomError (base class)
+-- HeadroomConnectionError # Cannot reach proxy
+-- HeadroomAuthError # 401 from proxy
+-- HeadroomCompressError # Compression failed (with statusCode)
+-- ConfigurationError # Invalid configuration
+-- ProviderError # Provider issues
+-- StorageError # Storage failures
+-- TokenizationError # Token counting failed
+-- CacheError # Cache operations failed
+-- ValidationError # Validation failures
+-- TransformError # Transform execution failed
```
```ts twoslash
import {
HeadroomError,
HeadroomConnectionError,
HeadroomAuthError,
HeadroomCompressError,
ConfigurationError,
ProviderError,
mapProxyError,
} from 'headroom-ai';
```
```
HeadroomError (base class)
+-- ConfigurationError # Invalid configuration
+-- ProviderError # Provider issues (unknown model, etc.)
+-- StorageError # Database/storage failures
+-- CompressionError # Compression failures (rare)
+-- ValidationError # Setup validation failures
```
```python
from headroom import (
HeadroomError,
ConfigurationError,
ProviderError,
StorageError,
CompressionError,
ValidationError,
)
```
Catching Errors [#catching-errors]
```ts twoslash
import { compress, HeadroomConnectionError, HeadroomAuthError, HeadroomCompressError, HeadroomError } from 'headroom-ai';
try {
const result = await compress(messages, { model: 'gpt-4o' });
} catch (e) {
if (e instanceof HeadroomConnectionError) {
console.error('Cannot reach proxy:', e.message);
} else if (e instanceof HeadroomAuthError) {
console.error('Auth failed:', e.message);
} else if (e instanceof HeadroomCompressError) {
console.error(`Compress failed (${e.statusCode}):`, e.message);
} else if (e instanceof HeadroomError) {
console.error('Headroom error:', e.message, e.details);
}
}
```
```python
from headroom import (
HeadroomClient,
HeadroomError,
ConfigurationError,
StorageError,
)
try:
client = HeadroomClient(...)
response = client.chat.completions.create(...)
except ConfigurationError as e:
print(f"Config issue: {e}")
print(f"Details: {e.details}")
except StorageError as e:
print(f"Storage issue: {e}")
# Headroom continues to work, just without metrics persistence
except HeadroomError as e:
print(f"Headroom error: {e}")
```
Error Types in Detail [#error-types-in-detail]
ConfigurationError [#configurationerror]
Raised when configuration is invalid.
```ts twoslash
import { ConfigurationError } from 'headroom-ai';
// ConfigurationError is thrown when the proxy returns
// a configuration_error type in its error response
```
```python
try:
client = HeadroomClient(
original_client=OpenAI(),
provider=OpenAIProvider(),
default_mode="invalid_mode", # Will raise ConfigurationError
)
except ConfigurationError as e:
print(f"Config error: {e}")
print(f"Field: {e.details.get('field')}")
```
ProviderError [#providererror]
Raised for provider-specific issues (unknown model, API error, token counting failure).
```python
try:
response = client.chat.completions.create(
model="unknown-model-xyz",
messages=[...],
)
except ProviderError as e:
print(f"Provider error: {e}")
print(f"Provider: {e.details.get('provider')}")
```
StorageError [#storageerror]
Raised when database operations fail. Storage errors do not affect core functionality -- the application can continue without historical metrics.
```python
try:
metrics = client.get_metrics()
except StorageError as e:
metrics = [] # Continue without historical metrics
```
CompressionError [#compressionerror]
Raised when compression fails (rare). In practice, compression errors are caught internally and the original content passes through unchanged. This exception is only raised in strict mode.
HeadroomConnectionError (TypeScript) [#headroomconnectionerror-typescript]
Raised when the TypeScript SDK cannot connect to the Headroom proxy.
```ts twoslash
import { compress, HeadroomConnectionError } from 'headroom-ai';
try {
await compress(messages, { model: 'gpt-4o' });
} catch (e) {
if (e instanceof HeadroomConnectionError) {
console.error('Is the proxy running? Start with: headroom proxy');
}
}
```
Proxy Error Mapping [#proxy-error-mapping]
The TypeScript SDK automatically maps proxy error responses to the correct error class:
| HTTP Status | Proxy Error Type | TypeScript Class |
| ----------- | --------------------- | ----------------------- |
| 401 | -- | `HeadroomAuthError` |
| 4xx/5xx | `configuration_error` | `ConfigurationError` |
| 4xx/5xx | `provider_error` | `ProviderError` |
| 4xx/5xx | `storage_error` | `StorageError` |
| 4xx/5xx | `tokenization_error` | `TokenizationError` |
| 4xx/5xx | `validation_error` | `ValidationError` |
| 4xx/5xx | `transform_error` | `TransformError` |
| 4xx/5xx | (other) | `HeadroomCompressError` |
The `mapProxyError()` function handles this mapping:
```ts twoslash
import { mapProxyError } from 'headroom-ai';
const error = mapProxyError(400, 'configuration_error', 'Invalid mode');
// Returns a ConfigurationError instance
```
Error Details [#error-details]
All Headroom exceptions include a `details` dict/object with additional context:
```ts twoslash
import { HeadroomError } from 'headroom-ai';
// HeadroomError.details is Record<string, unknown> | undefined
// HeadroomCompressError also has .statusCode and .errorType
```
```python
try:
client = HeadroomClient(...)
except HeadroomError as e:
print(f"Error: {e}")
print(f"Type: {type(e).__name__}")
print(f"Details: {e.details}")
# Details might include:
# - field: which config field caused the error
# - provider: which provider was involved
# - model: which model was requested
# - original_error: underlying exception
```
Safety Guarantee [#safety-guarantee]
If compression fails, the original content passes through unchanged. Your LLM calls never fail due to Headroom:
```python
messages = [
{"role": "tool", "content": "malformed json {{{"}
]
# This will NOT raise an exception
# The malformed content passes through unchanged
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
)
```
Best Practices [#best-practices]
1. **Catch specific exceptions** rather than broad `Exception` to avoid hiding real bugs
2. **Let StorageError pass** -- storage errors do not affect core compression functionality
3. **Validate on startup** with `client.validate_setup()` to catch configuration issues early
4. **Enable logging** at WARNING level to see when compression is skipped
```python
import logging
logging.basicConfig(level=logging.WARNING)
# WARNING:headroom.transforms.smart_crusher:Skipping compression: invalid JSON
```
# Failure Learning (/docs/failure-learning)
`headroom learn` analyzes past coding agent sessions, finds what went wrong, correlates each failure with what eventually worked, and writes specific project-level learnings that prevent the same mistakes next session.
Quick Start [#quick-start]
```bash
# See recommendations for current project (dry-run, no changes)
headroom learn
# Write recommendations to CLAUDE.md and MEMORY.md
headroom learn --apply
# Analyze a specific project
headroom learn --project ~/my-project --apply
# Analyze all projects
headroom learn --all --apply
```
Success Correlation [#success-correlation]
The core innovation. Instead of cataloging failures ("Read failed 5 times"), Headroom finds what the model did to **fix** each failure:
* **Failed**: `Read axion-formats/src/main/java/.../FirstClassEntity.java`
* **Then succeeded**: `Read axion-scala-common/src/main/scala/.../FirstClassEntity.scala`
* **Learning**: "`FirstClassEntity` is at `axion-scala-common/`, not `axion-formats/`"
This produces specific, actionable corrections -- not generic advice.
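The correlation pass can be sketched as a scan over normalized tool calls: for each failure, find the next successful call of the same tool and pair the two. The field names (`tool`, `args`, `ok`) are illustrative, not Headroom's actual `ToolCall` schema:

```python
# Illustrative sketch of success correlation: pair each failed tool call
# with the next successful call of the same tool. Field names here are
# hypothetical, not Headroom's internal ToolCall format.
def correlate_failures(calls):
    learnings = []
    for i, call in enumerate(calls):
        if call["ok"]:
            continue
        # Look ahead for the first successful call of the same tool
        for later in calls[i + 1:]:
            if later["tool"] == call["tool"] and later["ok"]:
                learnings.append((call["args"], later["args"]))
                break
    return learnings

calls = [
    {"tool": "Read", "args": "axion-formats/.../FirstClassEntity.java", "ok": False},
    {"tool": "Read", "args": "axion-scala-common/.../FirstClassEntity.scala", "ok": True},
]
for failed, worked in correlate_failures(calls):
    print(f"Failed: {failed} -> worked: {worked}")
```

The failure/success pair is exactly the raw material for a learning like the `FirstClassEntity` correction above.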
What It Learns [#what-it-learns]
Environment Facts [#environment-facts]
Which runtime commands work vs fail.
```markdown
### Environment
- **Python**: use `uv run python` (not `python3` -- modules not available outside venv)
```
File Path Corrections [#file-path-corrections]
Wrong paths the model keeps guessing, with the correct locations.
```markdown
### File Path Corrections
- `axion-common/src/.../AxionSparkConstants.scala`
-> actually at `axion-spark-common/src/.../AxionSparkConstants.scala`
```
Search Scope [#search-scope]
Which directories to search in (narrow paths fail, broader ones work).
```markdown
### Search Scope
- Don't search `axion-model/` -> use `axion/` (the repo root)
```
Command Patterns [#command-patterns]
How commands should (and should not) be run.
```markdown
### Command Patterns
- **user_prefers_manual**: User rejected gradle 18 times -- show the command, don't execute
- **python_runtime**: Use `uv run python` not `python3` (ModuleNotFoundError)
```
Known Large Files [#known-large-files]
Files that need `offset`/`limit` with Read.
```markdown
### Known Large Files
- `proxy/server.py` (~8000 lines) -- always use offset/limit
```
Where Learnings Go [#where-learnings-go]
| Pattern | Destination | Why |
| ------------------------------------------------------- | ------------- | ------------------------------------------ |
| Environment, paths, search scope, commands, large files | **CLAUDE.md** | Stable project facts, version-controllable |
| Missing paths, retry patterns, permissions | **MEMORY.md** | May change, agent-specific |
CLAUDE.md lives in your project directory. MEMORY.md lives in `~/.claude/projects/*/memory/`.
Marker-Based Updates [#marker-based-updates]
Headroom manages a clearly-delimited section in each file:
```markdown
## Headroom Learned Patterns
*Auto-generated by `headroom learn` -- do not edit manually*
...
```
On re-run, only the content between markers is replaced. Your existing file content is preserved.
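The replace-between-markers behavior can be sketched in a few lines; the delimiter strings below are illustrative, not Headroom's exact markers:

```python
# Sketch of marker-based updates: replace only the text between two
# delimiters, preserving the rest of the file. Marker strings are
# illustrative, not Headroom's actual delimiters.
import re

BEGIN = "<!-- headroom:begin -->"
END = "<!-- headroom:end -->"

def update_section(doc: str, new_body: str) -> str:
    block = f"{BEGIN}\n{new_body}\n{END}"
    if BEGIN in doc and END in doc:
        # Re-run: replace the existing managed section in place
        pattern = re.escape(BEGIN) + r".*?" + re.escape(END)
        return re.sub(pattern, block, doc, count=1, flags=re.DOTALL)
    # First run: append the managed section
    return doc.rstrip() + "\n\n" + block + "\n"

doc = "# My Project\n\nHand-written notes.\n"
doc = update_section(doc, "- old learning")
doc = update_section(doc, "- new learning")
print(doc)
```

Running twice leaves one managed section with the latest content, and the hand-written notes untouched.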
Architecture [#architecture]
The system is built with an adapter pattern so it can support multiple agent systems:
* **Scanners** read tool-specific log formats (e.g., `~/.claude/projects/*.jsonl`) and produce normalized `ToolCall` sequences
* **Analyzers** work on `ToolCall` data -- same analysis logic for any agent system
* **Writers** output to tool-specific context injection mechanisms (e.g., CLAUDE.md)
To add support for a new agent (e.g., Cursor), you write a Scanner that reads its log format and a Writer that outputs to `.cursorrules`. The analyzers stay the same.
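The split can be expressed with structural typing; the names below (`ToolCall`, `Scanner`, `Writer`, `analyze`) are illustrative stand-ins for Headroom's internal interfaces:

```python
# Sketch of the Scanner/Analyzer/Writer adapter pattern described above.
# Names and signatures are illustrative, not Headroom's real API.
from dataclasses import dataclass
from typing import Iterable, Protocol

@dataclass
class ToolCall:
    tool: str
    args: str
    ok: bool

class Scanner(Protocol):
    def scan(self, log_path: str) -> Iterable[ToolCall]: ...

class Writer(Protocol):
    def write(self, learnings: list[str]) -> None: ...

# Analyzers only see normalized ToolCall data, so adding Cursor support
# means one new Scanner + one new Writer; this function stays unchanged.
def analyze(calls: Iterable[ToolCall]) -> list[str]:
    return [f"{c.tool} failed on {c.args}" for c in calls if not c.ok]

print(analyze([ToolCall("Read", "missing.py", False)]))
```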
CLI Reference [#cli-reference]
```bash
headroom learn [OPTIONS]
Options:
--project PATH Project directory to analyze (default: current directory)
--all Analyze all discovered projects
--apply Write recommendations (default: dry-run)
--claude-dir PATH Path to .claude directory (default: ~/.claude)
```
Real-World Results [#real-world-results]
Tested on 67,583 tool calls across 23 projects:
| Metric | Value |
| ------------------------ | --------------------- |
| Failure rate | 7.5% (5,066 failures) |
| Corrections extracted | 164 per project (avg) |
| Path corrections | 22 (axion project) |
| Search scope corrections | 24 (axion project) |
| Command patterns learned | 5 (axion project) |
# How Compression Works (/docs/how-compression-works)
Headroom automatically detects what kind of content you're sending and routes it to the right compressor. You don't need to configure anything -- just call `compress()` and the pipeline handles the rest.
The Three-Stage Pipeline [#the-three-stage-pipeline]
Every request flows through three stages:
```
┌──────────────┐ ┌────────────────┐ ┌─────────────────────┐
│ CacheAligner │────>│ ContentRouter │────>│ IntelligentContext │
│ │ │ │ │ │
│ Stabilize │ │ Detect type & │ │ Score messages & │
│ prefix for │ │ route to best │ │ fit within token │
│ cache hits │ │ compressor │ │ budget │
└──────────────┘ └────────────────┘ └─────────────────────┘
```
1. **CacheAligner** extracts dynamic content (dates, user context) from your system prompt so the static prefix stays cacheable across requests.
2. **ContentRouter** inspects each tool output and routes it to the optimal compressor -- SmartCrusher for JSON arrays, CodeAwareCompressor for source code, LogCompressor for build output, and so on.
3. **IntelligentContext** scores every message by importance (recency, semantic relevance, error indicators) and drops the lowest-value messages to fit within the model's context window.
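Stage 3 can be illustrated with a toy scoring function: newer messages and messages carrying error indicators score higher, so routine noise is dropped first. The weights and signals here are illustrative, not Headroom's actual scoring model:

```python
# Toy sketch of importance scoring: combine recency with a simple
# error-indicator bonus. Weights are illustrative, not Headroom's model.
def score(messages):
    n = len(messages)
    scored = []
    for i, msg in enumerate(messages):
        recency = (i + 1) / n  # later messages are more recent
        has_error = any(w in msg.lower() for w in ("error", "fatal", "traceback"))
        scored.append(recency + (1.0 if has_error else 0.0))
    return scored

msgs = ["setup ok", "FATAL: disk full", "routine status ping"]
print(score(msgs))  # the FATAL message outranks the newer routine ping
```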
Content Type Detection [#content-type-detection]
The router auto-detects content type by analyzing structure and patterns. No manual hints required.
| Content Type | Detection Signal | Compressor | Typical Savings |
| --------------- | ------------------------------------------ | ------------------- | --------------- |
| JSON arrays | Valid JSON with array elements | SmartCrusher | 70-90% |
| Source code | Syntax patterns, indentation, keywords | CodeAwareCompressor | 40-70% |
| Search results | `file:line:content` format | SearchCompressor | 80-95% |
| Build/test logs | Timestamps, log levels, pytest/npm markers | LogCompressor | 85-95% |
| Diffs | Unified diff format | DiffCompressor | 60-80% |
| HTML | Tag structure | HTMLCompressor | 50-70% |
| Plain text | Fallback | TextCompressor | 60-80% |
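A rough sketch of the pattern-based side of this routing (the real router also uses ML detection via Magika; these heuristics are illustrative, not Headroom's actual rules):

```python
# Illustrative content-type heuristics along the lines of the table above.
import json
import re

def detect(content: str) -> str:
    try:
        if isinstance(json.loads(content), list):
            return "json_array"
    except ValueError:
        pass
    if re.search(r"^\S+:\d+:", content, re.MULTILINE):
        return "search_results"  # file:line:content format
    if re.search(r"^(diff --git|@@ -\d)", content, re.MULTILINE):
        return "diff"
    if re.search(r"^\s*(def |class |import )", content, re.MULTILINE):
        return "source_code"
    return "plain_text"

print(detect('[{"id": 1}]'))        # json_array
print(detect("src/app.py:42:bug"))  # search_results
```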
Quick Start [#quick-start]
```ts twoslash
import { compress } from "headroom-ai";
const messages = [
{ role: "system" as const, content: "You are a helpful assistant." },
{ role: "user" as const, content: "Summarize this data" },
{ role: "tool" as const, content: '{"results": [...]}', tool_call_id: "call_1" },
];
const result = await compress(messages);
console.log(`Tokens saved: ${result.tokensSaved}`);
console.log(`Compression ratio: ${result.compressionRatio}`);
```
```python
from headroom.compression import compress
result = compress(content)
print(result.compressed)
print(f"Saved {result.savings_percentage:.0f}% tokens")
```
Configuring the Compressor [#configuring-the-compressor]
```ts twoslash
import { compress } from "headroom-ai";
const result = await compress(messages, {
model: "gpt-4o",
tokenBudget: 50000,
});
console.log(`Before: ${result.tokensBefore} tokens`);
console.log(`After: ${result.tokensAfter} tokens`);
console.log(`Transforms: ${result.transformsApplied.join(", ")}`);
```
```python
from headroom.compression import UniversalCompressor, UniversalCompressorConfig
config = UniversalCompressorConfig(
compression_ratio_target=0.5, # Keep 50% of content
use_entropy_preservation=True, # Preserve UUIDs, hashes
use_magika=True, # ML-based content detection
ccr_enabled=True, # Store originals for retrieval
)
compressor = UniversalCompressor(config=config)
result = compressor.compress(content)
print(f"Type: {result.content_type}")
print(f"Handler: {result.handler_used}")
print(f"Saved: {result.savings_percentage:.0f}%")
```
Structure Preservation [#structure-preservation]
Headroom doesn't blindly truncate. It identifies what matters in each content type and preserves it:
| Content Type | What's Preserved | What's Compressed |
| ------------ | ------------------------------------------------------ | ---------------------------------- |
| **JSON** | Keys, brackets, booleans, nulls, short values, UUIDs | Long string values, whitespace |
| **Code** | Imports, function signatures, class definitions, types | Function bodies, comments |
| **Logs** | Timestamps, log levels, error messages, stack traces | Repeated patterns, verbose details |
| **Text** | High-entropy tokens (IDs, hashes), headers | Low-information content |
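The JSON row can be illustrated with a toy pass that keeps keys, structure, and short values verbatim while shortening long string values. This is a sketch only; SmartCrusher does statistical analysis, not simple truncation:

```python
# Toy illustration of JSON structure preservation: keys, brackets,
# booleans, nulls, numbers, and short strings survive; long string
# values are shortened. Not Headroom's actual algorithm.
import json

def shrink(value, max_len=20):
    if isinstance(value, dict):
        return {k: shrink(v, max_len) for k, v in value.items()}
    if isinstance(value, list):
        return [shrink(v, max_len) for v in value]
    if isinstance(value, str) and len(value) > max_len:
        return value[:max_len] + "..."
    return value  # everything else is kept verbatim

record = {"id": "a1b2", "ok": True, "body": "x" * 200}
print(json.dumps(shrink(record)))
```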
Real Compression Ratios [#real-compression-ratios]
| Content Type | Compression | Speed | What's Preserved |
| -------------------- | ----------- | ------ | -------------------- |
| JSON (large arrays) | 70-90% | \~1ms | All keys, structure |
| Source code (Python) | 50-70% | \~10ms | Signatures, imports |
| Search results | 80-95% | \~2ms | Relevant matches |
| Build logs | 85-95% | \~3ms | Errors, stack traces |
| Plain text | 60-80% | \~5ms | High-entropy tokens |
Batch Compression [#batch-compression]
When compressing many pieces of content, batch compression is more efficient:
```python
from headroom.compression import UniversalCompressor
compressor = UniversalCompressor()
contents = [
'{"users": [...]}',
'def hello(): pass',
'Plain text content',
]
results = compressor.compress_batch(contents)
for result in results:
print(f"{result.content_type}: {result.savings_percentage:.0f}% saved")
```
What Happens Under the Hood [#what-happens-under-the-hood]
When you call `compress()`, here is the full sequence:
1. **Content detection** -- Magika (ML-based) or pattern matching identifies the content type
2. **Structure extraction** -- A handler extracts a structure mask marking what to preserve
3. **Compression** -- Non-structural content is compressed (SmartCrusher, LLMLingua, or text utilities)
4. **CCR storage** -- If enabled, the original is stored for retrieval when the LLM needs full context
The pipeline works out of the box with no configuration. All detection, routing, and compression happens automatically. Configuration is available when you need fine-grained control.
# Image Compression (/docs/image-compression)
Vision models charge by the token, and images are expensive. A single 1024x1024 image costs \~765 tokens on OpenAI. Headroom's image compression uses a trained ML router to analyze your query and automatically select the optimal compression technique, saving 40-90% of image tokens.
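The ~765-token figure can be reproduced from OpenAI's documented vision pricing: high-detail images are scaled to fit 2048px, then to a 768px short side, split into 512px tiles, and billed at 85 base tokens plus 170 per tile, while `detail="low"` is a flat 85. A quick sketch of that arithmetic (not Headroom code):

```python
import math

# OpenAI high-detail vision pricing: scale so the long side is <= 2048px
# and the short side is <= 768px, split into 512px tiles, then bill
# 85 base tokens + 170 per tile. detail="low" is a flat 85 tokens.
def openai_image_tokens(width, height, detail="high"):
    if detail == "low":
        return 85
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(openai_image_tokens(1024, 1024))                # 765
print(openai_image_tokens(1024, 1024, detail="low"))  # 85
```

Dropping from 765 to 85 tokens is the ~89% `full_low` saving shown in the tables below.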
How It Works [#how-it-works]
```
User uploads image + asks question
|
[Query Analysis]
TrainedRouter (MiniLM from HuggingFace)
Classifies: "What animal is this?" -> full_low
|
[Image Analysis]
SigLIP analyzes image properties
(has text? complex? fine details?)
|
[Apply Compression]
OpenAI: detail="low"
Anthropic: Resize to 512px
Google: Resize to 768px
|
Compressed request to LLM
```
The router is a fine-tuned MiniLM classifier (`chopratejas/technique-router` on HuggingFace) with 93.7% accuracy across 1,157 training examples.
Compression Techniques [#compression-techniques]
| Technique | Savings | When Used | Example Query |
| ----------- | ------- | ----------------------- | -------------------------------------------------- |
| `full_low` | \~87% | General understanding | "What is this?", "Describe the scene" |
| `preserve` | 0% | Fine details needed | "Count the whiskers", "Read the serial number" |
| `crop` | 50-90% | Region-specific queries | "What's in the corner?", "Focus on the background" |
| `transcode` | \~99% | Text extraction | "Read the sign", "Transcribe the document" |
Quick Start [#quick-start]
With Headroom Proxy (Zero Code Changes) [#with-headroom-proxy-zero-code-changes]
```bash
# Start the proxy
headroom proxy --port 8787
# Connect your client -- images are compressed automatically
ANTHROPIC_BASE_URL=http://localhost:8787 claude
```
With HeadroomClient [#with-headroomclient]
```python
from headroom import HeadroomClient
client = HeadroomClient(provider="openai")
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "What animal is this?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}]
)
# Image automatically compressed with detail="low" (87% savings)
```
Direct API [#direct-api]
```python
from headroom.image import ImageCompressor
compressor = ImageCompressor()
# Compress images in messages
compressed_messages = compressor.compress(messages, provider="openai")
# Check savings
print(f"Saved {compressor.last_savings:.0f}% tokens")
print(f"Technique: {compressor.last_result.technique.value}")
```
Provider Support [#provider-support]
The compressor adapts its strategy per provider:
| Provider | Compression Method | Details |
| ----------------- | ------------------- | ------------------------------------------ |
| **OpenAI** | Sets `detail="low"` | Native detail parameter |
| **Anthropic** | Resizes to 512px | PIL-based resize |
| **Google Gemini** | Resizes to 768px | Optimized for Gemini's 768x768 tile system |
Token Savings by Provider [#token-savings-by-provider]
**OpenAI** (1024x1024 image):
| Technique | Before | After | Savings |
| ---------- | ---------- | ---------- | ------- |
| `full_low` | 765 tokens | 85 tokens | 89% |
| `preserve` | 765 tokens | 765 tokens | 0% |
**Anthropic** (1024x1024 image):
| Before | After | Savings |
| -------------- | ------------ | ------- |
| \~1,398 tokens | \~349 tokens | 75% |
**Google Gemini** (1536x1536 image):
| Before | After | Savings |
| ---------------------- | ------------------- | ------- |
| 1,032 tokens (4 tiles) | 258 tokens (1 tile) | 75% |
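The Gemini numbers follow directly from its tile-based billing (258 tokens per 768x768 tile, as in the table above). A quick sketch of the arithmetic:

```python
import math

# Gemini bills images per 768x768 tile at 258 tokens each. Resizing a
# 1536x1536 image down to 768px drops it from 4 tiles to 1.
def gemini_image_tokens(width, height, tile=768, per_tile=258):
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return tiles * per_tile

before = gemini_image_tokens(1536, 1536)  # 4 tiles -> 1032 tokens
after = gemini_image_tokens(768, 768)     # 1 tile  -> 258 tokens
print(before, after, f"{1 - after / before:.0%} saved")
```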
Configuration [#configuration]
```python
from headroom.image import ImageCompressor
compressor = ImageCompressor(
model_id="chopratejas/technique-router", # HuggingFace model
use_siglip=True, # Enable image analysis
device="cuda", # Use GPU if available (auto, cuda, cpu, mps)
)
```
Proxy Configuration [#proxy-configuration]
```bash
# Enable image compression (default)
headroom proxy --image-optimize
# Disable image compression
headroom proxy --no-image-optimize
```
Performance [#performance]
| Metric | Value |
| ------------------- | ------------------------------------ |
| Router inference | \~10ms (CPU), \~2ms (GPU) |
| Image resize | \~5-20ms |
| First request | +2-3s (model download, cached after) |
| Router accuracy | 93.7% |
| Model size | \~128MB |
| GPU memory (SigLIP) | \~400MB |
When using the Headroom proxy, image compression happens automatically on every request that contains images. No code changes needed.
# Introduction (/docs)
Headroom compresses everything your AI agent reads -- tool outputs, database results, file reads, RAG retrievals, API responses -- before it reaches the LLM. The model sees less noise, responds faster, and costs less.
Quick preview [#quick-preview]
```ts twoslash
import { compress } from 'headroom-ai';
const messages = [
{ role: 'user' as const, content: 'Analyze these results' },
];
const result = await compress(messages, { model: 'gpt-4o' });
console.log(`Saved ${result.tokensSaved} tokens (${(result.compressionRatio * 100).toFixed(0)}%)`);
```
```python
from headroom import compress
result = compress(messages, model="gpt-4o")
response = client.chat.completions.create(
model="gpt-4o",
messages=result.messages,
)
print(f"Saved {result.tokens_saved} tokens ({result.compression_ratio:.0%})")
```
Community stats [#community-stats]
What gets compressed [#what-gets-compressed]
| Content type | What happens | Typical savings |
| -------------------------- | ------------------------------------------------------------ | --------------- |
| JSON arrays (tool outputs) | Statistical analysis keeps errors, anomalies, boundaries | 70--90% |
| Source code | AST-aware compression preserves signatures, collapses bodies | 40--70% |
| Build/test logs | Keeps failures and errors, drops passing noise | 80--95% |
| Search results | Ranks by relevance, keeps top matches | 60--80% |
| Plain text | ModernBERT token classification removes redundancy | 30--50% |
| Git diffs | Preserves change hunks, drops unchanged context | 40--60% |
| Images | ML router selects optimal resize/quality tradeoff | 40--90% |
Where Headroom fits [#where-headroom-fits]
```
Your Agent / App
|
| tool outputs, logs, DB reads, RAG results, file reads, API responses
v
Headroom <-- proxy, Python library, TS SDK, or framework integration
|
v
LLM Provider (OpenAI, Anthropic, Google, Bedrock, 100+ via LiteLLM)
```
Headroom works as a **transparent proxy** (zero code changes), a **Python function** (`compress()`), a **TypeScript function** (`compress()`), or a **framework integration** (LangChain, Agno, Strands, LiteLLM, Vercel AI SDK, MCP).
Real-world results [#real-world-results]
**100 production log entries. One critical error buried at position 67.**
| Metric | Baseline | Headroom |
| --------------- | -------- | -------- |
| Input tokens | 10,144 | 1,260 |
| Correct answers | **4/4** | **4/4** |
87.6% fewer tokens. Same answer. The FATAL error was automatically preserved -- not by keyword matching, but by statistical analysis of field variance.
| Scenario | Before | After | Savings |
| ------------------------- | ------ | ------ | ------- |
| Code search (100 results) | 17,765 | 1,408 | **92%** |
| SRE incident debugging | 65,694 | 5,118 | **92%** |
| Codebase exploration | 78,502 | 41,254 | **47%** |
| GitHub issue triage | 54,174 | 14,761 | **73%** |
Key Features [#key-features]
Framework Integrations [#framework-integrations]
Nothing is lost [#nothing-is-lost]
Compressed content goes into the CCR store (Compress-Cache-Retrieve). The LLM gets a `headroom_retrieve` tool and can fetch full originals when it needs more detail. Compression is aggressive but reversible.
Next steps [#next-steps]
# Installation (/docs/installation)
Python [#python]
Headroom requires **Python 3.10+** and is published as `headroom-ai` on PyPI.
Core package [#core-package]
```bash
pip install headroom-ai
```
The core package includes the `compress()` function, SmartCrusher, CacheAligner, and IntelligentContext. No heavy dependencies.
Extras [#extras]
Install only what you need, or grab everything with `[all]`:
```bash
pip install "headroom-ai[all]"
```
| Extra | What it adds | Install command |
| ----------- | ----------------------------------------------------------------------------- | -------------------------------------- |
| `proxy` | Proxy server, MCP tools, HTTP API | `pip install "headroom-ai[proxy]"` |
| `ml` | Kompress (ModernBERT text compression, requires PyTorch) | `pip install "headroom-ai[ml]"` |
| `code` | CodeCompressor (tree-sitter AST parsing) | `pip install "headroom-ai[code]"` |
| `mcp` | MCP server tools (`headroom_compress`, `headroom_retrieve`, `headroom_stats`) | `pip install "headroom-ai[mcp]"` |
| `langchain` | LangChain `HeadroomChatModel` wrapper | `pip install "headroom-ai[langchain]"` |
| `agno` | Agno `HeadroomAgnoModel` wrapper | `pip install "headroom-ai[agno]"` |
| `evals` | Evaluation framework (GSM8K, SQuAD, BFCL benchmarks) | `pip install "headroom-ai[evals]"` |
| `all` | Everything above | `pip install "headroom-ai[all]"` |
You can combine extras:
```bash
pip install "headroom-ai[proxy,langchain,ml]"
```
Verify the install [#verify-the-install]
```bash
python -c "import headroom; print(headroom.__version__)"
```
TypeScript / Node.js [#typescript--nodejs]
The TypeScript SDK is published as `headroom-ai` on npm. It requires **Node.js 18+**.
```bash
npm install headroom-ai
```
Or with other package managers:
```bash
pnpm add headroom-ai
yarn add headroom-ai
```
The TypeScript SDK sends messages to the Headroom proxy over HTTP for compression. The proxy runs the full compression pipeline (Python). Start it before using the SDK:
```bash
pip install "headroom-ai[proxy]"
headroom proxy --port 8787
```
Then point the SDK at it:
```ts
import { compress } from 'headroom-ai';
const result = await compress(messages, {
baseUrl: 'http://localhost:8787',
});
```
Verify the install [#verify-the-install-1]
```bash
node -e "const h = require('headroom-ai'); console.log('headroom-ai loaded')"
```
Docker [#docker]
Pre-built images are published to GitHub Container Registry on every release.
```bash
docker pull ghcr.io/chopratejas/headroom:latest
docker run -p 8787:8787 ghcr.io/chopratejas/headroom:latest
```
Image tags [#image-tags]
| Tag | Extras | Base image | Description |
| ------------------- | ------------ | ----------- | ----------------------------------------- |
| `latest` | `proxy` | Debian slim | Default image, runs the proxy |
| `<version>` | `proxy` | Debian slim | Pinned version |
| `nonroot` | `proxy` | Debian slim | Runs as non-root user |
| `code` | `proxy,code` | Debian slim | Includes tree-sitter for code compression |
| `code-nonroot` | `proxy,code` | Debian slim | Code compression, non-root |
| `slim` | `proxy` | Distroless | Minimal image, no shell |
| `slim-nonroot` | `proxy` | Distroless | Minimal, non-root |
| `code-slim` | `proxy,code` | Distroless | Code compression, minimal |
| `code-slim-nonroot` | `proxy,code` | Distroless | Code compression, minimal, non-root |
Build from source [#build-from-source]
Use Docker Bake for multi-variant builds:
```bash
# List all targets
docker buildx bake --list targets
# Build the default runtime image
docker buildx bake runtime-default
# Build a specific variant with custom registry
docker buildx bake runtime-code-slim-nonroot \
--set '*.tags=my-registry/headroom:code-slim-nonroot'
```
Environment variables [#environment-variables]
These variables configure Headroom at runtime. Set them in your shell, `.env` file, or container environment.
LLM provider keys [#llm-provider-keys]
| Variable | Description |
| --------------------------------------------- | --------------------------------------------------- |
| `OPENAI_API_KEY` | OpenAI API key (used when proxying to OpenAI) |
| `ANTHROPIC_API_KEY` | Anthropic API key (used when proxying to Anthropic) |
| `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` | AWS credentials for Bedrock backend |
| `GOOGLE_APPLICATION_CREDENTIALS` | Google Cloud credentials for Vertex AI backend |
Proxy configuration [#proxy-configuration]
| Variable | Default | Description |
| -------------------- | ---------- | --------------------------------------------------- |
| `HEADROOM_PORT` | `8787` | Port the proxy listens on |
| `HEADROOM_HOST` | `0.0.0.0` | Host the proxy binds to |
| `HEADROOM_MODE` | `optimize` | Default mode: `optimize`, `audit`, or `passthrough` |
| `HEADROOM_LOG_LEVEL` | `INFO` | Logging level |
TypeScript SDK [#typescript-sdk]
| Variable | Default | Description |
| ------------------- | ----------------------- | ---------------------------------- |
| `HEADROOM_BASE_URL` | `http://localhost:8787` | Proxy URL for the TypeScript SDK |
| `HEADROOM_API_KEY` | *(none)* | API key if the proxy requires auth |
Next steps [#next-steps]
# LangChain (/docs/langchain)
Headroom integrates with LangChain to compress context across all LangChain patterns: chat models, memory, retrievers, agents, and streaming.
Installation [#installation]
```bash
pip install "headroom-ai[langchain]"
```
Quick start [#quick-start]
Wrap any chat model in one line:
```python
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
# Use exactly like before
response = llm.invoke("Hello!")
# Check savings
print(llm.get_metrics())
# {'tokens_saved': 12500, 'savings_percent': 45.2, 'requests': 50}
```
Works with any provider:
```python
from langchain_anthropic import ChatAnthropic
llm = HeadroomChatModel(ChatAnthropic(model="claude-sonnet-4-20250514"))
```
Memory integration [#memory-integration]
`HeadroomChatMessageHistory` wraps any chat history with automatic compression. Long conversations stay under your token budget:
```python
from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import ChatMessageHistory
from headroom.integrations import HeadroomChatMessageHistory
base_history = ChatMessageHistory()
compressed_history = HeadroomChatMessageHistory(
base_history,
compress_threshold_tokens=4000, # Compress when over 4K tokens
keep_recent_turns=5, # Always keep last 5 turns
)
memory = ConversationBufferMemory(chat_memory=compressed_history)
```
After usage:
```python
print(compressed_history.get_compression_stats())
# {'compression_count': 12, 'total_tokens_saved': 28000}
```
Retriever integration [#retriever-integration]
`HeadroomDocumentCompressor` filters retrieved documents by relevance. Retrieve many for recall, keep the best for precision:
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.vectorstores import FAISS
from headroom.integrations import HeadroomDocumentCompressor
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 50})
compressor = HeadroomDocumentCompressor(
max_documents=10,
min_relevance=0.3,
prefer_diverse=True, # MMR-style diversity
)
retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=base_retriever,
)
# Retrieves 50 docs, returns best 10
docs = retriever.invoke("What is Python?")
```
Agent tool wrapping [#agent-tool-wrapping]
`wrap_tools_with_headroom` compresses tool outputs before they re-enter the agent's context:
```python
from langchain_core.tools import tool
from headroom.integrations import wrap_tools_with_headroom
@tool
def search_database(query: str) -> str:
"""Search the database."""
return json.dumps({"results": [...], "total": 1000})
wrapped_tools = wrap_tools_with_headroom(
[search_database],
min_chars_to_compress=1000,
)
agent = create_openai_tools_agent(llm, wrapped_tools, prompt)
executor = AgentExecutor(agent=agent, tools=wrapped_tools)
```
Per-tool metrics:
```python
from headroom.integrations import get_tool_metrics
metrics = get_tool_metrics()
print(metrics.get_summary())
# {'total_invocations': 25, 'total_compressions': 18, 'total_chars_saved': 450000}
```
LangGraph ReAct agent [#langgraph-react-agent]
```python
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from headroom.integrations import HeadroomChatModel, wrap_tools_with_headroom
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
tools = wrap_tools_with_headroom([search_web, query_database])
agent = create_react_agent(llm, tools)
result = agent.invoke({
"messages": [("user", "Find users who signed up last week")]
})
```
LangGraph custom graph [#langgraph-custom-graph]
Insert a compression node between tools and the agent in a custom `StateGraph`:
```python
from langgraph.graph import StateGraph, MessagesState, START, END
from headroom.integrations.langchain import create_compress_tool_messages_node
graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)
graph.add_node("tools", tools_node)
graph.add_node("compress", create_compress_tool_messages_node(
min_tokens_to_compress=100,
))
# Wire: tools -> compress -> agent
graph.add_edge(START, "agent")
graph.add_edge("tools", "compress")
graph.add_edge("compress", "agent")
```
Streaming [#streaming]
Full async support:
```python
# Async invoke
response = await llm.ainvoke("Hello!")
# Async streaming
async for chunk in llm.astream("Tell me a story"):
print(chunk.content, end="", flush=True)
```
Custom configuration [#custom-configuration]
```python
from headroom import HeadroomConfig, HeadroomMode
config = HeadroomConfig(
    default_mode=HeadroomMode.OPTIMIZE,
    smart_crusher_target_ratio=0.3,
)
llm = HeadroomChatModel(
    ChatOpenAI(model="gpt-4o"),
    headroom_config=config,
)
```
# Limitations (/docs/limitations)
Headroom is designed to compress LLM context without losing accuracy. This page documents when it helps, when it does not, and the safety gates that prevent harmful compression.
When Headroom Helps vs. Does Not [#when-headroom-helps-vs-does-not]
| Content Type | Compression | Latency Impact | Best For |
| ------------------------------------------------------------------ | ----------- | -------------------------------- | --------------------------- |
| **JSON: Arrays of dicts** (search results, API responses, DB rows) | 86--100% | Net latency win on Sonnet/Opus | Primary use case |
| **JSON: Arrays of strings** (file paths, log lines, tags) | 60--90% | Net latency win | String dedup + sampling |
| **JSON: Arrays of numbers** (metrics, time series) | 70--85% | Net latency win | Statistical summary |
| **JSON: Mixed-type arrays** | 50--70% | Net latency win | Group-by-type compression |
| **Structured logs** (as JSON) | 82--95% | Net latency win | Log entries in tool outputs |
| **Agentic conversations** (25--50 turns) | 56--81% | Break-even to net win | Multi-tool agent sessions |
| **Plain text** (documentation, articles) | 43--46% | Adds latency (cost savings only) | Cost optimization |
| **Code** | Passthrough | Minimal overhead | See below |
| **RAG document contexts** | Passthrough | Minimal overhead | Not compressed |
Where Headroom Adds the Most Value [#where-headroom-adds-the-most-value]
* Long agent sessions with accumulated tool outputs (40--80% compression)
* JSON-heavy workflows -- API responses, database queries (83--94% compression)
* Build and test output (85--94% compression)
* Multi-tool agents (60--76% compression across tool results)
Where Headroom Adds Little Value [#where-headroom-adds-little-value]
* Short conversational exchanges (median 4.8% compression)
* Code-only sessions (reading/writing files) -- code passes through
* Single-turn requests with no accumulated context
What Headroom Does NOT Compress [#what-headroom-does-not-compress]
* **Short messages** (\< 300 tokens) -- overhead exceeds savings
* **Source code** -- passes through unchanged to preserve correctness
* **grep/search results** -- compact structured format, already minimal
* **Images** -- counted at fixed token cost (\~1,600 tokens), not compressed
* **System prompts** -- preserved for prefix cache compatibility
Code Compression [#code-compression]
Headroom includes an AST-aware CodeCompressor (tree-sitter, 8 languages) but it is gated behind safety protections that prevent it from firing in most real-world scenarios. This is intentional.
**Why code mostly passes through:**
1. **Word count gate**: Content under 50 words is silently skipped
2. **Recent code protection** (`protect_recent_code=4`): Code in the last 4 messages is never compressed
3. **Analysis intent protection** (`protect_analysis_context=True`): If the most recent user message contains keywords like "analyze", "review", "explain", "fix", "debug" -- ALL code in the conversation is protected
**Why this is the right default**: Code is almost always fetched because the user wants to work with it. Compressing function bodies would remove exactly what they need.
**Where code savings come from**: The IntelligentContextManager drops old code messages that are no longer relevant (scoring-based), which is a better strategy than stripping function bodies.
**Override**: Set `protect_analysis_context=False` in `ContentRouterConfig` for aggressive code compression. Requires `headroom-ai[code]` for tree-sitter.
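The drop-old-messages strategy can be sketched in a few lines. This is an illustrative sketch, not the IntelligentContextManager's actual scoring logic; the function name, parameters, and the "code-bearing" heuristic are assumptions:

```python
def drop_stale_code_messages(messages, protect_recent=4, max_code_messages=6):
    """Illustrative sketch: keep code in recent messages, and once the
    number of code-bearing messages exceeds a budget, drop the oldest."""
    code_indices = [
        i for i, m in enumerate(messages)
        if "```" in m.get("content", "")
    ]
    # Never touch code in the last `protect_recent` messages
    protected = {i for i in code_indices if i >= len(messages) - protect_recent}
    droppable = [i for i in code_indices if i not in protected]
    # Oldest code messages go first, until we are within budget
    to_drop = set(droppable[: max(0, len(code_indices) - max_code_messages)])
    return [m for i, m in enumerate(messages) if i not in to_drop]
```

A real scorer would weigh relevance as well as recency; the point is that whole stale messages are dropped, not function bodies stripped.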
JSON Compression Constraints [#json-compression-constraints]
What Gets Compressed [#what-gets-compressed]
* Arrays of **dicts**: Full statistical analysis with adaptive K (Kneedle algorithm)
* Arrays of **strings**: Dedup + adaptive sampling + error preservation
* Arrays of **numbers**: Statistical summary + outlier/change-point preservation
* **Mixed-type** arrays: Grouped by type, each group compressed independently
* **Nested** objects: Recursed into, arrays within are compressed (up to depth 5)
What Passes Through [#what-passes-through]
* Arrays below 5 items (`min_items_to_analyze`)
* Content below 200 tokens (`min_tokens_to_crush`)
* Bool-only arrays
* JSON objects without array values
* Malformed JSON (silently passes through, no error)
Edge Cases [#edge-cases]
* **NaN/Infinity** in numeric fields: Filtered out before statistics are computed
* **Nesting depth > 5**: Inner arrays not examined for compression
* **Mixed-type arrays with small groups**: Groups below `min_items_to_analyze` are kept as-is
Safety Gates [#safety-gates]
All compressors follow the same principle: **fail gracefully, return original content unchanged**.
* Invalid JSON passes through (no error raised)
* AST parse failure falls back to original or LLMLingua
* Compression that makes output larger returns the original
* Missing optional dependencies (tree-sitter, LLMLingua) cause a passthrough with warning log
* Errors are logged at WARNING level and never propagated to callers
The one exception: LLMLingua out-of-memory during model loading raises a `RuntimeError`. All other failures are handled silently.
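The fail-graceful contract can be sketched as a wrapper. This is an illustrative sketch of the principle, not Headroom's internal code:

```python
import logging

logger = logging.getLogger("headroom.sketch")

def safe_compress(content: str, compressor) -> str:
    """Sketch of the fail-graceful contract: any failure, or a result
    larger than the input, returns the original content unchanged."""
    try:
        result = compressor(content)
    except Exception as exc:
        # Log at WARNING, never propagate to the caller
        logger.warning("compression failed, passing through: %s", exc)
        return content
    # A "compressed" result that grew is worse than no compression
    if len(result) >= len(content):
        return content
    return result
```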
Adaptive K: How Item Retention Works [#adaptive-k-how-item-retention-works]
SmartCrusher does not use fixed K values. It uses information-theoretic sizing:
1. **Kneedle algorithm** on bigram coverage curves finds the point where adding more items stops providing new information
2. **SimHash** fingerprinting detects near-duplicate items
3. **zlib validation** ensures the subset captures the full set's diversity
The resulting K is split: 30% from array start, 15% from end, 55% for importance-scored items.
**Safety guarantees (additive, never dropped):**
* Error items (containing "error", "exception", "failed", "critical") -- across ALL array types
* Numeric anomalies (> 2 standard deviations from mean)
* String length anomalies (> 2 standard deviations from mean length)
* Change points (sudden shifts in running values)
These are kept even if they exceed the K budget.
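The knee-finding step can be illustrated on a coverage curve. The sketch below uses the simplest form of the idea behind Kneedle (the point farthest from the chord joining the curve's endpoints); Headroom's actual implementation and the example numbers are assumptions:

```python
def find_knee(coverage):
    """Find the knee of an increasing, concave coverage curve: the point
    farthest from the straight line joining its endpoints (a simplified
    version of the Kneedle idea)."""
    n = len(coverage)
    if n < 3:
        return n
    x0, y0 = 0, coverage[0]
    x1, y1 = n - 1, coverage[-1]
    best_i, best_d = 0, -1.0
    for i, y in enumerate(coverage):
        # Unnormalized perpendicular distance from (i, y) to the chord
        d = abs((y1 - y0) * i - (x1 - x0) * y + x1 * y0 - y1 * x0)
        if d > best_d:
            best_i, best_d = i, d
    return best_i + 1  # keep items up to and including the knee

# Bigram coverage saturates quickly: most new information is up front
coverage = [0.40, 0.65, 0.80, 0.88, 0.90, 0.91, 0.92, 0.92, 0.92, 0.92]
k = find_knee(coverage)  # k == 4: items beyond the 4th add little
```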
Configuration Tuning [#configuration-tuning]
| Parameter | Default | Effect |
| --------------------------- | ------- | ------------------------------------------------------- |
| `min_items_to_analyze` | 5 | Arrays below this pass through |
| `min_tokens_to_crush` | 200 | Content below this passes through |
| `max_items_after_crush` | 15 | Upper bound on retained items |
| `variance_threshold` | 2.0 | Std devs for anomaly detection (lower = more preserved) |
| `protect_analysis_context` | True | Protect code when user asks about it |
| `protect_recent_code` | 4 | Messages from end to protect code in |
| `skip_user_messages` | True | Never compress user messages |
| `toin_confidence_threshold` | 0.3 | Minimum TOIN confidence to apply hints |
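The first two thresholds combine into a simple passthrough gate. A minimal sketch using the table's defaults (the function name is illustrative, not a Headroom API):

```python
def should_crush(items, token_count,
                 min_items_to_analyze=5, min_tokens_to_crush=200):
    """Sketch of the passthrough gates: arrays below the item threshold
    or content below the token threshold are never compressed."""
    if len(items) < min_items_to_analyze:
        return False
    if token_count < min_tokens_to_crush:
        return False
    return True
```

Lowering either value makes compression more aggressive at the cost of more overhead on small payloads.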
Provider Interactions [#provider-interactions]
* CacheAligner maximizes Anthropic/OpenAI prefix cache hit rates
* Token counting uses model-specific tokenizers (tiktoken for OpenAI, calibrated estimation for Anthropic)
* Compression works with all providers -- no provider-specific limitations
* Compressed content is valid JSON -- downstream tools and parsers work unchanged
TOIN Cold Start [#toin-cold-start]
The Tool Output Intelligence Network (TOIN) learns compression patterns from usage. For new tool types:
* No learned patterns exist -- falls back to statistical heuristics
* Confidence below `toin_confidence_threshold` (default 0.3) -- TOIN hints ignored
* Patterns build up over time as tools are used repeatedly
* Cross-session learning requires persistence (`TelemetryConfig.storage_path`)
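The cold-start behavior amounts to a confidence gate with a heuristic fallback. A hedged sketch (names and signature are illustrative, not TOIN's API):

```python
def apply_toin_hints(tool_output, hints, confidence,
                     threshold=0.3, fallback=None):
    """Sketch of TOIN cold start: with no learned pattern, or confidence
    below the threshold, fall back to statistical heuristics."""
    if hints is None or confidence < threshold:
        # Cold start: no trusted learned pattern for this tool type yet
        return fallback(tool_output) if fallback else tool_output
    return hints(tool_output)
```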
# LiteLLM (/docs/litellm)
Headroom integrates with [LiteLLM](https://github.com/BerriAI/litellm) as a callback that compresses messages before they reach any provider. One line to enable, works with all 100+ LiteLLM-supported providers.
Installation [#installation]
```bash
pip install headroom-ai litellm
```
Quick start [#quick-start]
```python
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback
litellm.callbacks = [HeadroomCallback()]
# All calls now compressed automatically
response = litellm.completion(model="gpt-4o", messages=[...])
response = litellm.completion(model="bedrock/claude-sonnet", messages=[...])
response = litellm.completion(model="azure/gpt-4o", messages=[...])
```
The callback compresses messages in LiteLLM's `pre_call_hook` before they reach the provider.
How it works [#how-it-works]
1. You call `litellm.completion()` with your messages
2. `HeadroomCallback.pre_call_hook` compresses the messages
3. LiteLLM sends the compressed messages to the provider
4. The response comes back unchanged
This works with every provider LiteLLM supports: OpenAI, Anthropic, Bedrock, Azure, Vertex AI, Cohere, Groq, Mistral, Together, Ollama, and more.
With LiteLLM Proxy [#with-litellm-proxy]
If you run LiteLLM as a proxy server, use the ASGI middleware:
```python
from litellm.proxy.proxy_server import app
from headroom.integrations.asgi import CompressionMiddleware
app.add_middleware(CompressionMiddleware)
```
Or configure via YAML:
```yaml
# litellm_config.yaml
litellm_settings:
  callbacks: ["headroom.integrations.litellm_callback.HeadroomCallback"]
```
Direct compress() with LiteLLM [#direct-compress-with-litellm]
You can also use `compress()` directly instead of the callback:
```python
import litellm
from headroom import compress
messages = [{"role": "user", "content": large_content}]
compressed = compress(messages, model="bedrock/claude-sonnet")
response = litellm.completion(
    model="bedrock/claude-sonnet",
    messages=compressed.messages,
)
print(f"Saved {compressed.tokens_saved} tokens")
```
ASGI middleware [#asgi-middleware]
Drop-in middleware for any ASGI application. Intercepts `/v1/messages`, `/v1/chat/completions`, `/v1/responses`, and `/chat/completions`:
```python
from fastapi import FastAPI
from headroom.integrations.asgi import CompressionMiddleware
app = FastAPI()
app.add_middleware(CompressionMiddleware)
```
Response headers include `x-headroom-compressed: true` and `x-headroom-tokens-saved: 1234`.
# MCP Tools (/docs/mcp)
Headroom's MCP server exposes compression, retrieval, and observability as tools that any MCP-compatible AI coding tool can call -- Claude Code, Cursor, Codex, and more. No proxy required.
Installation [#installation]
```bash
# MCP tools only (lightweight)
pip install "headroom-ai[mcp]"
# Or with the proxy
pip install "headroom-ai[proxy]"
```
Setup for Claude Code [#setup-for-claude-code]
```bash
# Register with Claude Code (one-time)
headroom mcp install
# Start Claude Code — it now has headroom tools
claude
```
Claude Code can now compress content on demand, retrieve originals, and check session stats.
For automatic compression of **all** traffic, also run the proxy:
```bash
# Terminal 1
headroom proxy
# Terminal 2
ANTHROPIC_BASE_URL=http://127.0.0.1:8787 claude
```
Tools [#tools]
headroom_compress [#headroom_compress]
Compress content on demand. The LLM calls this when it wants to shrink large content before reasoning over it.
**Parameters:**
* `content` (required) -- text to compress (files, JSON, logs, search results)
**Returns:**
* `compressed` -- compressed text
* `hash` -- key for retrieving the original later
* `original_tokens` / `compressed_tokens` / `savings_percent`
* `transforms` -- which compression algorithms were applied
Example flow:
```
Claude: Let me compress this large output to save context space.
-> headroom_compress(content="[5000 lines of grep results...]")
<- {
     "compressed": "[key matches with context...]",
     "hash": "a1b2c3d4e5f6...",
     "original_tokens": 12000,
     "compressed_tokens": 3200,
     "savings_percent": 73.3,
     "transforms": ["router:search:0.27"]
   }
```
The original is stored locally for 1 hour. If the LLM needs the full content later, it calls `headroom_retrieve`.
headroom_retrieve [#headroom_retrieve]
Retrieve original uncompressed content by hash.
**Parameters:**
* `hash` (required) -- hash key from a previous compression
* `query` (optional) -- search within the original to return only matching items
**Returns:**
* `original_content` (full retrieval) or `results` (filtered search)
* `source` -- `"local"` or `"proxy"`
Retrieval checks the local store first, then falls back to the proxy's store. Hashes from either source work transparently.
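A retrieval flow, mirroring the compression example above (the hash and query values are illustrative):

```
Claude: The summary dropped a row I need -- let me fetch the original.
-> headroom_retrieve(hash="a1b2c3d4e5f6...", query="user_id=42")
<- {
     "results": "[matching items from the original...]",
     "source": "local"
   }
```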
headroom_stats [#headroom_stats]
Session compression statistics.
**Returns:**
* `compressions`, `retrievals`, `tokens_saved`, `savings_percent`
* `estimated_cost_saved_usd`
* `recent_events` -- last 10 compression/retrieval events
* `sub_agents` -- stats from sub-agent MCP instances
* `combined` -- main + sub-agent totals
* `proxy` -- request count, cache hits, cost saved (if proxy is running)
Sub-agent stats are aggregated via a shared stats file at `~/.headroom/session_stats.jsonl`.
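Aggregation over the shared file can be sketched as a line-by-line JSONL sum. The `tokens_saved` field name is an assumption for illustration, not a documented schema:

```python
import json
from pathlib import Path

def combined_tokens_saved(stats_path):
    """Sketch of aggregating the shared stats file: each line is one
    JSON event (field names here are assumptions, not a documented schema)."""
    total = 0
    for line in Path(stats_path).read_text().splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        total += event.get("tokens_saved", 0)
    return total
```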
Streamable HTTP Transport (Remote / Docker) [#streamable-http-transport-remote--docker]
For agents running on a different machine than the Headroom proxy (e.g., Docker, cloud), MCP tools are available over HTTP using the MCP Streamable HTTP protocol.
Proxy auto-exposes /mcp [#proxy-auto-exposes-mcp]
When you run `headroom proxy`, MCP tools are automatically available at `/mcp`:
```bash
headroom proxy # → http://host:8787/mcp
```
Remote agents connect with:
```json
{
  "mcpServers": {
    "headroom": {
      "url": "http://proxy-host:8787/mcp"
    }
  }
}
```
Standalone HTTP server [#standalone-http-server]
Run MCP tools without the full proxy:
```bash
headroom mcp serve --transport http --port 8080
```
Remote install [#remote-install]
Configure Claude Code to use remote MCP over HTTP:
```bash
headroom mcp install --remote http://proxy-host:8787/mcp
```
This writes URL-based config instead of the default command-based config.
Protocol [#protocol]
The Streamable HTTP transport implements the MCP specification:
* `POST /mcp` -- Send JSON-RPC requests (tool calls, list tools)
* `GET /mcp` -- Server-sent events stream (server-initiated messages)
* `DELETE /mcp` -- Terminate session
Stateless mode by default -- each request is independent, no session tracking needed.
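A minimal client-side exchange against `/mcp` looks like this. The JSON-RPC envelope and method names follow the MCP specification; the host URL is a placeholder and the exact result fields may differ:

```python
import json
import urllib.request

# JSON-RPC request to list the available tools
list_tools = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

# JSON-RPC request to call headroom_compress
call_compress = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "headroom_compress",
        "arguments": {"content": "[large tool output...]"},
    },
}

req = urllib.request.Request(
    "http://proxy-host:8787/mcp",
    data=json.dumps(call_compress).encode(),
    headers={
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
    },
)
# response = urllib.request.urlopen(req)  # requires a running proxy
```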
CLI commands [#cli-commands]
```bash
# Install — local (stdio, default)
headroom mcp install
# Install — remote (HTTP, for Docker/network)
headroom mcp install --remote http://proxy-host:8787/mcp
# Install — custom proxy URL
headroom mcp install --proxy-url http://host:9000
# Overwrite existing config
headroom mcp install --force
# Serve — stdio (default, called by Claude Code)
headroom mcp serve
# Serve — HTTP (for remote agents)
headroom mcp serve --transport http --port 8080
# Serve — debug mode
headroom mcp serve --debug
# Check status
headroom mcp status
# Uninstall
headroom mcp uninstall
```
Cross-tool compatibility [#cross-tool-compatibility]
| Tool | Transport | Setup |
| -------------------- | ------------ | ---------------------------------------------------- |
| Claude Code (local) | stdio | `headroom mcp install` |
| Claude Code (remote) | HTTP | `headroom mcp install --remote http://host:8787/mcp` |
| Cursor | stdio / HTTP | Add to MCP settings |
| Docker agents | HTTP | Point to `http://proxy:8787/mcp` |
| Any MCP host | stdio / HTTP | `headroom mcp serve` or `--transport http` |
Architecture [#architecture]
MCP only (no proxy) [#mcp-only-no-proxy]
The LLM calls `headroom_compress` on demand. Compression happens locally in the MCP process. Originals are stored in a local `CompressionStore` with 1-hour TTL.
MCP + Proxy (full setup) [#mcp--proxy-full-setup]
The proxy compresses all traffic at the HTTP level (before the LLM sees content). MCP tools operate after the LLM receives content. They handle different data and do not double-compress.
`headroom_retrieve` checks the local store first, then falls back to the proxy's store.
Remote (HTTP transport) [#remote-http-transport]
```
Remote Agent (any machine)
|
| POST /mcp (JSON-RPC)
v
Headroom Proxy :8787/mcp (Streamable HTTP)
|
| in-process access
v
Compression Pipeline + CompressionStore
```
The proxy's `/mcp` endpoint shares the same compression store and pipeline as the proxy itself -- no HTTP round-trips to self.
Troubleshooting [#troubleshooting]
**"MCP SDK not installed"** -- Run `pip install "headroom-ai[mcp]"`.
**"Proxy not running"** -- Start the proxy with `headroom proxy` in another terminal. Only needed for proxy-backed retrieval.
**"Entry not found or expired"** -- Local content expires after 1 hour, proxy content after 5 minutes.
**Claude doesn't see headroom tools** -- Run `headroom mcp status`, restart Claude Code, and verify with `/mcp` inside Claude Code.
# Persistent Memory (/docs/memory)
LLMs have two fundamental limitations: context windows overflow with too much history, and every conversation starts from zero. Persistent Memory solves both by extracting key facts, persisting them, and injecting them when relevant.
This is **temporal compression** -- instead of carrying 10,000 tokens of conversation history, carry 100 tokens of extracted memories.
Quick Start [#quick-start]
```python
from openai import OpenAI
from headroom import with_memory
# One line -- that's it
client = with_memory(OpenAI(), user_id="alice")
# Use exactly like normal
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I prefer Python for backend work"}]
)
# Memory extracted INLINE -- zero extra latency

# Later, in a new conversation...
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What language should I use?"}]
)
# Response uses the Python preference from memory
```
How It Works [#how-it-works]
The `with_memory()` wrapper intercepts every chat completion call:
1. **Inject** -- Semantic search finds relevant memories and prepends them to the user message
2. **Instruct** -- Adds a memory extraction instruction to the system prompt
3. **Call** -- Forwards the request to the LLM
4. **Parse** -- Extracts the memory block from the response
5. **Store** -- Saves with embeddings, vector index, and full-text search index
6. **Return** -- Cleans the response (strips the memory block before returning)
Memory extraction happens **inline** as part of the LLM response. No extra API calls, no extra latency.
Hierarchical Scoping [#hierarchical-scoping]
Memories exist at four scope levels, from broadest to narrowest:
| Scope | Persists Across | Use Case |
| ----------- | ------------------------ | ------------------------------- |
| **User** | All sessions, all time | Long-term preferences, identity |
| **Session** | Current session only | Current task context |
| **Agent** | Current agent in session | Agent-specific context |
| **Turn** | Single turn only | Ephemeral working memory |
```python
from openai import OpenAI
from headroom import with_memory
# Session 1: Morning
client1 = with_memory(
    OpenAI(),
    user_id="bob",
    session_id="morning-session",
)
response = client1.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I prefer Go for performance-critical code"}]
)
# Memory stored at USER level (persists across sessions)

# Session 2: Afternoon (different session, same user)
client2 = with_memory(
    OpenAI(),
    user_id="bob",
    session_id="afternoon-session",
)
response = client2.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What language for my new microservice?"}]
)
# Recalls Go preference from morning session
```
Memory Categories [#memory-categories]
Memories are categorized for better organization and retrieval:
| Category | Description | Examples |
| ------------ | ------------------------------------- | ------------------------------------------------- |
| `PREFERENCE` | Likes, dislikes, preferred approaches | "Prefers Python", "Likes dark mode" |
| `FACT` | Identity, role, constraints | "Works at fintech startup", "Senior engineer" |
| `CONTEXT` | Current goals, ongoing tasks | "Migrating to microservices", "Working on auth" |
| `ENTITY` | Information about entities | "Project Apollo uses React", "Team lead is Sarah" |
| `DECISION` | Decisions made | "Chose PostgreSQL over MySQL" |
| `INSIGHT` | Derived insights | "User tends to prefer typed languages" |
Memory API [#memory-api]
The `with_memory()` wrapper exposes a `.memory` attribute for direct access:
```python
client = with_memory(OpenAI(), user_id="alice")
# Search memories (semantic)
results = client.memory.search("python preferences", top_k=5)
for memory in results:
    print(memory.content)

# Add a memory manually
client.memory.add(
    "User is a senior engineer",
    category="fact",
    importance=0.9,
)
# Get all memories for this user
all_memories = client.memory.get_all()
# Clear all memories
client.memory.clear()
# Get stats
stats = client.memory.stats()
print(f"Total memories: {stats['total']}")
print(f"By category: {stats['categories']}")
```
Temporal Versioning [#temporal-versioning]
When facts change, Headroom creates a **supersession chain** that preserves history:
```python
from headroom.memory import HierarchicalMemory, MemoryCategory, MemoryFilter

memory = await HierarchicalMemory.create()

# Original fact
orig = await memory.add(
    content="User works at Google",
    user_id="alice",
    category=MemoryCategory.FACT,
)

# User changes jobs -- supersede the old memory
new = await memory.supersede(
    old_memory_id=orig.id,
    new_content="User now works at Anthropic",
)

# Query current state (excludes superseded by default)
current = await memory.query(MemoryFilter(
    user_id="alice",
    include_superseded=False,
))
# Returns only "User now works at Anthropic"

# Get the full chain
chain = await memory.get_history(new.id)
# [
#   Memory(content="User works at Google", is_current=False),
#   Memory(content="User now works at Anthropic", is_current=True),
# ]
```
This gives you an audit trail, the ability to debug why the LLM made certain decisions, and rollback if needed.
Backends [#backends]
Embedder Backends [#embedder-backends]
```python
from headroom.memory import MemoryConfig, EmbedderBackend
# Local embeddings (recommended -- fast, free, private)
config = MemoryConfig(
    embedder_backend=EmbedderBackend.LOCAL,
    embedder_model="all-MiniLM-L6-v2",
)

# OpenAI embeddings (higher quality, costs money)
config = MemoryConfig(
    embedder_backend=EmbedderBackend.OPENAI,
    openai_api_key="sk-...",
    embedder_model="text-embedding-3-small",
)

# Ollama embeddings (local server, many models)
config = MemoryConfig(
    embedder_backend=EmbedderBackend.OLLAMA,
    ollama_base_url="http://localhost:11434",
    embedder_model="nomic-embed-text",
)
```
Storage [#storage]
Storage uses **SQLite** for CRUD and filtering, **HNSW** for vector similarity search, and **FTS5** for full-text keyword search. All embedded -- no external services required.
```python
config = MemoryConfig(
    db_path="memory.db",
    vector_dimension=384,
    hnsw_ef_construction=200,
    hnsw_m=16,
    hnsw_ef_search=50,
    cache_enabled=True,
    cache_max_size=1000,
)
```
Provider Compatibility [#provider-compatibility]
Memory works with any OpenAI-compatible client:
```python
from openai import OpenAI
from headroom import with_memory
# OpenAI
client = with_memory(OpenAI(), user_id="alice")
# Azure OpenAI
client = with_memory(
    OpenAI(base_url="https://your-resource.openai.azure.com/..."),
    user_id="alice",
)
# Groq
from groq import Groq
client = with_memory(Groq(), user_id="alice")
```
Performance [#performance]
| Operation | Latency | Notes |
| ----------------- | -------------- | ------------------------------ |
| Memory injection | \<50ms | Local embeddings + HNSW search |
| Memory extraction | +50-100 tokens | Part of LLM response (inline) |
| Memory storage | \<10ms | SQLite + HNSW + FTS5 indexing |
| Cache hit | \<1ms | LRU cache lookup |
# Metrics & Monitoring (/docs/metrics)
Headroom provides comprehensive metrics for monitoring compression performance, cost savings, and system health through both the proxy server and the SDK.
Proxy Endpoints [#proxy-endpoints]
Stats Endpoint [#stats-endpoint]
```bash
curl http://localhost:8787/stats
```
```json
{
  "persistent_savings": {
    "lifetime": {
      "tokens_saved": 12500,
      "compression_savings_usd": 0.04
    }
  },
  "requests": {
    "total": 42,
    "cached": 5,
    "rate_limited": 0,
    "failed": 0
  },
  "tokens": {
    "input": 50000,
    "output": 8000,
    "saved": 12500,
    "savings_percent": 25.0
  },
  "cost": {
    "total_cost_usd": 0.15,
    "total_savings_usd": 0.04
  },
  "cache": {
    "entries": 10,
    "total_hits": 5
  }
}
```
Persistent savings are stored at `~/.headroom/proxy_savings.json` and survive proxy restarts. Override the path with `HEADROOM_SAVINGS_PATH`.
Historical Savings [#historical-savings]
```bash
curl http://localhost:8787/stats-history
```
Returns durable compression history with hourly, daily, weekly, and monthly rollups. Supports CSV export:
```bash
curl "http://localhost:8787/stats-history?format=csv&series=daily"
curl "http://localhost:8787/stats-history?format=csv&series=monthly"
```
Prometheus Metrics [#prometheus-metrics]
```bash
curl http://localhost:8787/metrics
```
```
# HELP headroom_requests_total Total requests processed
headroom_requests_total{mode="optimize"} 1234
# HELP headroom_tokens_saved_total Total tokens saved
headroom_tokens_saved_total 5678900
# HELP headroom_compression_ratio Compression ratio histogram
headroom_compression_ratio_bucket{le="0.5"} 890
headroom_compression_ratio_bucket{le="0.7"} 1100
headroom_compression_ratio_bucket{le="0.9"} 1200
# HELP headroom_latency_seconds Request latency histogram
headroom_latency_seconds_bucket{le="0.01"} 800
headroom_latency_seconds_bucket{le="0.1"} 1150
# HELP headroom_cache_hits_total Cache hit counter
headroom_cache_hits_total 456
```
Health Check [#health-check]
```bash
curl http://localhost:8787/health
```
```json
{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 3600,
  "llmlingua_enabled": false
}
```
SDK Metrics [#sdk-metrics]
Proxy Stats [#proxy-stats]
The TypeScript SDK queries the proxy for stats:
```ts twoslash
import { HeadroomClient } from 'headroom-ai';
const client = new HeadroomClient();
// Get proxy stats
const stats = await client.proxyStats();
console.log(`Tokens saved: ${stats.tokens.saved}`);
console.log(`Savings: ${stats.tokens.savings_percent}%`);
```
Compression Result Metrics [#compression-result-metrics]
Every `compress()` call returns metrics:
```ts twoslash
import { compress } from 'headroom-ai';
const result = await compress(messages, { model: 'gpt-4o' });
console.log(`Tokens: ${result.tokensBefore} -> ${result.tokensAfter}`);
console.log(`Saved: ${result.tokensSaved} (${(result.compressionRatio * 100).toFixed(1)}%)`);
console.log(`Transforms: ${result.transformsApplied.join(', ')}`);
```
Session Stats [#session-stats]
Quick stats for the current session from the Python client (no database query):
```python
stats = client.get_stats()
print(f"Mode: {stats['config']['mode']}")
print(f"Tokens saved: {stats['session']['tokens_saved_total']}")
print(f"Avg compression: {stats['session']['compression_ratio_avg']:.1%}")
```
Returns:
```python
{
    "session": {
        "requests_total": 10,
        "tokens_input_before": 50000,
        "tokens_input_after": 35000,
        "tokens_saved_total": 15000,
        "tokens_output_total": 8000,
        "cache_hits": 3,
        "compression_ratio_avg": 0.70,
    },
    "config": {
        "mode": "optimize",
        "provider": "openai",
        "cache_optimizer_enabled": True,
        "semantic_cache_enabled": False,
    },
    "transforms": {
        "smart_crusher_enabled": True,
        "cache_aligner_enabled": True,
        "rolling_window_enabled": True,
    },
}
```
Historical Metrics [#historical-metrics]
Query stored metrics from the database:
```python
from datetime import datetime, timedelta
metrics = client.get_metrics(
    start_time=datetime.utcnow() - timedelta(hours=1),
    limit=100,
)
for m in metrics:
    print(f"{m.timestamp}: {m.tokens_input_before} -> {m.tokens_input_after}")
```
Summary Statistics [#summary-statistics]
Aggregate statistics across all stored metrics:
```python
summary = client.get_summary()
print(f"Total requests: {summary['total_requests']}")
print(f"Total tokens saved: {summary['total_tokens_saved']}")
print(f"Average compression: {summary['avg_compression_ratio']:.1%}")
print(f"Total cost savings: ${summary['total_cost_saved_usd']:.2f}")
```
Logging [#logging]
```python
import logging
# INFO level shows compression summaries
logging.basicConfig(level=logging.INFO)
# DEBUG level shows detailed transform decisions
logging.basicConfig(level=logging.DEBUG)
```
Example output:
```
INFO:headroom.transforms.pipeline:Pipeline complete: 45000 -> 4500 tokens (saved 40500, 90.0% reduction)
INFO:headroom.transforms.smart_crusher:SmartCrusher applied top_n strategy: kept 15 of 1000 items
DEBUG:headroom.transforms.smart_crusher:Kept items: [0,1,2,42,77,97,98,99] (errors at 42, warnings at 77)
```
```bash
# Log to file
headroom proxy --log-file headroom.jsonl
# Increase verbosity
headroom proxy --log-level debug
```
Cost Tracking [#cost-tracking]
Budget Alerts [#budget-alerts]
Set a budget limit in the proxy:
```bash
headroom proxy --budget 10.00
```
When the budget is exceeded, requests return a budget exceeded error, the `/stats` endpoint shows budget status, and logs indicate the budget state.
Key Metrics to Monitor [#key-metrics-to-monitor]
| Metric | What It Tells You | Target |
| ----------------------- | ------------------- | ---------------- |
| `tokens_saved_total` | Total cost savings | Higher is better |
| `compression_ratio_avg` | Efficiency | 0.7--0.9 typical |
| `cache_hit_rate` | Cache effectiveness | >20% is good |
| `latency_p99` | Performance impact | \<10ms |
| `failed_requests` | Reliability | 0 |
Grafana Dashboard [#grafana-dashboard]
Example Prometheus queries for a Grafana dashboard:
| Panel | PromQL |
| -------------------------- | --------------------------------------------------------------------------------------- |
| Tokens Saved | `headroom_tokens_saved_total` |
| Compression Ratio (median) | `histogram_quantile(0.5, headroom_compression_ratio_bucket)` |
| Request Latency (p99) | `histogram_quantile(0.99, headroom_latency_seconds_bucket)` |
| Cache Hit Rate | `headroom_cache_hits_total / (headroom_cache_hits_total + headroom_cache_misses_total)` |
# OpenAI SDK (/docs/openai-sdk)
Headroom wraps the OpenAI Node.js SDK to automatically compress messages before every `chat.completions.create()` call. All other methods (embeddings, images, audio) pass through unchanged.
Installation [#installation]
```bash
npm install headroom-ai openai
```
The TypeScript SDK sends messages to a local Headroom proxy for compression. Start the proxy before using the SDK:
```bash
pip install "headroom-ai[proxy]"
headroom proxy
```
Quick start [#quick-start]
```ts twoslash
import { withHeadroom } from 'headroom-ai/openai';
import OpenAI from 'openai';
const client = withHeadroom(new OpenAI());
// Messages are compressed automatically before sending
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: longConversation,
});
```
That's it. Every call to `client.chat.completions.create()` compresses the messages first. The response format is identical to the unwrapped client.
How it works [#how-it-works]
`withHeadroom()` returns a proxy around your OpenAI client that intercepts `chat.completions.create()`:
1. Extracts `messages` from the request params
2. Sends them to the Headroom proxy's `/v1/compress` endpoint
3. Replaces the original messages with the compressed result
4. Forwards the request to OpenAI as normal
All other client methods are untouched:
```ts twoslash
import { withHeadroom } from 'headroom-ai/openai';
import OpenAI from 'openai';
const client = withHeadroom(new OpenAI());
// These pass through unchanged
const embedding = await client.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'Hello world',
});
```
Options [#options]
Pass compression options as the second argument:
```ts twoslash
import { withHeadroom } from 'headroom-ai/openai';
import OpenAI from 'openai';
const client = withHeadroom(new OpenAI(), {
model: 'gpt-4o',
baseUrl: 'http://localhost:8787',
});
```
Streaming [#streaming]
Streaming works normally. Compression happens before the request is sent:
```ts twoslash
import { withHeadroom } from 'headroom-ai/openai';
import OpenAI from 'openai';
declare const longConversation: OpenAI.ChatCompletionMessageParam[];
const client = withHeadroom(new OpenAI());
const stream = await client.chat.completions.create({
model: 'gpt-4o',
messages: longConversation,
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}
```
Tool calling [#tool-calling]
Tool call messages and tool results are compressed like any other message content. Large tool outputs (JSON arrays, logs) see the biggest savings:
```ts twoslash
import { withHeadroom } from 'headroom-ai/openai';
import OpenAI from 'openai';
declare const hugeJsonResult: string;
const client = withHeadroom(new OpenAI());
const response = await client.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'user', content: 'Search for recent errors' },
{
role: 'assistant',
content: null,
tool_calls: [{ id: 'call_1', type: 'function', function: { name: 'search', arguments: '{"q":"errors"}' } }],
},
{
role: 'tool',
tool_call_id: 'call_1',
content: hugeJsonResult, // Compressed automatically
},
],
tools: [{ type: 'function', function: { name: 'search', parameters: {} } }],
});
```
# Proxy Server (/docs/proxy)
The Headroom proxy is a standalone HTTP server that compresses all LLM traffic passing through it. Point any client at the proxy and get automatic context optimization.
Starting the proxy [#starting-the-proxy]
```bash
# Basic usage
headroom proxy
# Custom host and port
headroom proxy --host 0.0.0.0 --port 8080
# With logging and budget
headroom proxy \
--log-file /var/log/headroom.jsonl \
--budget 100.0
```
Telemetry is enabled by default. Opt out with `HEADROOM_TELEMETRY=off` or `--no-telemetry`.
CLI options [#cli-options]
Core [#core]
| Option | Default | Description |
| ------------------ | ------------------------ | --------------------------------------- |
| `--host` | `127.0.0.1` | Host to bind to |
| `--port` | `8787` | Port to bind to |
| `--no-optimize` | `false` | Disable optimization (passthrough mode) |
| `--no-cache` | `false` | Disable semantic caching |
| `--no-rate-limit` | `false` | Disable rate limiting |
| `--log-file` | None | Path to JSONL log file |
| `--budget` | None | Daily budget limit in USD |
| `--openai-api-url` | `https://api.openai.com` | Custom OpenAI API URL |
Context management [#context-management]
| Option | Default | Description |
| -------------------------- | ------- | ------------------------------------------------- |
| `--no-intelligent-context` | `false` | Fall back to RollingWindow (oldest-first drops) |
| `--no-intelligent-scoring` | `false` | Disable multi-factor importance scoring |
| `--no-compress-first` | `false` | Disable trying deeper compression before dropping |
By default, the proxy uses **IntelligentContextManager** which scores messages by recency, semantic similarity, TOIN-learned patterns, error indicators, and forward references. Dropped messages are stored in CCR for retrieval.
```bash
# Use legacy RollingWindow
headroom proxy --no-intelligent-context
# Faster but less intelligent scoring
headroom proxy --no-intelligent-scoring
```
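The idea behind multi-factor scoring can be sketched in a few lines. Everything below (the weights, factor names, and drop helper) is hypothetical, purely to illustrate the approach; the real scorer is internal to IntelligentContextManager:

```python
def importance(index: int, total: int, similarity: float,
               has_error: bool, referenced_later: bool) -> float:
    """Toy multi-factor importance score with hypothetical weights."""
    recency = index / max(total - 1, 1)       # newer messages score higher
    score = 0.4 * recency + 0.3 * similarity  # recency + semantic similarity
    if has_error:
        score += 0.2                          # error indicators are kept
    if referenced_later:
        score += 0.1                          # forward references matter
    return score

def pick_to_drop(scored, n):
    """Drop the n lowest-scoring messages instead of simply the oldest."""
    return sorted(scored, key=lambda pair: pair[1])[:n]
```

Contrast this with RollingWindow, which drops oldest-first regardless of content.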
LLMLingua (ML compression) [#llmlingua-ml-compression]
| Option | Default | Description |
| -------------------- | ------- | ---------------------------------------- |
| `--llmlingua` | `false` | Enable LLMLingua-2 ML-based compression |
| `--llmlingua-device` | `auto` | Device: `auto`, `cuda`, `cpu`, `mps` |
| `--llmlingua-rate` | `0.3` | Target compression rate (0.3 = keep 30%) |
```bash
pip install "headroom-ai[llmlingua]"
headroom proxy --llmlingua --llmlingua-device cuda
headroom proxy --llmlingua --llmlingua-rate 0.2
```
LLMLingua adds \~2 GB of dependencies (torch, transformers), 10-30s cold start, and \~1 GB RAM. Enable when maximum compression justifies the cost.
API endpoints [#api-endpoints]
`GET /health` [#get-health]
```bash
curl http://localhost:8787/health
```
```json
{
"status": "healthy",
"optimize": true,
"stats": {
"total_requests": 42,
"tokens_saved": 15000,
"savings_percent": 45.2
}
}
```
`GET /stats` [#get-stats]
Live session statistics plus durable `persistent_savings` totals. Stored at `~/.headroom/proxy_savings.json` (override with `HEADROOM_SAVINGS_PATH`).
```bash
curl http://localhost:8787/stats
```
`GET /stats-history` [#get-stats-history]
Durable history with hourly, daily, weekly, and monthly rollups. Powers the `/dashboard` view.
```bash
curl http://localhost:8787/stats-history
curl "http://localhost:8787/stats-history?format=csv&series=weekly"
```
`GET /metrics` [#get-metrics]
Prometheus-format metrics for monitoring.
```bash
curl http://localhost:8787/metrics
```
```
headroom_requests_total{mode="optimize"} 1234
headroom_tokens_saved_total 5678900
headroom_compression_ratio_bucket{le="0.5"} 890
headroom_latency_seconds_bucket{le="0.01"} 800
headroom_cache_hits_total 456
```
`POST /v1/messages` [#post-v1messages]
Anthropic API format. The proxy compresses messages, forwards to Anthropic, and returns the response.
`POST /v1/chat/completions` [#post-v1chatcompletions]
OpenAI API format. The proxy compresses messages, forwards to OpenAI, and returns the response.
`POST /v1/compress` [#post-v1compress]
Compression-only endpoint. Compresses messages without calling any LLM. Used by the TypeScript SDK.
**Request:**
```json
{
"messages": [{ "role": "user", "content": "..." }],
"model": "gpt-4o"
}
```
**Response:**
```json
{
"messages": [{ "role": "user", "content": "..." }],
"tokens_before": 15000,
"tokens_after": 3500,
"tokens_saved": 11500,
"compression_ratio": 0.23,
"transforms_applied": ["router:smart_crusher:0.35"],
"ccr_hashes": ["a1b2c3"]
}
```
Set `x-headroom-bypass: true` to skip compression.
Agent wrapping [#agent-wrapping]
Use `headroom wrap` to transparently proxy any CLI tool:
```bash
# Claude Code
headroom wrap claude
# OpenAI Codex
headroom wrap codex
# Aider
headroom wrap aider
# Cursor
headroom wrap cursor
```
Or set the base URL manually:
```bash
# Claude Code
ANTHROPIC_BASE_URL=http://localhost:8787 claude
# Cursor / any OpenAI-compatible client
OPENAI_BASE_URL=http://localhost:8787/v1 cursor
```
Cloud providers [#cloud-providers]
```bash
# AWS Bedrock
headroom proxy --backend bedrock --region us-east-1
# Google Vertex AI
headroom proxy --backend vertex_ai --region us-central1
# Azure OpenAI
headroom proxy --backend azure
# OpenRouter (400+ models)
OPENROUTER_API_KEY=sk-or-... headroom proxy --backend openrouter
```
Environment variables [#environment-variables]
```bash
export HEADROOM_HOST=0.0.0.0
export HEADROOM_PORT=8787
export HEADROOM_BUDGET=100.0
export OPENAI_TARGET_API_URL=https://custom.openai.endpoint.com
headroom proxy
```
Production deployment [#production-deployment]
gunicorn [#gunicorn]
```bash
pip install gunicorn
gunicorn headroom.proxy.server:app \
--workers 4 \
--bind 0.0.0.0:8787 \
--worker-class uvicorn.workers.UvicornWorker
```
Docker [#docker]
```dockerfile
FROM python:3.11-slim
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
&& pip install "headroom-ai[proxy]" \
&& apt-get purge -y build-essential && apt-get autoremove -y \
&& rm -rf /var/lib/apt/lists/*
EXPOSE 8787
CMD ["headroom", "proxy", "--host", "0.0.0.0"]
```
`build-essential` is required at install time because `headroom-ai` includes `hnswlib`, a C++ extension compiled from source. It is removed after installation to keep the image slim.
# Quickstart (/docs/quickstart)
This guide gets you from zero to compressed LLM calls in under 5 minutes.
1\. Install [#1-install]
```bash
npm install headroom-ai
```
```bash
pip install "headroom-ai[all]"
```
The TypeScript SDK sends messages to a local Headroom proxy for compression. Start the proxy before using the TS SDK:
```bash
pip install "headroom-ai[proxy]"
headroom proxy --port 8787
```
The proxy runs the compression pipeline (Python) and exposes an HTTP API that the TS SDK calls.
2\. Compress messages [#2-compress-messages]
```ts twoslash
import { compress } from 'headroom-ai';
const messages = [
{ role: 'system' as const, content: 'You analyze search results.' },
{ role: 'user' as const, content: 'Search for Python tutorials.' },
{
role: 'assistant' as const,
content: null,
tool_calls: [{
id: 'call_1',
type: 'function' as const,
function: { name: 'search', arguments: '{"q": "python"}' },
}],
},
{
role: 'tool' as const,
tool_call_id: 'call_1',
content: JSON.stringify({
results: Array.from({ length: 500 }, (_, i) => ({
title: `Result ${i}`,
snippet: `Description ${i}`,
score: 100 - i,
})),
}),
},
{ role: 'user' as const, content: 'What are the top 3 results?' },
];
const result = await compress(messages, {
model: 'gpt-4o',
baseUrl: 'http://localhost:8787',
});
```
```python
from headroom import compress
import json
messages = [
{"role": "system", "content": "You analyze search results."},
{"role": "user", "content": "Search for Python tutorials."},
{
"role": "assistant",
"content": None,
"tool_calls": [{
"id": "call_1",
"type": "function",
"function": {"name": "search", "arguments": '{"q": "python"}'},
}],
},
{
"role": "tool",
"tool_call_id": "call_1",
"content": json.dumps({
"results": [
{"title": f"Result {i}", "snippet": f"Description {i}", "score": 100 - i}
for i in range(500)
]
}),
},
{"role": "user", "content": "What are the top 3 results?"},
]
result = compress(messages, model="gpt-4o")
```
3\. Send to your LLM [#3-send-to-your-llm]
Use the compressed messages exactly like the originals:
```ts twoslash
import OpenAI from 'openai';
const client = new OpenAI();
// result.messages from the previous step
const messages: any[] = [];
const response = await client.chat.completions.create({
model: 'gpt-4o',
messages,
});
console.log(response.choices[0].message.content);
```
```python
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=result.messages,
)
print(response.choices[0].message.content)
```
4\. Check your savings [#4-check-your-savings]
```ts twoslash
const result = {
tokensBefore: 45000,
tokensAfter: 4500,
tokensSaved: 40500,
compressionRatio: 0.9,
transformsApplied: ['smart_crusher', 'cache_aligner'],
messages: [],
ccrHashes: [],
compressed: true,
};
// ---cut---
console.log(`Tokens before: ${result.tokensBefore}`);
console.log(`Tokens after: ${result.tokensAfter}`);
console.log(`Tokens saved: ${result.tokensSaved}`);
console.log(`Compression: ${(result.compressionRatio * 100).toFixed(0)}%`);
console.log(`Transforms: ${result.transformsApplied.join(', ')}`);
```
Example output:
```
Tokens before: 45000
Tokens after: 4500
Tokens saved: 40500
Compression: 90%
Transforms: smart_crusher, cache_aligner
```
```python
print(f"Tokens before: {result.tokens_before}")
print(f"Tokens after: {result.tokens_after}")
print(f"Tokens saved: {result.tokens_saved}")
print(f"Compression: {result.compression_ratio:.0%}")
print(f"Transforms: {result.transforms_applied}")
```
Example output:
```
Tokens before: 45000
Tokens after: 4500
Tokens saved: 40500
Compression: 90%
Transforms: ['smart_crusher', 'cache_aligner']
```
Alternative: proxy mode (zero code changes) [#alternative-proxy-mode-zero-code-changes]
If you do not want to change any code, run Headroom as a proxy and point your existing client at it:
```bash
# Start the proxy
headroom proxy --port 8787
# Point Claude Code at it
ANTHROPIC_BASE_URL=http://localhost:8787 claude
# Or any OpenAI-compatible client
OPENAI_BASE_URL=http://localhost:8787/v1 your-app
```
All requests flow through Headroom automatically. Check savings at any time:
```bash
curl http://localhost:8787/stats
# {"requests_total": 42, "tokens_saved_total": 125000, ...}
```
What gets compressed [#what-gets-compressed]
The biggest savings come from tool outputs -- search results, database rows, log files, API responses. Headroom auto-detects the content type and routes it to the best compressor. No configuration needed.
| Content type | Compressor | Typical savings |
| --------------- | ---------------- | --------------- |
| JSON arrays | SmartCrusher | 70--90% |
| Source code | CodeCompressor | 40--70% |
| Build/test logs | LogCompressor | 80--95% |
| Search results | SearchCompressor | 60--80% |
| Plain text | Kompress | 30--50% |
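A sketch of that routing decision, with heuristics that are purely illustrative (ContentRouter's actual detection logic is internal):

```python
import json
import re

def route(content: str) -> str:
    """Guess which compressor a tool output would be routed to (illustrative)."""
    try:
        parsed = json.loads(content)
        if isinstance(parsed, (list, dict)):
            return "SmartCrusher"              # JSON arrays/objects
    except ValueError:
        pass
    if re.search(r"FAILED|ERROR|passed|failed", content):
        return "LogCompressor"                 # build/test logs
    if re.search(r"^\S+:\d+:", content, re.M):
        return "SearchCompressor"              # grep-style file:line: hits
    if re.search(r"^\s*(def |class |import )", content, re.M):
        return "CodeCompressor"                # source code
    return "Kompress"                          # plain text fallback
```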
Next steps [#next-steps]
# SharedContext (/docs/shared-context)
When agents hand off to each other, context gets replayed in full. SharedContext compresses what moves between agents using Headroom's compression pipeline, typically saving **\~80% of tokens** on agent handoffs.
Quick Start [#quick-start]
```ts twoslash
import { SharedContext } from "headroom-ai";
declare const bigResearchOutput: string;
const ctx = new SharedContext();
// Agent A stores large output
const entry = await ctx.put("research", bigResearchOutput, {
agent: "researcher",
});
// Agent B gets compressed version (~80% smaller)
const summary = ctx.get("research");
// Agent B needs full details on demand
const full = ctx.get("research", { full: true });
```
```python
from headroom import SharedContext
ctx = SharedContext()
# Agent A stores large output
ctx.put("research", big_research_output, agent="researcher")
# Agent B gets compressed version (~80% smaller)
summary = ctx.get("research")
# Agent B needs full details on demand
full = ctx.get("research", full=True)
```
API [#api]
`put(key, content, agent?)` [#putkey-content-agent]
Store content under a key. Compresses automatically using Headroom's full pipeline (SmartCrusher for JSON, CodeCompressor for code, Kompress for text).
```ts twoslash
import { SharedContext } from "headroom-ai";
declare const bigJsonOutput: string;
const ctx = new SharedContext();
// ---cut---
const entry = await ctx.put("findings", bigJsonOutput, {
agent: "researcher",
});
entry.originalTokens; // 20000
entry.compressedTokens; // 4000
entry.savingsPercent; // 80.0
entry.transforms; // ["router:json:0.20"]
```
```python
entry = ctx.put("findings", big_json_output, agent="researcher")
entry.original_tokens # 20,000
entry.compressed_tokens # 4,000
entry.savings_percent # 80.0
entry.transforms # ["router:json:0.20"]
```
`get(key, full?)` [#getkey-full]
Retrieve content. Returns the compressed version by default, or the original with `full=True`.
```ts twoslash
import { SharedContext } from "headroom-ai";
const ctx = new SharedContext();
// ---cut---
const compressed = ctx.get("findings"); // 4K tokens
const original = ctx.get("findings", { full: true }); // 20K tokens
const missing = ctx.get("nonexistent"); // null
```
```python
compressed = ctx.get("findings") # 4K tokens
original = ctx.get("findings", full=True) # 20K tokens
missing = ctx.get("nonexistent") # None
```
`stats()` [#stats]
Aggregated statistics across all entries.
```ts twoslash
import { SharedContext } from "headroom-ai";
const ctx = new SharedContext();
// ---cut---
const stats = ctx.stats();
stats.entries; // 3
stats.totalOriginalTokens; // 60000
stats.totalCompressedTokens; // 12000
stats.totalTokensSaved; // 48000
stats.savingsPercent; // 80.0
```
```python
stats = ctx.stats()
stats.entries # 3
stats.total_original_tokens # 60000
stats.total_compressed_tokens # 12000
stats.total_tokens_saved # 48000
stats.savings_percent # 80.0
```
`keys()` and `clear()` [#keys-and-clear]
`keys()` lists all non-expired keys. `clear()` removes all entries.
Configuration [#configuration]
```ts twoslash
import { SharedContext } from "headroom-ai";
// ---cut---
const ctx = new SharedContext({
model: "claude-sonnet-4-5-20250929", // For token counting
ttl: 3600, // 1 hour (default)
maxEntries: 100, // Evicts oldest when full
});
```
```python
ctx = SharedContext(
model="claude-sonnet-4-5-20250929", # For token counting
ttl=3600, # 1 hour (default)
max_entries=100, # Evicts oldest when full
)
```
Entries expire after `ttl` seconds. When `maxEntries` is reached, the oldest entry is evicted.
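The expiry and eviction semantics can be sketched as follows; this `TinyStore` is an illustration of the behavior, not SharedContext's implementation:

```python
import time

class TinyStore:
    """Illustrative TTL + max-entries behavior (not SharedContext's code)."""
    def __init__(self, ttl: float = 3600, max_entries: int = 100):
        self.ttl, self.max_entries = ttl, max_entries
        self._data: dict = {}  # insertion-ordered: oldest entry first

    def put(self, key, value):
        if key not in self._data and len(self._data) >= self.max_entries:
            self._data.pop(next(iter(self._data)))  # evict oldest entry
        self._data[key] = (time.monotonic(), value)

    def get(self, key):
        item = self._data.get(key)
        if item is None or time.monotonic() - item[0] > self.ttl:
            return None  # missing or expired
        return item[1]
```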
Framework Examples [#framework-examples]
SharedContext is framework-agnostic. It works anywhere context moves between agents.
CrewAI [#crewai]
```python
from headroom import SharedContext
ctx = SharedContext()
# After researcher task completes
ctx.put("findings", researcher_task.output.raw)
# Coder task gets compressed context
coder_context = ctx.get("findings")
```
LangGraph [#langgraph]
```python
from headroom import SharedContext
ctx = SharedContext()
def researcher_node(state):
result = do_research()
ctx.put("research", result)
return {"research_summary": ctx.get("research")}
def coder_node(state):
# Compressed summary in state, full details on demand
full = ctx.get("research", full=True)
return {"code": write_code(full)}
```
OpenAI Agents SDK [#openai-agents-sdk]
```python
from headroom import SharedContext
ctx = SharedContext()
def compress_handoff(messages):
for msg in messages:
if len(msg.content) > 1000:
ctx.put(msg.id, msg.content)
msg.content = ctx.get(msg.id)
return messages
handoff(agent=coder, input_filter=compress_handoff)
```
How It Works [#how-it-works]
Under the hood, `put()` calls `headroom.compress()` -- the same pipeline used by the Headroom proxy -- and stores the original in memory. `get()` returns the compressed version. `get(full=True)` returns the original.
The compression pipeline routes content to the best compressor:
* **JSON arrays** -- SmartCrusher (70-95% compression)
* **Code** -- CodeCompressor (AST-aware)
* **Text** -- Kompress (ModernBERT-based) or passthrough
# Simulation (/docs/simulation)
Simulation mode lets you preview what Headroom would do to your messages without sending them to an LLM. This is useful for cost estimation, debugging compression behavior, and understanding where token waste comes from.
Basic Usage [#basic-usage]
```ts twoslash
import { compress } from 'headroom-ai';
// compress() runs the full pipeline locally, so you can inspect
// the result without sending messages to your LLM
const messages: any[] = []; // your conversation
const result = await compress(messages, { model: 'gpt-4o' });
console.log(`Would save: ${result.tokensSaved} tokens`);
console.log(`Compression ratio: ${(result.compressionRatio * 100).toFixed(1)}%`);
console.log(`Transforms: ${result.transformsApplied.join(', ')}`);
```
```python
from openai import OpenAI
from headroom import HeadroomClient, OpenAIProvider
client = HeadroomClient(original_client=OpenAI(), provider=OpenAIProvider())
plan = client.chat.completions.simulate(
model="gpt-4o",
messages=large_conversation,
)
print(f"Tokens before: {plan.tokens_before}")
print(f"Tokens after: {plan.tokens_after}")
print(f"Would save: {plan.tokens_saved} tokens ({plan.savings_percent:.1f}%)")
print(f"Transforms: {plan.transforms_applied}")
```
Waste Signals [#waste-signals]
Simulation reports where token waste comes from in your messages:
```python
plan = client.chat.completions.simulate(
model="gpt-4o",
messages=messages,
)
waste = plan.waste_signals
print(f"JSON bloat: {waste.json_bloat_tokens} tokens")
print(f"HTML noise: {waste.html_noise_tokens} tokens")
print(f"Whitespace: {waste.whitespace_tokens} tokens")
print(f"Dynamic dates: {waste.dynamic_date_tokens} tokens")
print(f"Repetition: {waste.repetition_tokens} tokens")
```
Waste signals help you understand which parts of your input are contributing the most unnecessary tokens.
Block Breakdown [#block-breakdown]
The parser breaks your conversation into blocks so you can see where tokens are concentrated. The breakdown reports a token count for each of the following block kinds:
| Block Kind | Description |
| ------------- | ------------------------------------- |
| `system` | System prompt instructions |
| `user` | User messages |
| `assistant` | Model responses |
| `tool_call` | Function call requests |
| `tool_result` | Tool output (largest source of waste) |
| `rag` | Retrieved document context |
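As a rough illustration of the idea, you can tally tokens per block kind yourself. The whitespace word count below stands in for a real tokenizer, and the classification rules are ours, not Headroom's:

```python
from collections import Counter

def block_breakdown(messages):
    """Rough tokens-per-block tally using whitespace words (not a real tokenizer)."""
    counts = Counter()
    for msg in messages:
        kind = msg["role"]
        if kind == "tool":
            kind = "tool_result"
        elif kind == "assistant" and msg.get("tool_calls"):
            kind = "tool_call"
        counts[kind] += len(str(msg.get("content") or "").split())
    return dict(counts)
```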
Use Cases [#use-cases]
Cost Estimation [#cost-estimation]
Run simulation on a representative sample of your workload to estimate savings before enabling `optimize` mode:
```python
total_before = 0
total_after = 0
for messages in sample_conversations:
plan = client.chat.completions.simulate(
model="gpt-4o",
messages=messages,
)
total_before += plan.tokens_before
total_after += plan.tokens_after
savings_pct = (1 - total_after / total_before) * 100
print(f"Estimated savings: {savings_pct:.1f}%")
print(f"Tokens saved: {total_before - total_after:,}")
```
Debugging Compression [#debugging-compression]
Use simulation to understand why a particular conversation is or is not being compressed:
```python
import json
plan = client.chat.completions.simulate(
model="gpt-4o",
messages=messages,
)
if plan.tokens_saved == 0:
print("No compression applied. Possible reasons:")
print("- Messages are too short (< 200 tokens per tool output)")
print("- No tool outputs with compressible JSON arrays")
print("- Content is already compact (code, grep results)")
else:
print(f"Transforms applied: {plan.transforms_applied}")
# See the optimized messages
print(json.dumps(plan.messages_optimized, indent=2))
```
Comparing Configurations [#comparing-configurations]
Test different configurations to find the best settings for your workload:
```python
from headroom import HeadroomClient, OpenAIProvider
from headroom.transforms import SmartCrusherConfig
configs = [
SmartCrusherConfig(max_items_after_crush=10),
SmartCrusherConfig(max_items_after_crush=25),
SmartCrusherConfig(max_items_after_crush=50),
]
for config in configs:
client = HeadroomClient(
original_client=OpenAI(),
provider=OpenAIProvider(),
smart_crusher_config=config,
)
plan = client.chat.completions.simulate(model="gpt-4o", messages=messages)
print(f"max_items={config.max_items_after_crush}: "
f"{plan.tokens_saved} tokens saved ({plan.savings_percent:.1f}%)")
```
Simulation never calls the LLM API. It runs the full transform pipeline locally and returns the results, so there is no cost and no latency from the provider.
# SmartCrusher (/docs/smart-crusher)
SmartCrusher is Headroom's compressor for JSON tool outputs. It analyzes arrays statistically, keeps the important items (errors, anomalies, relevant matches), and drops the rest. This is the compressor that fires automatically when ContentRouter detects JSON arrays.
How It Works [#how-it-works]
SmartCrusher doesn't blindly truncate arrays. It scores each item across five dimensions:
1. **First/Last items** -- Context for pagination and recency
2. **Error items** -- 100% preservation of error states (never dropped)
3. **Anomalies** -- Statistical outliers (> 2 standard deviations from the mean)
4. **Relevant items** -- Matches to the user's query via BM25/embeddings
5. **Change points** -- Significant transitions in data
The result: a 1,000-item array becomes \~50 items with all the information the LLM actually needs.
What Gets Preserved [#what-gets-preserved]
| Category | Preserved | Why |
| --------- | --------- | -------------------------- |
| Errors | 100% | Critical for debugging |
| First N | 100% | Context and pagination |
| Last N | 100% | Recency |
| Anomalies | All | Unusual values matter |
| Relevant | Top K | Match user's query |
| Others | Sampled | Statistical representation |
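The selection logic can be sketched as follows. This is an illustration of the scoring dimensions above, not SmartCrusher's actual code, and it assumes items carry `status` and `score` fields:

```python
from statistics import mean, stdev

def select_items(items, keep_first=3, keep_last=2, std_threshold=2.0):
    """Illustrative selection: boundaries, errors, numeric outliers."""
    n = len(items)
    # First/last items: context and recency
    keep = set(range(min(keep_first, n))) | set(range(max(n - keep_last, 0), n))
    # Errors: always preserved
    keep |= {i for i, it in enumerate(items) if it.get("status") == "error"}
    # Anomalies: numeric values beyond std_threshold standard deviations
    scores = [it["score"] for it in items if isinstance(it.get("score"), (int, float))]
    if len(scores) > 1:
        mu, sigma = mean(scores), stdev(scores)
        if sigma:
            keep |= {i for i, it in enumerate(items)
                     if isinstance(it.get("score"), (int, float))
                     and abs(it["score"] - mu) > std_threshold * sigma}
    return [items[i] for i in sorted(keep)]
```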
Quick Start [#quick-start]
```ts twoslash
import { compress } from "headroom-ai";
// SmartCrusher fires automatically for JSON tool outputs
const messages = [
{ role: "system" as const, content: "You are a helpful assistant." },
{ role: "user" as const, content: "Find errors in the last 24 hours" },
{
role: "tool" as const,
content: JSON.stringify({ results: new Array(1000).fill({ status: "ok" }) }),
tool_call_id: "call_1",
},
];
const result = await compress(messages);
console.log(`Tokens saved: ${result.tokensSaved}`);
// SmartCrusher keeps errors, anomalies, and relevant items
```
```python
from headroom import SmartCrusher
crusher = SmartCrusher()
# Before: 1000 search results (45,000 tokens)
tool_output = {"results": ["...1000 items..."]}
# After: ~50 important items (4,500 tokens) -- 90% reduction
compressed = crusher.crush(tool_output, query="user's question")
```
Configuration [#configuration]
```ts twoslash
import { compress } from "headroom-ai";
const messages: any[] = []; // your conversation
// Configure via the Headroom proxy or HeadroomClient
const result = await compress(messages, {
model: "gpt-4o",
tokenBudget: 10000, // SmartCrusher will reduce JSON to fit
});
console.log(`Transforms: ${result.transformsApplied}`);
// ["smart_crusher", "cache_aligner"]
```
```python
from headroom import SmartCrusher, SmartCrusherConfig
config = SmartCrusherConfig(
min_tokens_to_crush=200, # Only compress if > 200 tokens
max_items_after_crush=50, # Keep at most 50 items
keep_first=3, # Always keep first 3 items
keep_last=2, # Always keep last 2 items
relevance_threshold=0.3, # Keep items with relevance > 0.3
anomaly_std_threshold=2.0, # Keep items > 2 std dev from mean
preserve_errors=True, # Always keep error items
)
crusher = SmartCrusher(config)
compressed = crusher.crush(tool_output, query="find payment failures")
```
Configuration Options [#configuration-options]
| Option | Default | Description |
| ----------------------- | ------- | ---------------------------------------------------- |
| `min_tokens_to_crush` | `200` | Only compress arrays with more than this many tokens |
| `max_items_after_crush` | `50` | Maximum items to keep after compression |
| `keep_first` | `3` | Always keep the first N items |
| `keep_last` | `2` | Always keep the last N items |
| `relevance_threshold` | `0.3` | Minimum relevance score to keep an item |
| `anomaly_std_threshold` | `2.0` | Standard deviation threshold for anomaly detection |
| `preserve_errors` | `True` | Always keep items containing error states |
Example: Before and After [#example-before-and-after]
Consider a tool that returns 1,000 search results:
```python
# Before compression: 45,000 tokens
{
"results": [
{"id": 1, "status": "ok", "message": "Success", "timestamp": "..."},
{"id": 2, "status": "ok", "message": "Success", "timestamp": "..."},
# ... 995 more "ok" results ...
{"id": 998, "status": "error", "message": "Connection timeout", "timestamp": "..."},
{"id": 999, "status": "ok", "message": "Success", "timestamp": "..."},
{"id": 1000, "status": "ok", "message": "Success", "timestamp": "..."},
]
}
# After SmartCrusher: 4,500 tokens (90% reduction)
# Kept: first 3, last 2, the error at id=998, statistical sample
```
The LLM sees the structure, the error, and a representative sample -- everything it needs to answer "find errors in the last 24 hours" without wading through 1,000 identical success responses.
You don't need to call SmartCrusher directly. The ContentRouter detects JSON arrays and routes them to SmartCrusher automatically. Direct usage is available when you want fine-grained control over the configuration.
# Strands (/docs/strands)
Headroom integrates with [Strands Agents](https://github.com/strands-agents/sdk-python) through two patterns: wrap the model for full conversation compression, or hook into tool calls for targeted tool output compression.
Installation [#installation]
```bash
pip install headroom-ai strands-agents
```
Quick start [#quick-start]
```python
from strands import Agent
from strands.models.bedrock import BedrockModel
from headroom.integrations.strands import HeadroomStrandsModel
model = BedrockModel(model_id="us.anthropic.claude-sonnet-4-20250514-v1:0")
optimized = HeadroomStrandsModel(wrapped_model=model)
agent = Agent(model=optimized)
response = agent("Investigate the production incident")
print(f"Tokens saved: {optimized.total_tokens_saved}")
```
Model wrapping [#model-wrapping]
Wraps the Strands `Model` interface. Every call to `stream()` compresses messages before they reach the provider:
```python
from headroom import HeadroomConfig
from headroom.integrations.strands import HeadroomStrandsModel
optimized = HeadroomStrandsModel(
wrapped_model=model,
config=HeadroomConfig(),
)
agent = Agent(model=optimized)
response = agent("Analyze these logs")
```
Hook provider (tool output compression) [#hook-provider-tool-output-compression]
Compresses tool call results via Strands' hook system. Uses SmartCrusher on JSON arrays returned by tools:
```python
from strands import Agent
from strands.models.bedrock import BedrockModel
from headroom.integrations.strands import HeadroomHookProvider
model = BedrockModel(model_id="us.anthropic.claude-sonnet-4-20250514-v1:0")
hooks = HeadroomHookProvider(
compress_tool_outputs=True,
min_tokens_to_compress=200,
preserve_errors=True,
)
agent = Agent(model=model, hooks=[hooks])
response = agent("Search the database for recent failures")
print(f"Tokens saved by hooks: {hooks.total_tokens_saved}")
```
The hook preserves error items, anomalous values (statistical outliers), items matching the query context, and boundary items (first/last).
Both together [#both-together]
Model wrapping compresses conversation history. Hooks compress individual tool results. Use both for maximum savings:
```python
from headroom.integrations.strands import HeadroomStrandsModel, HeadroomHookProvider
optimized = HeadroomStrandsModel(wrapped_model=model)
hooks = HeadroomHookProvider(compress_tool_outputs=True)
agent = Agent(model=optimized, hooks=[hooks])
```
How it works [#how-it-works]
```
Agent decides to call tool
|
v
Tool executes, returns result
|
v
HeadroomHookProvider (optional)
compresses tool result JSON
|
v
Agent builds next API request
|
v
HeadroomStrandsModel.stream()
compresses full message list
|
v
Provider API (Bedrock, etc.)
```
The model wrapper uses the full Headroom pipeline (CacheAligner, ContentRouter, IntelligentContext). The hook provider uses SmartCrusher directly for fast JSON compression.
Structured output [#structured-output]
```python
from pydantic import BaseModel
class Analysis(BaseModel):
severity: str
root_cause: str
recommendation: str
result = optimized.structured_output(Analysis, messages)
```
Metrics [#metrics]
```python
for m in optimized.metrics_history:
print(f" {m.tokens_before} -> {m.tokens_after} ({m.tokens_saved} saved)")
print(f"Total saved: {optimized.total_tokens_saved}")
```
Supported providers [#supported-providers]
| Strands Model | Provider Detected |
| -------------- | ------------------------ |
| `BedrockModel` | Anthropic (via Bedrock) |
| `OllamaModel` | OpenAI-compatible |
| Custom `Model` | Falls back to estimation |
# Text & Log Compression (/docs/text-and-logs)
Headroom provides specialized compressors for text-based content that isn't JSON or source code. Each one understands the structure of its content type and preserves what the LLM needs while dropping the noise.
| Compressor | Input Type | What It Preserves | Typical Savings |
| --------------------- | -------------------------- | -------------------------------- | --------------- |
| `SearchCompressor` | grep/ripgrep output | Relevant matches, file diversity | 80-95% |
| `LogCompressor` | Build/test logs | Errors, stack traces, summaries | 85-95% |
| `DiffCompressor` | Unified diffs | Changed lines, context | 60-80% |
| `TextCompressor` | General text | Relevant paragraphs, anchors | 60-80% |
| `LLMLinguaCompressor` | Any text (max compression) | Semantic meaning via ML | 80-95% |
SearchCompressor [#searchcompressor]
Compresses search results (grep, ripgrep, ag) while keeping the matches that matter.
```python
from headroom.transforms import SearchCompressor
search_results = """
src/utils.py:42:def process_data(items):
src/utils.py:43: \"\"\"Process items.\"\"\"
src/models.py:15:class DataProcessor:
src/models.py:89: def process(self, items):
... hundreds more matches ...
"""
compressor = SearchCompressor()
result = compressor.compress(search_results, context="find process")
print(f"Compressed {result.original_match_count} matches to {result.compressed_match_count}")
print(result.compressed)
```
**What gets preserved:**
* Exact query matches (lines containing the search term)
* High-relevance matches (scored by BM25 similarity)
* File diversity (results from different files are kept)
* First/last matches (context from start and end)
Configuration [#configuration]
```python
from headroom.transforms import SearchCompressor, SearchCompressorConfig
config = SearchCompressorConfig(
max_results=50, # Keep up to 50 matches
preserve_file_diversity=True, # Ensure different files represented
relevance_threshold=0.3, # Minimum relevance score to keep
)
compressor = SearchCompressor(config)
```
LogCompressor [#logcompressor]
Compresses build and test output while preserving errors, warnings, and summaries.
```python
from headroom.transforms import LogCompressor
build_output = """
===== test session starts =====
collected 500 items
tests/test_foo.py::test_1 PASSED
... hundreds of passed tests ...
tests/test_bar.py::test_fail FAILED
AssertionError: expected 5, got 3
===== 1 failed, 499 passed =====
"""
compressor = LogCompressor()
result = compressor.compress(build_output)
print(result.compressed)
print(f"Compression ratio: {result.compression_ratio:.1%}")
```
**What gets preserved:**
* Errors and failures (any line with ERROR, FAILED, Exception)
* Warnings
* Full stack traces for debugging
* Test/build summary lines
* Section headers (structural markers like `=====`)
**What gets dropped:**
* Hundreds of `PASSED` lines
* Verbose success output
* Repeated patterns
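A rough sketch of that keep/drop policy, as a line filter over marker substrings (illustrative only; the real `LogCompressor` also keeps full stack traces and collapses repeated patterns):

```python
# Keep failures, errors, warnings, and summary markers; drop the noise.
KEEP_MARKERS = ("FAILED", "ERROR", "Error", "Exception", "WARNING", "=====")

def filter_log(text):
    return "\n".join(
        line for line in text.splitlines()
        if any(marker in line for marker in KEEP_MARKERS)
    )

build_output = """\
===== test session starts =====
collected 500 items
tests/test_foo.py::test_1 PASSED
tests/test_bar.py::test_fail FAILED
AssertionError: expected 5, got 3
===== 1 failed, 499 passed ====="""
print(filter_log(build_output))
```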
DiffCompressor [#diffcompressor]
Compresses unified diffs while keeping the actual changes and enough context to understand them.
```python
from headroom.transforms import DiffCompressor
diff_output = """
diff --git a/src/main.py b/src/main.py
--- a/src/main.py
+++ b/src/main.py
@@ -42,7 +42,7 @@
def process(items):
- return [x for x in items]
+ return [x.strip() for x in items if x]
"""
compressor = DiffCompressor()
result = compressor.compress(diff_output)
```
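The shape of that compression can be sketched as: always keep file headers and hunk markers, keep added/removed lines, and retain only a line or so of surrounding context (a toy version; the real compressor works hunk-by-hunk):

```python
# Keep diff headers, hunk markers, and changed lines plus nearby context.
def strip_diff(diff, context=1):
    lines = diff.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if line.startswith(("diff ", "--- ", "+++ ", "@@")):
            keep.add(i)                      # headers and hunk markers
        elif line.startswith(("+", "-")):
            for j in range(max(0, i - context), min(len(lines), i + context + 1)):
                keep.add(j)                  # the change plus nearby context
    return "\n".join(lines[i] for i in sorted(keep))

diff_output = """\
diff --git a/src/main.py b/src/main.py
--- a/src/main.py
+++ b/src/main.py
@@ -42,7 +42,7 @@
 def process(items):
-    return [x for x in items]
+    return [x.strip() for x in items if x]"""
print(strip_diff(diff_output))
```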
TextCompressor [#textcompressor]
General-purpose text compression with anchor preservation. Best for documentation, README files, and prose content.
```python
from headroom.transforms import TextCompressor
long_text = """
... thousands of lines of documentation ...
"""
compressor = TextCompressor()
result = compressor.compress(long_text, context="authentication")
print(result.compressed)
```
**What gets preserved:**
* Paragraphs relevant to the context query
* Headers and section markers
* Document structure and organization
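A toy version of that selection: header lines are always kept, and a paragraph survives only if it mentions the query (the real compressor uses proper relevance scoring rather than substring matching):

```python
# Keep headers unconditionally; keep paragraphs that mention the query.
def keep_relevant(text, query):
    kept = []
    for para in text.split("\n\n"):
        if para.lstrip().startswith("#") or query.lower() in para.lower():
            kept.append(para)
    return "\n\n".join(kept)

doc = (
    "# Auth\n\nAuthentication tokens are issued on login.\n\n"
    "# Billing\n\nInvoices are generated monthly."
)
print(keep_relevant(doc, "auth"))
```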
LLMLingua (Optional, Maximum Compression) [#llmlingua-optional-maximum-compression]
For maximum compression on any text, Headroom integrates with Microsoft's LLMLingua-2, a BERT-based token classifier trained via GPT-4 distillation. It achieves up to 20x compression while preserving semantic meaning.
```python
from headroom.transforms import LLMLinguaCompressor, LLMLinguaConfig
config = LLMLinguaConfig(
device="auto", # auto, cuda, cpu, mps
code_compression_rate=0.4, # Conservative for code
json_compression_rate=0.35, # Moderate for JSON
text_compression_rate=0.25, # Aggressive for text
)
compressor = LLMLinguaCompressor(config)
result = compressor.compress(long_output)
print(f"Before: {result.original_tokens} tokens")
print(f"After: {result.compressed_tokens} tokens")
print(f"Saved: {result.savings_percentage:.1f}%")
```
LLMLingua adds \~2GB of model weights and 50-200ms latency per request. Install it only when you need maximum compression: `pip install "headroom-ai[llmlingua]"`
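The rate values read as fractions of tokens to *keep*, so lower means more aggressive, which matches the comments in the config above. Assuming that interpretation, the expected output size is simple arithmetic:

```python
# Assuming compression_rate is the fraction of tokens retained,
# the expected output size is a simple product.
def expected_tokens(input_tokens, compression_rate):
    return int(input_tokens * compression_rate)

# With text_compression_rate=0.25, a 4,000-token blob shrinks to about 1,000.
print(expected_tokens(4000, 0.25))
```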
Memory Management [#memory-management]
```python
from headroom.transforms import unload_llmlingua_model, is_llmlingua_model_loaded
# Check if model is loaded
print(is_llmlingua_model_loaded()) # True
# Free ~1GB RAM when done
unload_llmlingua_model()
```
Content Type Detection [#content-type-detection]
If you're building your own routing logic, you can use the content type detector directly:
```python
from headroom.transforms import detect_content_type, ContentType
content = "src/main.py:42:def process():"
detection = detect_content_type(content)
if detection.content_type == ContentType.SEARCH_RESULTS:
result = SearchCompressor().compress(content, context="process")
elif detection.content_type == ContentType.BUILD_OUTPUT:
result = LogCompressor().compress(content)
elif detection.content_type == ContentType.PLAIN_TEXT:
result = TextCompressor().compress(content, context="process")
```
When Each Compressor Is Used [#when-each-compressor-is-used]
The `ContentRouter` selects the right compressor automatically. Here's when each fires:
| Content Pattern | Compressor | Detection Signal |
| -------------------------- | ------------------- | -------------------------------- |
| `file:line:content` lines | SearchCompressor | grep/ripgrep output format |
| pytest, npm, cargo markers | LogCompressor | Build tool output patterns |
| `---/+++` and `@@` markers | DiffCompressor | Unified diff format |
| Prose, documentation | TextCompressor | Fallback for non-structured text |
| Any (max compression mode) | LLMLinguaCompressor | Explicitly enabled |
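If you want to approximate those detection signals yourself, a few regexes get close. This is a simplified stand-in for the real detector, not its actual logic:

```python
import re

# Approximate the routing table's detection signals with regexes.
def guess_content_type(text):
    if re.search(r"^[\w./-]+:\d+:", text, re.M):              # file:line:content
        return "search_results"
    if re.search(r"^(--- |\+\+\+ |@@ )", text, re.M):         # unified diff markers
        return "diff"
    if re.search(r"collected \d+ items|FAILED|PASSED", text): # pytest-style output
        return "build_output"
    return "plain_text"                                       # fallback
```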
Performance [#performance]
| Compressor | Typical Input | Output | Speed |
| ------------------- | ------------- | ------------------ | -------- |
| SearchCompressor | 1,000 matches | 30-50 matches | \~2ms |
| LogCompressor | 5,000 lines | 100-200 lines | \~3ms |
| DiffCompressor | Large diff | Changed hunks only | \~2ms |
| TextCompressor | 10,000 chars | 2,000 chars | \~2ms |
| LLMLinguaCompressor | Any text | 5-20% of original | 50-200ms |
# Troubleshooting (/docs/troubleshooting)
Solutions for common Headroom issues.
Proxy Server Issues [#proxy-server-issues]
Proxy will not start [#proxy-will-not-start]
**Symptom**: `headroom proxy` fails or hangs.
```bash
# Check if port is already in use
lsof -i :8787
# Try a different port
headroom proxy --port 8788
# Check for missing dependencies
pip install "headroom-ai[proxy]"
# Run with debug logging
headroom proxy --log-level debug
```
Connection refused when calling proxy [#connection-refused-when-calling-proxy]
**Symptom**: `curl: (7) Failed to connect to localhost port 8787`
```bash
# Verify proxy is running
curl http://localhost:8787/health
# Check if proxy started on a different port
ps aux | grep headroom
```
Proxy returns errors for some requests [#proxy-returns-errors-for-some-requests]
**Symptom**: Some requests work, others fail with 502/503.
```bash
# Check proxy logs for the actual error
headroom proxy --log-level debug
# Verify API key is set
echo $OPENAI_API_KEY # or ANTHROPIC_API_KEY
# Test the underlying API directly
curl https://api.openai.com/v1/models \
-H "Authorization: Bearer $OPENAI_API_KEY"
```
No Token Savings [#no-token-savings]
**Symptom**: `stats['session']['tokens_saved_total']` is 0.
**Diagnosis**:
```python
stats = client.get_stats()
print(f"Mode: {stats['config']['mode']}") # Should be "optimize"
print(f"SmartCrusher: {stats['transforms']['smart_crusher_enabled']}")
```
**Common causes**:
* Mode is `audit` (observation only, no modifications)
* Messages do not contain tool outputs
* Tool outputs are below the 200-token threshold
* Data is not compressible (high uniqueness, code, grep results)
**Solutions**:
```python
# 1. Ensure mode is "optimize"
client = HeadroomClient(
original_client=OpenAI(),
provider=OpenAIProvider(),
default_mode="optimize", # NOT "audit"
)
# 2. Or override per-request
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
headroom_mode="optimize",
)
# 3. Lower the compression threshold
config = HeadroomConfig()
config.smart_crusher.min_tokens_to_crush = 100 # Default is 200
```
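One quick sanity check for the threshold cause above: estimate the payload's size with the common ~4 characters-per-token approximation (a rough heuristic, not the tokenizer Headroom actually uses):

```python
def rough_tokens(text):
    # ~4 characters per token is a coarse approximation for English text
    return len(text) // 4

payload = '{"status": "ok", "items": [1, 2, 3]}'
if rough_tokens(payload) < 200:  # default min_tokens_to_crush
    print("too small to compress -- Headroom passes it through unchanged")
```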
Compression Too Aggressive [#compression-too-aggressive]
**Symptom**: LLM responses are missing information that was in tool outputs.
```python
# 1. Keep more items
config = HeadroomConfig()
config.smart_crusher.max_items_after_crush = 50 # Default: 15
# 2. Skip compression for specific tools
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
headroom_tool_profiles={
"important_tool": {"skip_compression": True},
},
)
# 3. Disable SmartCrusher entirely
config.smart_crusher.enabled = False
```
High Latency [#high-latency]
**Symptom**: Requests take longer than expected.
**Diagnosis**:
```python
import time
import logging
logging.basicConfig(level=logging.DEBUG)
start = time.time()
response = client.chat.completions.create(...)
print(f"Total time: {time.time() - start:.2f}s")
```
**Solutions**:
```python
# 1. Use BM25 instead of embeddings (faster)
config = HeadroomConfig()
config.smart_crusher.relevance.tier = "bm25"
# 2. Increase threshold to skip small payloads
config.smart_crusher.min_tokens_to_crush = 500
# 3. Disable transforms you don't need
config.cache_aligner.enabled = False
config.rolling_window.enabled = False
```
Installation Issues [#installation-issues]
pip install fails with C++ compilation error [#pip-install-fails-with-c-compilation-error]
**Symptom**: `RuntimeError: Unsupported compiler -- at least C++11 support is needed!`
```bash
# Linux / Debian-based (including Docker)
apt-get install -y build-essential && pip install headroom-ai
# macOS (Xcode command line tools)
xcode-select --install && pip install headroom-ai
```
For Docker, install and remove build tools in one layer:
```dockerfile
FROM python:3.11-slim
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
&& pip install "headroom-ai[proxy]" \
&& apt-get purge -y build-essential && apt-get autoremove -y \
&& rm -rf /var/lib/apt/lists/*
```
ModuleNotFoundError: No module named 'headroom' [#modulenotfounderror-no-module-named-headroom]
```bash
# Check it is installed in the right environment
pip show headroom-ai
# If using virtual environment, ensure it is activated
source venv/bin/activate
# Reinstall
pip install --upgrade headroom-ai
```
Missing optional dependency [#missing-optional-dependency]
```bash
# For proxy server
pip install "headroom-ai[proxy]"
# For embedding-based relevance scoring
pip install "headroom-ai[relevance]"
# For code compression (tree-sitter)
pip install "headroom-ai[code]"
# For everything
pip install "headroom-ai[all]"
```
Provider-Specific Issues [#provider-specific-issues]
OpenAI: Invalid API key [#openai-invalid-api-key]
```python
import os
from openai import OpenAI
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not set")
client = HeadroomClient(
original_client=OpenAI(api_key=api_key),
provider=OpenAIProvider(),
)
```
Anthropic: Authentication error [#anthropic-authentication-error]
```python
import os
from anthropic import Anthropic
api_key = os.environ.get("ANTHROPIC_API_KEY")
client = HeadroomClient(
original_client=Anthropic(api_key=api_key),
provider=AnthropicProvider(),
)
```
Unknown model warnings [#unknown-model-warnings]
```python
# For custom/fine-tuned models, specify context limit
client = HeadroomClient(
original_client=OpenAI(),
provider=OpenAIProvider(),
model_context_limits={
"ft:gpt-4o-2024-08-06:my-org::abc123": 128000,
"my-custom-model": 32000,
},
)
```
ValidationError on Setup [#validationerror-on-setup]
```python
result = client.validate_setup()
print(result)
# Common issues:
# {"provider": {"ok": False, "error": "No API key"}}
# -> Set OPENAI_API_KEY or pass api_key to OpenAI()
#
# {"storage": {"ok": False, "error": "unable to open database"}}
# -> Check path permissions, use :memory: for testing
#
# {"config": {"ok": False, "error": "Invalid mode"}}
# -> Use "audit" or "optimize" only
```
For testing, use in-memory storage:
```python
client = HeadroomClient(
original_client=OpenAI(),
provider=OpenAIProvider(),
store_url="sqlite:///:memory:",
)
```
Debugging Techniques [#debugging-techniques]
Enable Full Logging [#enable-full-logging]
```python
import logging
# See everything
logging.basicConfig(
level=logging.DEBUG,
format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
# Or just Headroom logs
logging.getLogger("headroom").setLevel(logging.DEBUG)
```
Use Simulation to Inspect Transforms [#use-simulation-to-inspect-transforms]
```python
plan = client.chat.completions.simulate(
model="gpt-4o",
messages=messages,
)
print(f"Tokens: {plan.tokens_before} -> {plan.tokens_after}")
print(f"Transforms: {plan.transforms_applied}")
print(f"Waste signals: {plan.waste_signals}")
import json
print(json.dumps(plan.messages_optimized, indent=2))
```
Test Transforms Directly [#test-transforms-directly]
```python
from headroom import SmartCrusher, Tokenizer
from headroom.config import SmartCrusherConfig
import json
config = SmartCrusherConfig()
crusher = SmartCrusher(config)
tokenizer = Tokenizer()
messages = [
{
"role": "tool",
"content": json.dumps({"items": list(range(100))}),
"tool_call_id": "1",
}
]
result = crusher.apply(messages, tokenizer)
print(f"Tokens: {result.tokens_before} -> {result.tokens_after}")
```
Getting Help [#getting-help]
1. Enable debug logging and check the output
2. Use `simulate()` to see what transforms would apply
3. Run `validate_setup()` for configuration issues
4. File an issue at [github.com/headroom-sdk/headroom](https://github.com/headroom-sdk/headroom/issues) with your Headroom version, Python version, provider, debug log output, and minimal reproduction code
# Vercel AI SDK (/docs/vercel-ai-sdk)
Headroom integrates with the [Vercel AI SDK](https://sdk.vercel.ai) through three patterns: a one-liner wrapper, composable middleware, and standalone message compression.
Installation [#installation]
```bash
npm install headroom-ai ai @ai-sdk/openai
```
The TypeScript SDK sends messages to a local Headroom proxy for compression. Start the proxy before using the SDK:
```bash
pip install "headroom-ai[proxy]"
headroom proxy
```
withHeadroom() one-liner [#withheadroom-one-liner]
The simplest integration. Wraps any Vercel AI SDK language model with automatic compression:
```ts twoslash
import { withHeadroom } from 'headroom-ai/vercel-ai';
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';
const model = withHeadroom(openai('gpt-4o'));
const { text } = await generateText({
model,
messages: [
{ role: 'user', content: 'Summarize these results...' },
],
});
```
`withHeadroom()` calls `wrapLanguageModel` + `headroomMiddleware()` under the hood. It works with any provider (`@ai-sdk/openai`, `@ai-sdk/anthropic`, `@ai-sdk/google`, etc.).
headroomMiddleware() for composition [#headroommiddleware-for-composition]
Use the middleware directly when you need to compose it with other middleware:
```ts twoslash
// @noErrors
import { headroomMiddleware } from 'headroom-ai/vercel-ai';
import { wrapLanguageModel } from 'ai';
import { openai } from '@ai-sdk/openai';
const model = wrapLanguageModel({
model: openai('gpt-4o'),
middleware: headroomMiddleware(),
});
```
Pass options to control compression behavior:
```ts twoslash
import { headroomMiddleware } from 'headroom-ai/vercel-ai';
const middleware = headroomMiddleware({
model: 'gpt-4o',
baseUrl: 'http://localhost:8787',
});
```
compressVercelMessages() standalone [#compressvercelmessages-standalone]
Compress Vercel-format messages directly without wrapping a model. Useful for custom pipelines:
```ts twoslash
// @noErrors
import { compressVercelMessages } from 'headroom-ai/vercel-ai';
const result = await compressVercelMessages(messages, {
model: 'gpt-4o',
});
console.log(`Saved ${result.tokensSaved} tokens`);
// result.messages is in Vercel format, ready for the AI SDK
```
Streaming with streamText [#streaming-with-streamtext]
Compression happens before the request. Streaming responses are unaffected:
```ts twoslash
// @noErrors
import { withHeadroom } from 'headroom-ai/vercel-ai';
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';
const model = withHeadroom(openai('gpt-4o'));
const result = streamText({
model,
messages: longConversation,
});
for await (const chunk of result.textStream) {
process.stdout.write(chunk);
}
```
generateObject with compressed context [#generateobject-with-compressed-context]
Works with structured output:
```ts twoslash
// @noErrors
import { withHeadroom } from 'headroom-ai/vercel-ai';
import { openai } from '@ai-sdk/openai';
import { generateText, Output } from 'ai';
import { z } from 'zod';
const model = withHeadroom(openai('gpt-4o'));
const { output } = await generateText({
model,
output: Output.object({
schema: z.object({
summary: z.string(),
severity: z.enum(['low', 'medium', 'high']),
}),
}),
messages: largeConversationHistory,
});
```
How it works [#how-it-works]
1. Messages are converted from Vercel format to OpenAI format
2. Headroom compresses them via the proxy's `/v1/compress` endpoint
3. Compressed messages are converted back to Vercel format
4. The original model receives the smaller prompt
All other model behavior (tool calling, structured output, streaming) is unchanged.