Introduction
Headroom is the context optimization layer for LLM applications. Compress tool outputs, DB results, file reads, and RAG results before they reach the model. Same answers, fraction of the tokens.
Headroom compresses everything your AI agent reads -- tool outputs, database results, file reads, RAG retrievals, API responses -- before it reaches the LLM. The model sees less noise, responds faster, and costs less.
Quick preview
TypeScript:

```typescript
import { compress } from 'headroom-ai';

const messages = [
  { role: 'user' as const, content: 'Analyze these results' },
];

const result = await compress(messages, { model: 'gpt-4o' });
console.log(`Saved ${result.tokensSaved} tokens (${(result.compressionRatio * 100).toFixed(0)}%)`);
```

Python:

```python
from headroom import compress

result = compress(messages, model="gpt-4o")
response = client.messages.create(
    model="gpt-4o",
    messages=result.messages,
)
print(f"Saved {result.tokens_saved} tokens ({result.compression_ratio:.0%})")
```
What gets compressed
| Content type | What happens | Typical savings |
|---|---|---|
| JSON arrays (tool outputs) | Statistical analysis keeps errors, anomalies, boundaries | 70--90% |
| Source code | AST-aware compression preserves signatures, collapses bodies | 40--70% |
| Build/test logs | Keeps failures and errors, drops passing noise | 80--95% |
| Search results | Ranks by relevance, keeps top matches | 60--80% |
| Plain text | ModernBERT token classification removes redundancy | 30--50% |
| Git diffs | Preserves change hunks, drops unchanged context | 40--60% |
| Images | ML router selects optimal resize/quality tradeoff | 40--90% |
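The routing idea behind this table can be pictured with a toy sketch (hypothetical code, not Headroom's actual detector): guess the content type, then dispatch to a per-type compressor. The stub compressors are placeholders for the strategies listed above.

```python
import json

def detect(content: str) -> str:
    """Crude content-type guess; a stand-in for a real detector."""
    stripped = content.lstrip()
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    if stripped.startswith("diff --git"):
        return "diff"
    if "def " in content or "function " in content:
        return "code"
    return "text"

# Stub compressors: each would apply the per-type strategy from the table.
COMPRESSORS = {
    "json": lambda c: c,  # statistical analysis of arrays
    "diff": lambda c: c,  # keep hunks, drop unchanged context
    "code": lambda c: c,  # AST-aware body collapsing
    "text": lambda c: c,  # token-classification redundancy removal
}

def route(content: str) -> str:
    return COMPRESSORS[detect(content)](content)
```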
Where Headroom fits
```
Your Agent / App
      |
      |  tool outputs, logs, DB reads, RAG results, file reads, API responses
      v
Headroom   <-- proxy, Python library, TS SDK, or framework integration
      |
      v
LLM Provider (OpenAI, Anthropic, Google, Bedrock, 100+ via LiteLLM)
```

Headroom works as a transparent proxy (zero code changes), a Python function (`compress()`), a TypeScript function (`compress()`), or a framework integration (LangChain, Agno, Strands, LiteLLM, Vercel AI SDK, MCP).
Real-world results
100 production log entries. One critical error buried at position 67.
| Metric | Baseline | Headroom |
|---|---|---|
| Input tokens | 10,144 | 1,260 |
| Correct answers | 4/4 | 4/4 |
87.6% fewer tokens. Same answer. The FATAL error was automatically preserved -- not by keyword matching, but by statistical analysis of field variance.
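The intuition behind variance-based selection can be shown in a few lines (a toy illustration, not Headroom's actual algorithm): score each entry by how rare its field values are across the batch, so a one-off FATAL among 99 INFO lines surfaces automatically while repetitive entries score near the floor.

```python
from collections import Counter

def keep_anomalies(entries: list[dict], keep_ratio: float = 0.1) -> list[dict]:
    """Keep the entries whose field values are statistically rare.

    Assumes a uniform schema across entries. A value shared by 99
    entries contributes little to the score; a one-off value dominates.
    """
    n = len(entries)
    counts = {f: Counter(e.get(f) for e in entries) for f in entries[0]}

    def rarity(e: dict) -> float:
        # Sum of inverse relative frequencies per field.
        return sum(n / counts[f][e.get(f)] for f in counts)

    budget = max(1, int(n * keep_ratio))
    return sorted(entries, key=rarity, reverse=True)[:budget]

logs = [{"level": "INFO", "msg": "ok"}] * 99 + [{"level": "FATAL", "msg": "disk full"}]
survivors = keep_anomalies(logs)  # the FATAL entry ranks first
```

No keyword list is involved: the FATAL line wins purely because its field values are statistical outliers.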
| Scenario | Before (tokens) | After (tokens) | Savings |
|---|---|---|---|
| Code search (100 results) | 17,765 | 1,408 | 92% |
| SRE incident debugging | 65,694 | 5,118 | 92% |
| Codebase exploration | 78,502 | 41,254 | 47% |
| GitHub issue triage | 54,174 | 14,761 | 73% |
Key Features
Lossless Compression (CCR)
Compresses aggressively, stores originals, gives the LLM a tool to retrieve full details. Nothing is thrown away.
Learn more →

Smart Content Detection
Auto-detects JSON, code, logs, text, diffs, HTML. Routes each to the best compressor. Zero configuration needed.
Learn more →

Cache Optimization
Stabilizes prefixes so provider KV caches hit. Tracks frozen messages to preserve the 90% read discount.
Learn more →

Image Compression
40-90% token reduction via trained ML router. Automatically selects resize/quality tradeoff per image.
Learn more →

Persistent Memory
Hierarchical memory (user/session/agent/turn) with SQLite + HNSW backends. Survives across conversations.
Learn more →

Failure Learning
Reads past sessions, finds failed tool calls, correlates with what succeeded, writes learnings to CLAUDE.md.
Learn more →

Multi-Agent Context
Compress what moves between agents. Any framework.
```python
ctx = SharedContext()
ctx.put("research", big_output)
summary = ctx.get("research")
```

Learn more →

Metrics & Observability
Prometheus endpoint, per-request logging, cost tracking, budget limits, pipeline timing breakdowns.
Learn more →

Framework Integrations
LangChain
Wrap any chat model. Supports memory, retrievers, tools, streaming, async.
```python
from langchain_openai import ChatOpenAI
from headroom.integrations.langchain import HeadroomChatModel

llm = HeadroomChatModel(ChatOpenAI())
```

Guide →
Agno
Full agent framework integration with observability hooks.
```python
from agno.agent import Agent
from agno.models.anthropic import Claude
from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(Claude())
agent = Agent(model=model)
```

Guide →
Strands
Model wrapping + tool output hook provider for Strands Agents.
```python
from strands import Agent
from headroom.integrations.strands import HeadroomStrandsModel

model = HeadroomStrandsModel(...)
agent = Agent(model=model)
```

Guide →
MCP Tools
Three tools for Claude Code, Cursor, or any MCP client: headroom_compress, headroom_retrieve, headroom_stats.
```shell
headroom mcp install && claude
```

Guide →
TypeScript SDK
compress(), Vercel AI SDK middleware, OpenAI and Anthropic client wrappers.
```shell
npm install headroom-ai
```

Guide →
Vercel AI SDK
One-liner withHeadroom() or headroomMiddleware() for any Vercel AI SDK model.
```typescript
import { openai } from '@ai-sdk/openai'
import { withHeadroom } from 'headroom-ai/vercel-ai'

const model = withHeadroom(openai('gpt-4o'))
```

Guide →

Nothing is lost
Compressed content goes into the CCR store (Compress-Cache-Retrieve). The LLM gets a headroom_retrieve tool and can fetch full originals when it needs more detail. Compression is aggressive but reversible.
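The compress-cache-retrieve pattern can be sketched in a few lines (a hypothetical minimal store, not Headroom's implementation): cache the original under a content hash and embed a reference the model can pass back through a retrieve tool when it needs full detail.

```python
import hashlib

class CCRStore:
    """Minimal compress-cache-retrieve sketch: compression drops detail
    from the prompt, but the original stays retrievable by key."""

    def __init__(self) -> None:
        self._originals: dict[str, str] = {}

    def compress(self, content: str, summary: str) -> str:
        # Key the original by a content hash, then hand the model a
        # summary plus a reference it can use to get the rest back.
        key = hashlib.sha256(content.encode()).hexdigest()[:12]
        self._originals[key] = content
        return f"{summary}\n[truncated: call retrieve('{key}') for the original]"

    def retrieve(self, key: str) -> str:
        return self._originals[key]

store = CCRStore()
compressed = store.compress("line1\n" * 500, summary="500 identical 'line1' lines")
```

In Headroom itself, the retrieve side is exposed to the LLM as the `headroom_retrieve` tool, which is what makes the aggressive compression reversible.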