Headroom

Introduction

Headroom is the context optimization layer for LLM applications. Compress tool outputs, DB results, file reads, and RAG results before they reach the model. Same answers, fraction of the tokens.


87% Token Reduction
100% Accuracy
6 Algorithms
100+ Providers

Headroom compresses everything your AI agent reads -- tool outputs, database results, file reads, RAG retrievals, API responses -- before it reaches the LLM. The model sees less noise, responds faster, and costs less.

Quick preview

import { compress } from 'headroom-ai';

const messages = [
  { role: 'user' as const, content: 'Analyze these results' },
];

const result = await compress(messages, { model: 'gpt-4o' });
console.log(`Saved ${result.tokensSaved} tokens (${(result.compressionRatio * 100).toFixed(0)}%)`);

from headroom import compress

result = compress(messages, model="gpt-4o")
response = client.messages.create(
    model="gpt-4o",
    messages=result.messages,
)
print(f"Saved {result.tokens_saved} tokens ({result.compression_ratio:.0%})")

Community stats

41.8B Tokens Saved
$176.6K Cost Saved
1.2M Requests Optimized
889 Active Instances
View detailed charts and breakdowns →

What gets compressed

Content type | What happens | Typical savings
JSON arrays (tool outputs) | Statistical analysis keeps errors, anomalies, boundaries | 70--90%
Source code | AST-aware compression preserves signatures, collapses bodies | 40--70%
Build/test logs | Keeps failures and errors, drops passing noise | 80--95%
Search results | Ranks by relevance, keeps top matches | 60--80%
Plain text | ModernBERT token classification removes redundancy | 30--50%
Git diffs | Preserves change hunks, drops unchanged context | 40--60%
Images | ML router selects optimal resize/quality tradeoff | 40--90%
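The routing above can be illustrated with a toy detector. This is not Headroom's implementation; the heuristics below are invented for illustration, showing the general idea of guessing a content type and dispatching to a per-type compressor.

```python
import json

def detect_content_type(text: str) -> str:
    """Toy content-type detector (illustrative heuristics only)."""
    stripped = text.strip()
    # JSON: starts like an object/array and parses cleanly.
    if stripped[:1] in "[{":
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    # Diffs: unified-diff headers.
    if stripped.startswith(("diff --git", "--- ", "+++ ")):
        return "diff"
    # Logs: most lines carry a severity keyword.
    lines = stripped.splitlines()
    levels = ("DEBUG", "INFO", "WARN", "ERROR", "FATAL")
    if lines and sum(any(l in line for l in levels) for line in lines) > len(lines) / 2:
        return "logs"
    return "text"

print(detect_content_type('[{"id": 1}]'))                     # json
print(detect_content_type("INFO ok\nERROR boom\nINFO done"))  # logs
```

A real detector would also handle code, HTML, and mixed content, but the dispatch pattern is the same: detect once, then route each payload to the compressor built for it.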

Where Headroom fits

Your Agent / App
    |
    |  tool outputs, logs, DB reads, RAG results, file reads, API responses
    v
 Headroom  <-- proxy, Python library, TS SDK, or framework integration
    |
    v
 LLM Provider  (OpenAI, Anthropic, Google, Bedrock, 100+ via LiteLLM)

Headroom works as a transparent proxy (zero code changes), a Python function (compress()), a TypeScript function (compress()), or a framework integration (LangChain, Agno, Strands, LiteLLM, Vercel AI SDK, MCP).

Real-world results

100 production log entries. One critical error buried at position 67.

Metric | Baseline | Headroom
Input tokens | 10,144 | 1,260
Correct answers | 4/4 | 4/4

87.6% fewer tokens. Same answer. The FATAL error was automatically preserved -- not by keyword matching, but by statistical analysis of field variance.
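The idea of preserving outliers by statistics rather than keyword matching can be sketched in a few lines. This toy version (the 5% rarity threshold and boundary count are invented for illustration, not Headroom's algorithm) keeps boundary entries plus any entry whose field value is statistically rare:

```python
from collections import Counter

def keep_anomalies(entries: list[dict], field: str, keep_first: int = 3) -> list[dict]:
    """Keep boundary entries plus entries whose `field` value is rare."""
    counts = Counter(e[field] for e in entries)
    total = len(entries)
    kept = []
    for i, entry in enumerate(entries):
        is_boundary = i < keep_first or i >= total - keep_first
        is_rare = counts[entry[field]] / total < 0.05  # unusual value for this field
        if is_boundary or is_rare:
            kept.append(entry)
    return kept

logs = [{"level": "INFO", "msg": f"ok {i}"} for i in range(100)]
logs[67] = {"level": "FATAL", "msg": "disk full"}      # buried critical error
compressed = keep_anomalies(logs, "level")
print(len(compressed))                                  # 7 of 100 entries survive
print(any(e["level"] == "FATAL" for e in compressed))   # True
```

No "FATAL" keyword list is needed: the entry survives because its level value is rare relative to the field's distribution.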

Scenario | Before | After | Savings
Code search (100 results) | 17,765 | 1,408 | 92%
SRE incident debugging | 65,694 | 5,118 | 92%
Codebase exploration | 78,502 | 41,254 | 47%
GitHub issue triage | 54,174 | 14,761 | 73%

Key Features

Lossless Compression (CCR)

Compresses aggressively, stores originals, gives the LLM a tool to retrieve full details. Nothing is thrown away.

Learn more →

Smart Content Detection

Auto-detects JSON, code, logs, text, diffs, HTML. Routes each to the best compressor. Zero configuration needed.

Learn more →

Cache Optimization

Stabilizes prefixes so provider KV caches hit. Tracks frozen messages to preserve the 90% read discount.
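Why prefix stability matters for KV caching can be shown with a toy check (illustrative only): provider caches hit on the longest byte-identical message prefix, so rewriting an early message invalidates everything after it, while append-only growth keeps the whole prior conversation cached.

```python
def shared_prefix_len(old: list[str], new: list[str]) -> int:
    """Number of leading messages that are byte-identical (the cacheable prefix)."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n

turn1 = ["system prompt", "user: q1", "assistant: a1"]
# Appending only: the entire old conversation stays a cache hit.
turn2_good = turn1 + ["user: q2"]
# Rewriting the system prompt: the cache is invalidated from message 0.
turn2_bad = ["system prompt v2"] + turn1[1:] + ["user: q2"]

print(shared_prefix_len(turn1, turn2_good))  # 3
print(shared_prefix_len(turn1, turn2_bad))   # 0
```

This is why Headroom tracks frozen messages: compressing a message that is already part of a cached prefix would cost more in cache misses than it saves in tokens.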

Learn more →

Image Compression

40-90% token reduction via trained ML router. Automatically selects resize/quality tradeoff per image.

Learn more →

Persistent Memory

Hierarchical memory (user/session/agent/turn) with SQLite + HNSW backends. Survives across conversations.
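The hierarchical lookup can be sketched as scope fallback. This is a toy in-memory version invented for illustration (Headroom's actual backends are SQLite and HNSW): narrower scopes shadow broader ones, and broad scopes persist after narrow ones are cleared.

```python
class ToyMemory:
    """Illustrative scoped memory: narrower scopes shadow broader ones."""
    SCOPES = ("turn", "agent", "session", "user")  # narrowest first

    def __init__(self):
        self.store = {scope: {} for scope in self.SCOPES}

    def put(self, scope: str, key: str, value):
        self.store[scope][key] = value

    def get(self, key: str):
        # Walk from narrowest to broadest scope; first hit wins.
        for scope in self.SCOPES:
            if key in self.store[scope]:
                return self.store[scope][key]
        return None

mem = ToyMemory()
mem.put("user", "tone", "formal")     # survives across conversations
mem.put("turn", "tone", "playful")    # overrides for this turn only
print(mem.get("tone"))                # playful
del mem.store["turn"]["tone"]         # turn ends
print(mem.get("tone"))                # formal
```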

Learn more →

Failure Learning

Reads past sessions, finds failed tool calls, correlates with what succeeded, writes learnings to CLAUDE.md.

Learn more →

Multi-Agent Context

Compress what moves between agents. Any framework.

ctx = SharedContext()
ctx.put("research", big_output)
summary = ctx.get("research")
Learn more →

Metrics & Observability

Prometheus endpoint, per-request logging, cost tracking, budget limits, pipeline timing breakdowns.

Learn more →

Framework Integrations

LangChain

Wrap any chat model. Supports memory, retrievers, tools, streaming, async.

from headroom.integrations.langchain import HeadroomChatModel
llm = HeadroomChatModel(ChatOpenAI())
LangChain Guide →

Agno

Full agent framework integration with observability hooks.

from headroom.integrations.agno import HeadroomAgnoModel
model = HeadroomAgnoModel(Claude())
agent = Agent(model=model)
Agno Guide →

Strands

Model wrapping + tool output hook provider for Strands Agents.

from headroom.integrations.strands import HeadroomStrandsModel
model = HeadroomStrandsModel(...)
agent = Agent(model=model)
Strands Guide →

MCP Tools

Three tools for Claude Code, Cursor, or any MCP client: headroom_compress, headroom_retrieve, headroom_stats.

headroom mcp install && claude
MCP Tools Guide →

TypeScript SDK

compress(), Vercel AI SDK middleware, OpenAI and Anthropic client wrappers.

npm install headroom-ai
TypeScript SDK Guide →

Vercel AI SDK

One-liner withHeadroom() or headroomMiddleware() for any Vercel AI SDK model.

import { withHeadroom } from 'headroom-ai/vercel-ai'
const model = withHeadroom(openai('gpt-4o'))
Vercel AI SDK Guide →
All integration patterns →

Nothing is lost

Compressed content goes into the CCR store (Compress-Cache-Retrieve). The LLM gets a headroom_retrieve tool and can fetch full originals when it needs more detail. Compression is aggressive but reversible.
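The compress-cache-retrieve loop can be sketched as follows. This toy store is invented for illustration (not Headroom's API): compress aggressively, cache the original under a content-hash key, and let the model fetch the full text by key when the summary isn't enough.

```python
import hashlib

class ToyCCRStore:
    """Illustrative compress-cache-retrieve store."""
    def __init__(self):
        self.originals = {}

    def compress(self, text: str, head: int = 80) -> str:
        key = hashlib.sha256(text.encode()).hexdigest()[:8]
        self.originals[key] = text  # nothing is thrown away
        return f"{text[:head]}... [truncated; retrieve with key {key}]"

    def retrieve(self, key: str) -> str:
        """What a retrieve tool would call when the LLM asks for full detail."""
        return self.originals[key]

store = ToyCCRStore()
big = "line\n" * 10_000
small = store.compress(big)
print(len(small) < 200)              # True: compact summary goes to the LLM
key = small.rsplit("key ", 1)[1].rstrip("]")
print(store.retrieve(key) == big)    # True: full original on demand
```

The key property is reversibility: the summary is what the model reads by default, but the original is always one tool call away.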
