Introduction
Headroom is the context optimization layer for LLM applications. Compress tool outputs, DB results, file reads, and RAG results before they reach the model. Same answers, fraction of the tokens.
Headroom compresses everything your AI agent reads -- tool outputs, database results, file reads, RAG retrievals, API responses -- before it reaches the LLM. The model sees less noise, responds faster, and costs less.
Quick preview
TypeScript:

```typescript
import { compress } from 'headroom-ai';

const messages = [
  { role: 'user' as const, content: 'Analyze these results' },
];

const result = await compress(messages, { model: 'gpt-4o' });
console.log(`Saved ${result.tokensSaved} tokens (${(result.compressionRatio * 100).toFixed(0)}%)`);
```

Python:

```python
from headroom import compress

result = compress(messages, model="gpt-4o")
response = client.messages.create(
    model="gpt-4o",
    messages=result.messages,
)
print(f"Saved {result.tokens_saved} tokens ({result.compression_ratio:.0%})")
```
What gets compressed
| Content type | What happens | Typical savings |
|---|---|---|
| JSON arrays (tool outputs) | Statistical analysis keeps errors, anomalies, boundaries | 70--90% |
| Source code | AST-aware compression preserves signatures, collapses bodies | 40--70% |
| Build/test logs | Keeps failures and errors, drops passing noise | 80--95% |
| Search results | Ranks by relevance, keeps top matches | 60--80% |
| Plain text | ModernBERT token classification removes redundancy | 30--50% |
| Git diffs | Preserves change hunks, drops unchanged context | 40--60% |
| Images | ML router selects optimal resize/quality tradeoff | 40--90% |
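The routing idea behind this table can be pictured with a toy sketch (hypothetical code, not Headroom's actual detector): guess the content type, then dispatch to a per-type compressor. The stub compressors are placeholders for the strategies listed above.

```python
import json

def detect(content: str) -> str:
    """Crude content-type guess; a stand-in for a real detector."""
    stripped = content.lstrip()
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    if stripped.startswith("diff --git"):
        return "diff"
    if "def " in content or "function " in content:
        return "code"
    return "text"

# Stub compressors: each would apply the per-type strategy from the table.
COMPRESSORS = {
    "json": lambda c: c,  # statistical analysis of arrays
    "diff": lambda c: c,  # keep hunks, drop unchanged context
    "code": lambda c: c,  # AST-aware body collapsing
    "text": lambda c: c,  # token-classification redundancy removal
}

def route(content: str) -> str:
    return COMPRESSORS[detect(content)](content)
```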
Where Headroom fits
```
Your Agent / App
      |
      |  tool outputs, logs, DB reads, RAG results, file reads, API responses
      v
Headroom   <-- proxy, Python library, TS SDK, or framework integration
      |
      v
LLM Provider (OpenAI, Anthropic, Google, Bedrock, 100+ via LiteLLM)
```

Headroom works as a transparent proxy (zero code changes), a Python function (`compress()`), a TypeScript function (`compress()`), or a framework integration (LangChain, Agno, Strands, LiteLLM, Vercel AI SDK, MCP).
Real-world results
100 production log entries. One critical error buried at position 67.
| Metric | Baseline | Headroom |
|---|---|---|
| Input tokens | 10,144 | 1,260 |
| Correct answers | 4/4 | 4/4 |
87.6% fewer tokens. Same answer. The FATAL error was automatically preserved -- not by keyword matching, but by statistical analysis of field variance.
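The intuition behind variance-based selection can be shown in a few lines (a toy illustration, not Headroom's actual algorithm): score each entry by how rare its field values are across the batch, so a one-off FATAL among 99 INFO lines surfaces automatically while repetitive entries score near the floor.

```python
from collections import Counter

def keep_anomalies(entries: list[dict], keep_ratio: float = 0.1) -> list[dict]:
    """Keep the entries whose field values are statistically rare.

    Assumes a uniform schema across entries. A value shared by 99
    entries contributes little to the score; a one-off value dominates.
    """
    n = len(entries)
    counts = {f: Counter(e.get(f) for e in entries) for f in entries[0]}

    def rarity(e: dict) -> float:
        # Sum of inverse relative frequencies per field.
        return sum(n / counts[f][e.get(f)] for f in counts)

    budget = max(1, int(n * keep_ratio))
    return sorted(entries, key=rarity, reverse=True)[:budget]

logs = [{"level": "INFO", "msg": "ok"}] * 99 + [{"level": "FATAL", "msg": "disk full"}]
survivors = keep_anomalies(logs)  # the FATAL entry ranks first
```

No keyword list is involved: the FATAL line wins purely because its field values are statistical outliers.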
| Scenario | Before (tokens) | After (tokens) | Savings |
|---|---|---|---|
| Code search (100 results) | 17,765 | 1,408 | 92% |
| SRE incident debugging | 65,694 | 5,118 | 92% |
| Codebase exploration | 78,502 | 41,254 | 47% |
| GitHub issue triage | 54,174 | 14,761 | 73% |
Key Features
Lossless Compression (CCR)
Compresses aggressively, stores originals, gives the LLM a tool to retrieve full details. Nothing is thrown away.
Learn more →

Smart Content Detection
Auto-detects JSON, code, logs, text, diffs, HTML. Routes each to the best compressor. Zero configuration needed.
Learn more →

Cache Optimization
Stabilizes prefixes so provider KV caches hit. Tracks frozen messages to preserve the 90% read discount.
Learn more →

Image Compression
40-90% token reduction via trained ML router. Automatically selects resize/quality tradeoff per image.
Learn more →

Persistent Memory
Hierarchical memory (user/session/agent/turn) with SQLite + HNSW backends. Survives across conversations.
Learn more →

Failure Learning
Reads past sessions, finds failed tool calls, correlates with what succeeded, writes learnings to CLAUDE.md.
Learn more →

Multi-Agent Context
Compress what moves between agents. Any framework.
```python
ctx = SharedContext()
ctx.put("research", big_output)
summary = ctx.get("research")
```

Learn more →

Metrics & Observability
Prometheus endpoint, per-request logging, cost tracking, budget limits, pipeline timing breakdowns.
Learn more →

Framework Integrations
LangChain
Wrap any chat model. Supports memory, retrievers, tools, streaming, async.
```python
from langchain_openai import ChatOpenAI
from headroom.integrations.langchain import HeadroomChatModel

llm = HeadroomChatModel(ChatOpenAI())
```

Guide →
Agno
Full agent framework integration with observability hooks.
```python
from agno.agent import Agent
from agno.models.anthropic import Claude
from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(Claude())
agent = Agent(model=model)
```

Guide →
Strands
Model wrapping + tool output hook provider for Strands Agents.
```python
from strands import Agent
from headroom.integrations.strands import HeadroomStrandsModel

model = HeadroomStrandsModel(...)
agent = Agent(model=model)
```

Guide →
MCP Tools
Three tools for Claude Code, Cursor, or any MCP client: headroom_compress, headroom_retrieve, headroom_stats.
```shell
headroom mcp install && claude
```

Guide →
TypeScript SDK
compress(), Vercel AI SDK middleware, OpenAI and Anthropic client wrappers.
```shell
npm install headroom-ai
```

Guide →
Vercel AI SDK
One-liner withHeadroom() or headroomMiddleware() for any Vercel AI SDK model.
```typescript
import { openai } from '@ai-sdk/openai'
import { withHeadroom } from 'headroom-ai/vercel-ai'

const model = withHeadroom(openai('gpt-4o'))
```

Guide →

Nothing is lost
Compressed content goes into the CCR store (Compress-Cache-Retrieve). The LLM gets a headroom_retrieve tool and can fetch full originals when it needs more detail. Compression is aggressive but reversible.
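The compress-cache-retrieve pattern can be sketched in a few lines (a hypothetical minimal store, not Headroom's implementation): cache the original under a content hash and embed a reference the model can pass back through a retrieve tool when it needs full detail.

```python
import hashlib

class CCRStore:
    """Minimal compress-cache-retrieve sketch: compression drops detail
    from the prompt, but the original stays retrievable by key."""

    def __init__(self) -> None:
        self._originals: dict[str, str] = {}

    def compress(self, content: str, summary: str) -> str:
        # Key the original by a content hash, then hand the model a
        # summary plus a reference it can use to get the rest back.
        key = hashlib.sha256(content.encode()).hexdigest()[:12]
        self._originals[key] = content
        return f"{summary}\n[truncated: call retrieve('{key}') for the original]"

    def retrieve(self, key: str) -> str:
        return self._originals[key]

store = CCRStore()
compressed = store.compress("line1\n" * 500, summary="500 identical 'line1' lines")
```

In Headroom itself, the retrieve side is exposed to the LLM as the `headroom_retrieve` tool, which is what makes the aggressive compression reversible.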