Headroom

Proxy Server

Run the Headroom proxy to compress LLM traffic for any client: Claude Code, Cursor, the OpenAI SDK, or custom apps.

The Headroom proxy is a standalone HTTP server that compresses all LLM traffic passing through it. Point any client at the proxy and get automatic context optimization.

Starting the proxy

# Basic usage
headroom proxy

# Custom host and port
headroom proxy --host 0.0.0.0 --port 8080

# With logging and budget
headroom proxy \
  --log-file /var/log/headroom.jsonl \
  --budget 100.0

Telemetry is enabled by default. Opt out with HEADROOM_TELEMETRY=off or --no-telemetry.

CLI options

Core

Option            Default                 Description
--host            127.0.0.1               Host to bind to
--port            8787                    Port to bind to
--no-optimize     false                   Disable optimization (passthrough mode)
--no-cache        false                   Disable semantic caching
--no-rate-limit   false                   Disable rate limiting
--log-file        None                    Path to JSONL log file
--budget          None                    Daily budget limit in USD
--openai-api-url  https://api.openai.com  Custom OpenAI API URL

Context management

Option                    Default  Description
--no-intelligent-context  false    Fall back to RollingWindow (oldest-first drops)
--no-intelligent-scoring  false    Disable multi-factor importance scoring
--no-compress-first       false    Disable trying deeper compression before dropping

By default, the proxy uses IntelligentContextManager which scores messages by recency, semantic similarity, TOIN-learned patterns, error indicators, and forward references. Dropped messages are stored in CCR for retrieval.

# Use legacy RollingWindow
headroom proxy --no-intelligent-context

# Faster but less intelligent scoring
headroom proxy --no-intelligent-scoring
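The multi-factor scoring described above can be pictured as a weighted sum over per-message signals. The sketch below is a hypothetical illustration, not Headroom's actual implementation: the factor names follow the description, but the weights and the factors dict are invented for the example.

```python
# Toy sketch of multi-factor importance scoring.
# Weights and helper signals are hypothetical, for illustration only.
def importance_score(msg_index: int, total_msgs: int, factors: dict) -> float:
    """Combine per-message signals into one importance score in [0, 1].

    `factors` holds illustrative 0..1 signals: semantic similarity to the
    current query, TOIN-pattern match strength, error indicators, and
    forward references from later messages.
    """
    recency = (msg_index + 1) / total_msgs  # newer messages score higher
    weights = {  # illustrative weights only
        "recency": 0.30,
        "similarity": 0.30,
        "toin": 0.15,
        "error": 0.15,
        "forward_ref": 0.10,
    }
    score = weights["recency"] * recency
    score += weights["similarity"] * factors.get("similarity", 0.0)
    score += weights["toin"] * factors.get("toin", 0.0)
    score += weights["error"] * factors.get("error", 0.0)
    score += weights["forward_ref"] * factors.get("forward_ref", 0.0)
    return score

# Lowest-scoring messages are dropped first (and stored in CCR for retrieval).
```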

LLMLingua (ML compression)

Option              Default  Description
--llmlingua         false    Enable LLMLingua-2 ML-based compression
--llmlingua-device  auto     Device: auto, cuda, cpu, mps
--llmlingua-rate    0.3      Target compression rate (0.3 = keep 30%)

pip install "headroom-ai[llmlingua]"

headroom proxy --llmlingua --llmlingua-device cuda
headroom proxy --llmlingua --llmlingua-rate 0.2

LLMLingua resource cost

LLMLingua adds ~2 GB of dependencies (torch, transformers), 10-30s cold start, and ~1 GB RAM. Enable when maximum compression justifies the cost.

API endpoints

GET /health

curl http://localhost:8787/health
{
  "status": "healthy",
  "optimize": true,
  "stats": {
    "total_requests": 42,
    "tokens_saved": 15000,
    "savings_percent": 45.2
  }
}
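A monitoring script can reduce the health payload to a one-line summary. The sketch below uses only the fields shown in the sample response; the summarize_health helper is illustrative, not part of Headroom:

```python
import json

def summarize_health(payload: dict) -> str:
    """One-line summary of a /health response payload."""
    stats = payload.get("stats", {})
    return (
        f"status={payload.get('status')} "
        f"optimize={payload.get('optimize')} "
        f"saved={stats.get('tokens_saved', 0)} tokens "
        f"({stats.get('savings_percent', 0.0)}%)"
    )

# Using the sample response above:
sample = json.loads(
    '{"status": "healthy", "optimize": true,'
    ' "stats": {"total_requests": 42, "tokens_saved": 15000,'
    ' "savings_percent": 45.2}}'
)
print(summarize_health(sample))  # status=healthy optimize=True saved=15000 tokens (45.2%)
```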

GET /stats

Returns live session statistics plus durable persistent_savings totals. Savings are stored at ~/.headroom/proxy_savings.json (override the path with HEADROOM_SAVINGS_PATH).

curl http://localhost:8787/stats

GET /stats-history

Durable history with hourly, daily, weekly, and monthly rollups. Powers the /dashboard view.

curl http://localhost:8787/stats-history
curl "http://localhost:8787/stats-history?format=csv&series=weekly"

GET /metrics

Prometheus-format metrics for monitoring.

curl http://localhost:8787/metrics
headroom_requests_total{mode="optimize"} 1234
headroom_tokens_saved_total 5678900
headroom_compression_ratio_bucket{le="0.5"} 890
headroom_latency_seconds_bucket{le="0.01"} 800
headroom_cache_hits_total 456
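Prometheus normally scrapes this endpoint, but the text exposition format is simple enough to read directly. The sketch below parses metric lines like those shown above; it is a simplification (no HELP/TYPE handling, labels kept verbatim in the key), not a full Prometheus parser:

```python
import re

def parse_metrics(text: str) -> dict:
    """Parse simple Prometheus exposition lines into {metric: value}.

    Simplified: skips blank lines and comments; labels stay in the key.
    """
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = re.match(r"^(\S+)\s+([0-9.eE+-]+)$", line)
        if m:
            out[m.group(1)] = float(m.group(2))
    return out

# Sample lines like those returned by /metrics:
sample = """\
headroom_requests_total{mode="optimize"} 1234
headroom_tokens_saved_total 5678900
headroom_cache_hits_total 456
"""
metrics = parse_metrics(sample)
print(metrics["headroom_tokens_saved_total"])  # 5678900.0
```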

POST /v1/messages

Anthropic API format. The proxy compresses messages, forwards to Anthropic, and returns the response.

POST /v1/chat/completions

OpenAI API format. The proxy compresses messages, forwards to OpenAI, and returns the response.

POST /v1/compress

Compression-only endpoint. Compresses messages without calling any LLM. Used by the TypeScript SDK.

Request:

{
  "messages": [{ "role": "user", "content": "..." }],
  "model": "gpt-4o"
}

Response:

{
  "messages": [{ "role": "user", "content": "..." }],
  "tokens_before": 15000,
  "tokens_after": 3500,
  "tokens_saved": 11500,
  "compression_ratio": 0.23,
  "transforms_applied": ["router:smart_crusher:0.35"],
  "ccr_hashes": ["a1b2c3"]
}

Set the x-headroom-bypass: true request header to skip compression.
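A minimal Python client for this endpoint could look like the following. The request and response fields match the documented schema, while the function names and the savings helper are illustrative (the sketch assumes a proxy running on the default port):

```python
import json
import urllib.request

def compress(messages, model="gpt-4o", base_url="http://localhost:8787"):
    """POST messages to the proxy's compression-only endpoint."""
    body = json.dumps({"messages": messages, "model": model}).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/compress",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def savings_percent(resp: dict) -> float:
    """Percent of tokens saved, derived from the response fields."""
    before = resp["tokens_before"]
    return 100.0 * resp["tokens_saved"] / before if before else 0.0

# With the sample response shown above:
sample = {"tokens_before": 15000, "tokens_after": 3500, "tokens_saved": 11500}
print(round(savings_percent(sample), 1))  # 76.7
```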

Agent wrapping

Use headroom wrap to transparently proxy any CLI tool:

# Claude Code
headroom wrap claude

# OpenAI Codex
headroom wrap codex

# Aider
headroom wrap aider

# Cursor
headroom wrap cursor

Or set the base URL manually:

# Claude Code
ANTHROPIC_BASE_URL=http://localhost:8787 claude

# Cursor / any OpenAI-compatible client
OPENAI_BASE_URL=http://localhost:8787/v1 cursor

Cloud providers

# AWS Bedrock
headroom proxy --backend bedrock --region us-east-1

# Google Vertex AI
headroom proxy --backend vertex_ai --region us-central1

# Azure OpenAI
headroom proxy --backend azure

# OpenRouter (400+ models)
OPENROUTER_API_KEY=sk-or-... headroom proxy --backend openrouter

Environment variables

export HEADROOM_HOST=0.0.0.0
export HEADROOM_PORT=8787
export HEADROOM_BUDGET=100.0
export OPENAI_TARGET_API_URL=https://custom.openai.endpoint.com
headroom proxy

Production deployment

gunicorn

pip install gunicorn

gunicorn headroom.proxy.server:app \
  --workers 4 \
  --bind 0.0.0.0:8787 \
  --worker-class uvicorn.workers.UvicornWorker

Docker

FROM python:3.11-slim
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
    && pip install "headroom-ai[proxy]" \
    && apt-get purge -y build-essential && apt-get autoremove -y \
    && rm -rf /var/lib/apt/lists/*
EXPOSE 8787
CMD ["headroom", "proxy", "--host", "0.0.0.0"]

Build dependencies

build-essential is required at install time because headroom-ai includes hnswlib, a C++ extension compiled from source. It is removed after installation to keep the image slim.
