Headroom

Benchmarks

Compression performance, accuracy preservation, latency overhead, and real-world production telemetry from 250+ Headroom proxy instances.

Headroom's core promise: compress context without losing accuracy. The benchmarks below quantify both sides of that tradeoff.

Compression Performance

Tested on Apple M-series (CPU), Headroom v0.5.18. Each test runs compress() on realistic tool outputs.

| Content Type | Original (tokens) | Compressed (tokens) | Saved | Ratio | Latency |
|---|---|---|---|---|---|
| JSON array (100 items) | 3,163 | 297 | 2,866 | 90.6% | 1ms |
| JSON array (500 items) | 9,526 | 1,614 | 7,912 | 83.1% | 2ms |
| Shell output (200 lines) | 3,238 | 469 | 2,769 | 85.5% | 1ms |
| Build log (200 lines) | 2,412 | 148 | 2,264 | 93.9% | 1ms |
| grep results (150 hits) | 2,624 | 2,624 | 0 | 0.0% | <1ms |
| Python source (~480 lines) | 2,958 | 2,958 | 0 | 0.0% | <1ms |
| Total | 23,921 | 8,110 | 15,811 | 66.1% | 5ms |

Zero compression is intentional

grep results and Python source show 0% compression. These are already compact structured formats. SmartCrusher only compresses JSON arrays; code passes through to preserve correctness.
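The pass-through rule can be sketched with a cheap structural check. Note this is an illustration, not Headroom's actual API: `crush` below is a stand-in for the real SmartCrusher, and real content routing uses more than a JSON parse.

```python
import json

def looks_like_json_array(text: str) -> bool:
    """Cheap structural check: does the payload parse as a JSON array?"""
    stripped = text.strip()
    if not stripped.startswith("["):
        return False
    try:
        return isinstance(json.loads(stripped), list)
    except json.JSONDecodeError:
        return False

def crush(text: str, keep: int = 3) -> str:
    """Stand-in compressor: keep only the first/last `keep` items."""
    items = json.loads(text)
    if len(items) <= 2 * keep:
        return text
    return json.dumps(items[:keep] + items[-keep:])

def route(text: str) -> str:
    """Compress JSON arrays; pass code and other text through unchanged."""
    return crush(text) if looks_like_json_array(text) else text
```

Code and grep output fail the array check, so they reach the model byte-for-byte, which is why those rows show 0% compression.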

Accuracy Benchmarks

HTML Extraction

Dataset: Scrapinghub Article Extraction Benchmark (181 HTML pages with ground truth)

| Metric | Value |
|---|---|
| F1 Score | 0.919 |
| Precision | 0.879 |
| Recall | 0.982 |
| Compression | 94.9% |

For LLM applications, recall is critical -- 98.2% means nearly all article content is preserved. The slight precision drop (some extra content) does not hurt LLM accuracy.

JSON Compression (SmartCrusher)

Test: 100 production log entries with critical error at position 67. Task: find the error, error code, resolution, and affected count.

| Metric | Baseline | Headroom |
|---|---|---|
| Input tokens | 10,144 | 1,260 |
| Correct answers | 4/4 | 4/4 |
| Compression | -- | 87.6% |

SmartCrusher preserves first N items (schema), last N items (recency), all anomalies (errors, warnings), and statistical distribution.
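A minimal sketch of that selection strategy (illustrative only; the real SmartCrusher's anomaly heuristics, field names, and statistical summary are richer than this):

```python
def smart_crush(items: list[dict], head: int = 5, tail: int = 5):
    """Keep head items (schema), tail items (recency), and anomalies;
    report how much was elided. Sketch of the strategy described above."""
    keep_idx = set(range(min(head, len(items))))
    keep_idx |= set(range(max(len(items) - tail, 0), len(items)))
    # Anomaly detection here is just a severity check on a hypothetical
    # "level" field; a real implementation detects outliers more generally.
    keep_idx |= {i for i, x in enumerate(items)
                 if str(x.get("level", "")).lower() in ("error", "warning")}
    kept = [items[i] for i in sorted(keep_idx)]
    summary = {"original_count": len(items),
               "kept_count": len(kept),
               "elided_count": len(items) - len(kept)}
    return kept, summary
```

On the 100-entry log benchmark above, this keeps 11 entries: 5 for schema, 5 for recency, and the lone error at position 67 -- which is why the model still answers 4/4 questions from 87.6% fewer tokens.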

QA Accuracy Preservation

| Metric | Original HTML | Extracted | Delta |
|---|---|---|---|
| F1 Score | 0.85 | 0.87 | +0.02 |
| Exact Match | 60% | 62% | +2% |

Extraction can improve accuracy

Removing HTML noise sometimes helps LLMs focus on relevant content, leading to slightly higher scores on extraction benchmarks.

Latency Overhead

SDK Compression Latency

Measured per-scenario on Apple M-series (CPU):

| Scenario | Tokens In | Tokens Out | Saved | p50 (ms) | p95 (ms) |
|---|---|---|---|---|---|
| JSON: Search Results (100 items) | 10.2K | 1.5K | 8.7K | 189 | 231 |
| JSON: Search Results (500 items) | 50.2K | 1.5K | 48.7K | 943 | 955 |
| JSON: Search Results (1K items) | 100.5K | 1.5K | 99.0K | 2,012 | 2,198 |
| JSON: API Responses (500 items) | 38.9K | 1.1K | 37.8K | 743 | 776 |
| JSON: Database Rows (1K rows) | 43.7K | 605 | 43.1K | 961 | 1,104 |
| JSON: String Array (100 strings) | 1.1K | 231 | 820 | 15 | 15 |
| JSON: String Array (500 strings) | 4.9K | 233 | 4.6K | 72 | 80 |
| JSON: Number Array (200 numbers) | 1.2K | 192 | 1.1K | 31 | 62 |
| JSON: Mixed Array (250 items) | 2.3K | 368 | 1.9K | 38 | 40 |

Cost-Benefit Analysis

Net latency benefit = LLM prefill time saved by sending fewer input tokens, minus the compression overhead (at Claude Sonnet input pricing, $3.00/MTok):

| Scenario | Compress (ms) | LLM Saved (ms) | Net Benefit | Savings per 1K Requests |
|---|---|---|---|---|
| JSON: Search Results (100 items) | 189 | 261 | +72ms | $26 |
| JSON: Search Results (500 items) | 943 | 1,461 | +518ms | $146 |
| JSON: Search Results (1K items) | 2,012 | 2,969 | +957ms | $297 |
| JSON: API Responses (500 items) | 743 | 1,134 | +391ms | $113 |
| JSON: Database Rows (1K rows) | 961 | 1,292 | +331ms | $129 |

Compression pays for itself in latency for 11 of 12 tested scenarios against Claude Sonnet. Slower and more expensive models (Opus) benefit even more.
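The table's arithmetic can be reproduced from two numbers: the $3.00/MTok input price given above, and an implied LLM input throughput of roughly 33 tokens/ms. The throughput is inferred from the "LLM Saved" column itself (e.g. 8.7K tokens ÷ 261ms), not taken from any published spec:

```python
PRICE_PER_MTOK = 3.00   # Claude Sonnet input pricing, as stated above
TOKENS_PER_MS = 33.3    # implied prefill throughput (inferred, not official)

def net_benefit_ms(tokens_saved: int, compress_ms: float) -> float:
    """LLM prefill time avoided, minus the time spent compressing."""
    return tokens_saved / TOKENS_PER_MS - compress_ms

def savings_per_1k_requests(tokens_saved: int) -> float:
    """Dollar savings from the input tokens the LLM never ingests."""
    return tokens_saved / 1_000_000 * PRICE_PER_MTOK * 1_000
```

For the 100-item search-results row, `net_benefit_ms(8_700, 189)` gives roughly +72ms and `savings_per_1k_requests(8_700)` roughly $26, matching the table.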

Pipeline Step Timing

| Step | Median | P90 | Description |
|---|---|---|---|
| pipeline_total | 16.9ms | 289ms | Full compression pipeline |
| content_router | 11.7ms | 259ms | Content detection + routing |
| smart_crusher | 50.1ms | 50ms | JSON array compression |
| text_compressor | 32.0ms | 576ms | Text compression (Kompress ONNX) |
| initial_token_count | 2.9ms | 16ms | Token counting (tiktoken) |
ContentRouter accounts for 91--98% of pipeline cost on average. CacheAligner and RollingWindow are sub-millisecond.
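Per-step percentiles like these can be gathered in your own deployment with ordinary wall-clock instrumentation. A sketch (this is not Headroom's internal telemetry code):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

step_ms: dict[str, list[float]] = defaultdict(list)

@contextmanager
def timed(step: str):
    """Record one pipeline step's wall-clock duration in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        step_ms[step].append((time.perf_counter() - start) * 1000)

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; adequate for latency reporting."""
    ordered = sorted(samples)
    rank = max(int(round(pct / 100 * len(ordered))) - 1, 0)
    return ordered[rank]
```

Usage: wrap each stage (`with timed("content_router"): ...`), then report `percentile(step_ms["content_router"], 90)` after a batch of requests.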

Production Telemetry

Real-world data from 50,000+ proxy sessions across 250+ unique instances (March--April 2026). Collected via anonymous telemetry (opt-out: HEADROOM_TELEMETRY=off).

Proxy Overhead

| Statistic | Latency |
|---|---|
| Median (P50) | 52ms |
| P90 | 309ms |
| P99 | 4,172ms |
| Mean | 161ms |

The median 52ms overhead is negligible compared to LLM inference time (typically 2--10 seconds).

Compression Rate

| Statistic | Compression |
|---|---|
| P25 | 4.8% |
| Median | 4.8% |
| P75 | 6.9% |
| Mean | 11.3% |

Median compression is modest because many requests are short conversational turns. Heavy tool-use sessions (file reads, shell output) see 40--80% compression.

Fleet Summary

| Metric | Value |
|---|---|
| Clean instances | 249 |
| Total tokens saved | 1.4 billion |
| Total savings | ~$4,000 |
| OS distribution | Linux 57%, macOS 38%, Windows 5% |

Reproducing Results

```shell
git clone https://github.com/chopratejas/headroom.git
cd headroom
pip install -e ".[evals,html]"
pytest tests/test_evals/ -v -s
```
