Benchmarks
Compression performance, accuracy preservation, latency overhead, and real-world production telemetry from 250+ Headroom proxy instances.
Headroom's core promise: compress context without losing accuracy. This page covers compression benchmarks, accuracy evaluations, latency overhead, and production telemetry.
Compression Performance
Tested on Apple M-series (CPU), Headroom v0.5.18. Each test runs compress() on realistic tool outputs.
| Content Type | Original (tokens) | Compressed (tokens) | Saved (tokens) | Ratio | Latency |
|---|---|---|---|---|---|
| JSON array (100 items) | 3,163 | 297 | 2,866 | 90.6% | 1ms |
| JSON array (500 items) | 9,526 | 1,614 | 7,912 | 83.1% | 2ms |
| Shell output (200 lines) | 3,238 | 469 | 2,769 | 85.5% | 1ms |
| Build log (200 lines) | 2,412 | 148 | 2,264 | 93.9% | 1ms |
| grep results (150 hits) | 2,624 | 2,624 | 0 | 0.0% | <1ms |
| Python source (~480 lines) | 2,958 | 2,958 | 0 | 0.0% | <1ms |
| Total | 23,921 | 8,110 | 15,811 | 66.1% | 5ms |
Zero compression is intentional
grep results and Python source show 0% compression. These are already compact structured formats. SmartCrusher only compresses JSON arrays; code passes through to preserve correctness.
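The routing rule behind this pass-through can be sketched as a minimal example. The function names, the head/tail budget, and the stand-in compressor below are illustrative assumptions, not Headroom's actual internals:

```python
import json

def looks_like_json_array(text: str) -> bool:
    """Heuristic: payload parses as JSON with a top-level list."""
    try:
        return isinstance(json.loads(text), list)
    except (json.JSONDecodeError, ValueError):
        return False

def compress_array(text: str) -> str:
    """Stand-in compressor: keep the first 3 and last 3 items."""
    items = json.loads(text)
    if len(items) <= 6:
        return text
    return json.dumps(items[:3] + items[-3:])

def route(text: str) -> str:
    """JSON arrays go to the array compressor; everything else
    (code, grep output, free text) passes through unchanged."""
    if looks_like_json_array(text):
        return compress_array(text)
    return text
```

Passing code through unchanged is the conservative choice: a lossy rewrite of source code risks breaking the very correctness the model is asked to reason about.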
Accuracy Benchmarks
HTML Extraction
Dataset: Scrapinghub Article Extraction Benchmark (181 HTML pages with ground truth)
| Metric | Value |
|---|---|
| F1 Score | 0.919 |
| Precision | 0.879 |
| Recall | 0.982 |
| Compression | 94.9% |
For LLM applications, recall is critical -- 98.2% means nearly all article content is preserved. The slight precision drop (some extra content) does not hurt LLM accuracy.
JSON Compression (SmartCrusher)
Test: 100 production log entries with critical error at position 67. Task: find the error, error code, resolution, and affected count.
| Metric | Baseline | Headroom |
|---|---|---|
| Input tokens | 10,144 | 1,260 |
| Correct answers | 4/4 | 4/4 |
| Compression | -- | 87.6% |
SmartCrusher preserves first N items (schema), last N items (recency), all anomalies (errors, warnings), and statistical distribution.
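That retention policy (minus the statistical summary) can be sketched as follows. The `crush` function, the `level` field convention, and the anomaly marker set are assumptions for illustration:

```python
ANOMALY_LEVELS = {"ERROR", "WARN", "CRITICAL"}  # assumed marker set

def crush(items: list[dict], head: int = 5, tail: int = 5) -> list[dict]:
    """Keep the first `head` items (schema), the last `tail` items
    (recency), and every anomaly in between, preserving order."""
    if len(items) <= head + tail:
        return items
    keep = set(range(head)) | set(range(len(items) - tail, len(items)))
    for i, item in enumerate(items):
        if str(item.get("level", "")).upper() in ANOMALY_LEVELS:
            keep.add(i)
    return [items[i] for i in sorted(keep)]
```

With 100 INFO-level entries and one ERROR at position 67, this keeps 11 items: the 5-item head, the 5-item tail, and the error, which is why the benchmark's "find the error" task survives an 87.6% reduction.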
QA Accuracy Preservation
| Metric | Original HTML | Extracted | Delta |
|---|---|---|---|
| F1 Score | 0.85 | 0.87 | +0.02 |
| Exact Match | 60% | 62% | +2% |
Extraction can improve accuracy
Removing HTML noise sometimes helps LLMs focus on relevant content, leading to slightly higher scores on extraction benchmarks.
Latency Overhead
SDK Compression Latency
Measured per-scenario on Apple M-series (CPU):
| Scenario | Tokens In | Tokens Out | Saved | p50 (ms) | p95 (ms) |
|---|---|---|---|---|---|
| JSON: Search Results (100 items) | 10.2K | 1.5K | 8.7K | 189 | 231 |
| JSON: Search Results (500 items) | 50.2K | 1.5K | 48.7K | 943 | 955 |
| JSON: Search Results (1K items) | 100.5K | 1.5K | 99.0K | 2,012 | 2,198 |
| JSON: API Responses (500 items) | 38.9K | 1.1K | 37.8K | 743 | 776 |
| JSON: Database Rows (1K rows) | 43.7K | 605 | 43.1K | 961 | 1,104 |
| JSON: String Array (100 strings) | 1.1K | 231 | 820 | 15 | 15 |
| JSON: String Array (500 strings) | 4.9K | 233 | 4.6K | 72 | 80 |
| JSON: Number Array (200 numbers) | 1.2K | 192 | 1.1K | 31 | 62 |
| JSON: Mixed Array (250 items) | 2.3K | 368 | 1.9K | 38 | 40 |
Cost-Benefit Analysis
Net latency benefit = LLM time saved from fewer tokens minus compression overhead (at Claude Sonnet pricing, $3.0/MTok):
| Scenario | Compress (ms) | LLM Saved (ms) | Net Benefit | Savings per 1K Requests |
|---|---|---|---|---|
| JSON: Search Results (100 items) | 189 | 261 | +72ms | $26 |
| JSON: Search Results (500 items) | 943 | 1,461 | +518ms | $146 |
| JSON: Search Results (1K items) | 2,012 | 2,969 | +957ms | $297 |
| JSON: API Responses (500 items) | 743 | 1,134 | +391ms | $113 |
| JSON: Database Rows (1K rows) | 961 | 1,292 | +331ms | $129 |
Compression pays for itself in latency for 11 of 12 tested scenarios against Claude Sonnet. Slower and more expensive models (Opus) benefit even more.
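The table's arithmetic can be reproduced directly. The two helpers below are hypothetical, using the $3.0/MTok input price stated above:

```python
PRICE_PER_MTOK = 3.0  # Claude Sonnet input pricing, $/million tokens

def net_latency_ms(llm_saved_ms: float, compress_ms: float) -> float:
    """Positive when compression is a net latency win."""
    return llm_saved_ms - compress_ms

def dollars_saved_per_1k_requests(tokens_saved: int) -> float:
    """Input-token cost avoided across 1,000 requests."""
    return tokens_saved * (PRICE_PER_MTOK / 1_000_000) * 1_000

# Search Results (500 items) row: 48.7K tokens saved,
# 943ms compression, 1,461ms of LLM prefill avoided.
net = net_latency_ms(1_461, 943)          # +518ms
cost = dollars_saved_per_1k_requests(48_700)  # ~$146
```

Note the latency side depends on the target model's prefill speed: the slower (and pricier) the model, the larger both columns get, which is why Opus-class models benefit more.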
Pipeline Step Timing
| Step | Median | P90 | Description |
|---|---|---|---|
| pipeline_total | 16.9ms | 289ms | Full compression pipeline |
| content_router | 11.7ms | 259ms | Content detection + routing |
| smart_crusher | 50.1ms | 50ms | JSON array compression |
| text_compressor | 32.0ms | 576ms | Text compression (Kompress ONNX) |
| initial_token_count | 2.9ms | 16ms | Token counting (tiktoken) |
ContentRouter accounts for 91--98% of pipeline cost on average. CacheAligner and RollingWindow are sub-millisecond.
Production Telemetry
Real-world data from 50,000+ proxy sessions across 250+ unique instances (March--April 2026). Collected via anonymous telemetry (opt-out: HEADROOM_TELEMETRY=off).
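Opting out is a single environment variable, set before starting the proxy:

```shell
# Disable anonymous telemetry for this shell session
export HEADROOM_TELEMETRY=off
```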
Proxy Overhead
| Percentile | Latency |
|---|---|
| Median (P50) | 52ms |
| P90 | 309ms |
| P99 | 4,172ms |
| Mean | 161ms |
The median 52ms overhead is negligible compared to LLM inference time (typically 2--10 seconds).
Compression Rate
| Percentile | Compression |
|---|---|
| P25 | 4.8% |
| Median | 4.8% |
| P75 | 6.9% |
| Mean | 11.3% |
Median compression is modest because many requests are short conversational turns. Heavy tool-use sessions (file reads, shell output) see 40--80% compression.
Fleet Summary
| Metric | Value |
|---|---|
| Clean instances | 249 |
| Total tokens saved | 1.4 billion |
| Total savings | ~$4,000 |
| OS distribution | Linux 57%, macOS 38%, Windows 5% |
Reproducing Results
```bash
git clone https://github.com/chopratejas/headroom.git
cd headroom
pip install -e ".[evals,html]"
pytest tests/test_evals/ -v -s
```