Limitations
When Headroom helps, when it does not, and what to watch out for. Honest documentation of compression constraints and safety gates.
Headroom is designed to compress LLM context without losing accuracy. This page documents when it helps, when it does not, and the safety gates that prevent harmful compression.
When Headroom Helps vs. Does Not
| Content Type | Compression | Latency Impact | Best For |
|---|---|---|---|
| JSON: Arrays of dicts (search results, API responses, DB rows) | 86--100% | Net latency win on Sonnet/Opus | Primary use case |
| JSON: Arrays of strings (file paths, log lines, tags) | 60--90% | Net latency win | String dedup + sampling |
| JSON: Arrays of numbers (metrics, time series) | 70--85% | Net latency win | Statistical summary |
| JSON: Mixed-type arrays | 50--70% | Net latency win | Group-by-type compression |
| Structured logs (as JSON) | 82--95% | Net latency win | Log entries in tool outputs |
| Agentic conversations (25--50 turns) | 56--81% | Break-even to net win | Multi-tool agent sessions |
| Plain text (documentation, articles) | 43--46% | Adds latency (cost savings only) | Cost optimization |
| Code | Passthrough | Minimal overhead | See below |
| RAG document contexts | Passthrough | Minimal overhead | Not compressed |
Where Headroom Adds the Most Value
- Long agent sessions with accumulated tool outputs (40--80% compression)
- JSON-heavy workflows -- API responses, database queries (83--94% compression)
- Build and test output (85--94% compression)
- Multi-tool agents (60--76% compression across tool results)
Where Headroom Adds Little Value
- Short conversational exchanges (median 4.8% compression)
- Code-only sessions (reading/writing files) -- code passes through
- Single-turn requests with no accumulated context
What Headroom Does NOT Compress
- Short messages (< 300 tokens) -- overhead exceeds savings
- Source code -- passes through unchanged to preserve correctness
- grep/search results -- compact structured format, already minimal
- Images -- counted at fixed token cost (~1,600 tokens), not compressed
- System prompts -- preserved for prefix cache compatibility
Code Compression
Headroom includes an AST-aware CodeCompressor (tree-sitter, 8 languages) but it is gated behind safety protections that prevent it from firing in most real-world scenarios. This is intentional.
Why code mostly passes through:
- Word count gate: Content under 50 words is silently skipped
- Recent code protection (`protect_recent_code=4`): Code in the last 4 messages is never compressed
- Analysis intent protection (`protect_analysis_context=True`): If the most recent user message contains keywords like "analyze", "review", "explain", "fix", or "debug" -- ALL code in the conversation is protected
Why this is the right default: Code is almost always fetched because the user wants to work with it. Compressing function bodies would remove exactly what they need.
Where code savings come from: The IntelligentContextManager drops old code messages that are no longer relevant (scoring-based), which is a better strategy than stripping function bodies.
Override: Set `protect_analysis_context=False` in `ContentRouterConfig` for aggressive code compression. Requires `headroom-ai[code]` for tree-sitter.
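A minimal sketch of that override follows. Only `ContentRouterConfig` and the `protect_analysis_context` flag come from this page; the import path is an assumption -- check your installed version for the actual module layout.

```python
# Sketch: enable aggressive code compression (requires headroom-ai[code]).
# The import path below is assumed and may differ in your version.
from headroom import ContentRouterConfig

config = ContentRouterConfig(
    protect_analysis_context=False,  # compress code even when the latest
                                     # user message asks to analyze/fix it
)
```

With this flag off, the word-count gate and recent-code protection still apply; only the intent-keyword protection is disabled.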
JSON Compression Constraints
What Gets Compressed
- Arrays of dicts: Full statistical analysis with adaptive K (Kneedle algorithm)
- Arrays of strings: Dedup + adaptive sampling + error preservation
- Arrays of numbers: Statistical summary + outlier/change-point preservation
- Mixed-type arrays: Grouped by type, each group compressed independently
- Nested objects: Recursed into, arrays within are compressed (up to depth 5)
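To illustrate the mixed-type strategy, here is a small self-contained sketch (not Headroom's implementation) that partitions an array by element type so each group can be summarized with the strategy suited to it:

```python
from collections import defaultdict

def group_by_type(items):
    """Partition a mixed-type array so each group can be compressed
    independently (dicts, strings, numbers, bools). Sketch only."""
    groups = defaultdict(list)
    for item in items:
        # bool is a subclass of int in Python; keep bools separate
        if isinstance(item, bool):
            groups["bool"].append(item)
        elif isinstance(item, (int, float)):
            groups["number"].append(item)
        elif isinstance(item, str):
            groups["string"].append(item)
        elif isinstance(item, dict):
            groups["dict"].append(item)
        else:
            groups["other"].append(item)
    return dict(groups)

mixed = [1, "a", {"id": 2}, 3.5, True, "b"]
groups = group_by_type(mixed)
# numbers, strings, and dicts can now go to their own compressors
```

Groups that fall below `min_items_to_analyze` would then be kept as-is, matching the passthrough rules below.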
What Passes Through
- Arrays below 5 items (`min_items_to_analyze`)
- Content below 200 tokens (`min_tokens_to_crush`)
- Bool-only arrays
- JSON objects without array values
- Malformed JSON (silently passes through, no error)
Edge Cases
- NaN/Infinity in numeric fields: Filtered out before statistics are computed
- Nesting depth > 5: Inner arrays not examined for compression
- Mixed-type arrays with small groups: Groups below `min_items_to_analyze` are kept as-is
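The NaN/Infinity edge case can be sketched as a pre-filter applied before any statistics are computed (a simplified stand-in, not Headroom's internal code):

```python
import math

def finite_only(values):
    """Drop NaN/Infinity so downstream statistics stay well-defined.
    Sketch of the edge-case behavior described above."""
    return [v for v in values if math.isfinite(v)]

samples = [1.0, float("nan"), 2.0, float("inf"), 3.0]
clean = finite_only(samples)    # [1.0, 2.0, 3.0]
mean = sum(clean) / len(clean)  # safe: no NaN propagation
```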
Safety Gates
All compressors follow the same principle: fail gracefully, return original content unchanged.
- Invalid JSON passes through (no error raised)
- AST parse failure falls back to original or LLMLingua
- Compression that makes output larger returns the original
- Missing optional dependencies (tree-sitter, LLMLingua) result in passthrough with a warning logged
- Errors are logged at WARNING level and never propagated to callers
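The fail-graceful principle can be sketched as a wrapper: any exception, and any result that is not strictly smaller, yields the original content. This is an illustrative pattern, not Headroom's actual code:

```python
import logging

logger = logging.getLogger("compression")

def safe_compress(content: str, compressor) -> str:
    """Return compressed content only if compression succeeds AND shrinks
    the payload; otherwise pass the original through unchanged."""
    try:
        result = compressor(content)
    except Exception as exc:
        # Errors are logged at WARNING level and never propagated
        logger.warning("compression failed, passing through: %s", exc)
        return content
    # Compression that makes output larger returns the original
    return result if len(result) < len(content) else content

shrunk = safe_compress("abcdef", lambda s: s[:3])  # "abc": compression kept
kept   = safe_compress("abc", lambda s: s * 2)     # grew: original returned
```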
One exception
LLMLingua out-of-memory during model loading raises a RuntimeError. All other failures are handled silently.
Adaptive K: How Item Retention Works
SmartCrusher does not use fixed K values. It uses information-theoretic sizing:
- Kneedle algorithm on bigram coverage curves finds the point where adding more items stops providing new information
- SimHash fingerprinting detects near-duplicate items
- zlib validation ensures the subset captures the full set's diversity
The resulting K is split: 30% from array start, 15% from end, 55% for importance-scored items.
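The 30/15/55 split can be sketched as index selection over an array of N items with a retention budget K. This is a simplification: the real selection also uses SimHash deduplication and zlib coverage validation, which are omitted here.

```python
def split_budget(n_items: int, k: int):
    """Allocate a retention budget K: ~30% from the array start, ~15%
    from the end, and the remainder (~55%) reserved for
    importance-scored middle items. Sketch only."""
    head = max(1, round(k * 0.30))
    tail = max(1, round(k * 0.15))
    scored = k - head - tail  # slots left for importance-scored items
    head_idx = list(range(min(head, n_items)))
    tail_idx = list(range(max(n_items - tail, 0), n_items))
    return head_idx, tail_idx, scored

head, tail, scored = split_budget(n_items=100, k=20)
# head -> first 6 indices, tail -> last 3, 11 slots for scored items
```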
Safety guarantees (additive, never dropped):
- Error items (containing "error", "exception", "failed", "critical") -- across ALL array types
- Numeric anomalies (> 2 standard deviations from mean)
- String length anomalies (> 2 standard deviations from mean length)
- Change points (sudden shifts in running values)
These are kept even if they exceed the K budget.
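The additive guarantees can be sketched as a post-pass that forces certain items into the kept set regardless of budget. The keyword list and the 2-standard-deviation threshold come from the description above; everything else is illustrative, not Headroom's implementation:

```python
import statistics

ERROR_MARKERS = ("error", "exception", "failed", "critical")

def must_keep(items, variance_threshold=2.0):
    """Indices that are always retained: error-bearing strings and
    numeric anomalies beyond `variance_threshold` std devs. Sketch of
    the additive safety pass."""
    keep = set()
    numbers = [(i, v) for i, v in enumerate(items)
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if len(numbers) >= 2:
        values = [v for _, v in numbers]
        mean, std = statistics.fmean(values), statistics.pstdev(values)
        for i, v in numbers:
            if std and abs(v - mean) > variance_threshold * std:
                keep.add(i)  # numeric anomaly: always preserved
    for i, v in enumerate(items):
        if isinstance(v, str) and any(m in v.lower() for m in ERROR_MARKERS):
            keep.add(i)  # error item: always preserved
    return keep

logs = ["ok", "ok", "ERROR: timeout", 1, 2, 1, 2, 1, 2, 1, 2, 1, 500]
# must_keep(logs) retains the error line and the 500 outlier
```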
Configuration Tuning
| Parameter | Default | Effect |
|---|---|---|
| `min_items_to_analyze` | 5 | Arrays below this pass through |
| `min_tokens_to_crush` | 200 | Content below this passes through |
| `max_items_after_crush` | 15 | Upper bound on retained items |
| `variance_threshold` | 2.0 | Std devs for anomaly detection (lower = more preserved) |
| `protect_analysis_context` | True | Protect code when the user asks about it |
| `protect_recent_code` | 4 | Number of most-recent messages whose code is never compressed |
| `skip_user_messages` | True | Never compress user messages |
| `toin_confidence_threshold` | 0.3 | Minimum TOIN confidence to apply hints |
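A hypothetical tuning for noisy numeric telemetry might raise the gates and preserve more anomalies. The parameter names below are from the table above; which config object they live on (SmartCrusher vs. the router config) depends on your installed version, so this is shown as a plain keyword set:

```python
# Hypothetical tuning sketch -- parameter names from the table above.
tuning = dict(
    min_items_to_analyze=10,   # ignore small arrays entirely
    min_tokens_to_crush=400,   # only compress larger payloads
    max_items_after_crush=25,  # retain more items per array
    variance_threshold=1.5,    # lower threshold -> more anomalies preserved
)
```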
Provider Interactions
- CacheAligner maximizes Anthropic/OpenAI prefix cache hit rates
- Token counting uses model-specific tokenizers (tiktoken for OpenAI, calibrated estimation for Anthropic)
- Compression works with all providers -- no provider-specific limitations
- Compressed content is valid JSON -- downstream tools and parsers work unchanged
TOIN Cold Start
The Tool Output Intelligence Network (TOIN) learns compression patterns from usage. For new tool types:
- No learned patterns exist -- falls back to statistical heuristics
- Confidence below `toin_confidence_threshold` (default 0.3) -- TOIN hints are ignored
- Patterns build up over time as tools are used repeatedly
- Cross-session learning requires persistence (`TelemetryConfig.storage_path`)
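Enabling cross-session learning might look like the following sketch. Only `TelemetryConfig.storage_path` is documented on this page; the import path and the file location are assumptions:

```python
# Sketch: persist TOIN telemetry so learned patterns survive restarts.
# Import path assumed; storage_path is documented above, the location
# shown is hypothetical.
from headroom import TelemetryConfig

telemetry = TelemetryConfig(
    storage_path="~/.headroom/toin.db",  # hypothetical location
)
```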