Limitations

When Headroom helps, when it does not, and what to watch out for. Honest documentation of compression constraints and safety gates.

Headroom is designed to compress LLM context without losing accuracy. This page documents when it helps, when it does not, and the safety gates that prevent harmful compression.

When Headroom Helps vs. When It Does Not

| Content Type | Compression | Latency Impact | Best For |
| --- | --- | --- | --- |
| JSON: Arrays of dicts (search results, API responses, DB rows) | 86--100% | Net latency win on Sonnet/Opus | Primary use case |
| JSON: Arrays of strings (file paths, log lines, tags) | 60--90% | Net latency win | String dedup + sampling |
| JSON: Arrays of numbers (metrics, time series) | 70--85% | Net latency win | Statistical summary |
| JSON: Mixed-type arrays | 50--70% | Net latency win | Group-by-type compression |
| Structured logs (as JSON) | 82--95% | Net latency win | Log entries in tool outputs |
| Agentic conversations (25--50 turns) | 56--81% | Break-even to net win | Multi-tool agent sessions |
| Plain text (documentation, articles) | 43--46% | Adds latency (cost savings only) | Cost optimization |
| Code | Passthrough | Minimal overhead | See below |
| RAG document contexts | Passthrough | Minimal overhead | Not compressed |

Where Headroom Adds the Most Value

  • Long agent sessions with accumulated tool outputs (40--80% compression)
  • JSON-heavy workflows -- API responses, database queries (83--94% compression)
  • Build and test output (85--94% compression)
  • Multi-tool agents (60--76% compression across tool results)

Where Headroom Adds Little Value

  • Short conversational exchanges (median 4.8% compression)
  • Code-only sessions (reading/writing files) -- code passes through
  • Single-turn requests with no accumulated context

What Headroom Does NOT Compress

  • Short messages (< 300 tokens) -- overhead exceeds savings
  • Source code -- passes through unchanged to preserve correctness
  • grep/search results -- compact structured format, already minimal
  • Images -- counted at fixed token cost (~1,600 tokens), not compressed
  • System prompts -- preserved for prefix cache compatibility

Code Compression

Headroom includes an AST-aware CodeCompressor (tree-sitter, 8 languages) but it is gated behind safety protections that prevent it from firing in most real-world scenarios. This is intentional.

Why code mostly passes through:

  1. Word count gate: Content under 50 words is silently skipped
  2. Recent code protection (protect_recent_code=4): Code in the last 4 messages is never compressed
  3. Analysis intent protection (protect_analysis_context=True): If the most recent user message contains keywords like "analyze", "review", "explain", "fix", "debug" -- ALL code in the conversation is protected
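The three gates can be sketched as a single predicate. The thresholds and keywords below come from this page, but the function name and the message shape are illustrative, not the real Headroom API:

```python
# Sketch of the three code-compression gates; thresholds come from this
# page, but the function and message shapes are illustrative.
ANALYSIS_KEYWORDS = {"analyze", "review", "explain", "fix", "debug"}

def should_compress_code(code: str, messages: list,
                         position_from_end: int,
                         protect_recent_code: int = 4,
                         protect_analysis_context: bool = True) -> bool:
    if len(code.split()) < 50:                    # gate 1: word-count gate
        return False
    if position_from_end < protect_recent_code:   # gate 2: recent code
        return False
    if protect_analysis_context:                  # gate 3: analysis intent
        last_user = next((m["content"] for m in reversed(messages)
                          if m["role"] == "user"), "")
        if any(k in last_user.lower() for k in ANALYSIS_KEYWORDS):
            return False
    return True
```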

Why this is the right default: Code is almost always fetched because the user wants to work with it. Compressing function bodies would remove exactly what they need.

Where code savings come from: The IntelligentContextManager drops old code messages that are no longer relevant (scoring-based), which is a better strategy than stripping function bodies.

Override: Set protect_analysis_context=False in ContentRouterConfig for aggressive code compression. Requires headroom-ai[code] for tree-sitter.
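A minimal sketch of the override. The field names and defaults come from this page, but this local dataclass only mirrors the real ContentRouterConfig shipped in headroom-ai, whose import path and full field list may differ:

```python
from dataclasses import dataclass

# Local mirror of the documented ContentRouterConfig fields; the real
# class ships in headroom-ai and may differ in shape.
@dataclass
class ContentRouterConfig:
    protect_analysis_context: bool = True
    protect_recent_code: int = 4
    skip_user_messages: bool = True

# Aggressive code compression: disable the analysis-intent protection.
# (tree-sitter support additionally requires `pip install headroom-ai[code]`)
aggressive = ContentRouterConfig(protect_analysis_context=False)
```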

JSON Compression Constraints

What Gets Compressed

  • Arrays of dicts: Full statistical analysis with adaptive K (Kneedle algorithm)
  • Arrays of strings: Dedup + adaptive sampling + error preservation
  • Arrays of numbers: Statistical summary + outlier/change-point preservation
  • Mixed-type arrays: Grouped by type, each group compressed independently
  • Nested objects: Recursed into, arrays within are compressed (up to depth 5)

What Passes Through

  • Arrays below 5 items (min_items_to_analyze)
  • Content below 200 tokens (min_tokens_to_crush)
  • Bool-only arrays
  • JSON objects without array values
  • Malformed JSON (silently passes through, no error)
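The pass-through rules above can be sketched as one gate function. The thresholds are the documented defaults; the ~4 chars/token estimate and the function itself are illustrative simplifications:

```python
import json

def should_crush(raw: str, min_items_to_analyze: int = 5,
                 min_tokens_to_crush: int = 200) -> bool:
    # Pass-through gates from this page; token counting is simplified
    # to a ~4 chars/token estimate for illustration.
    if len(raw) // 4 < min_tokens_to_crush:
        return False                # below min_tokens_to_crush
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False                # malformed JSON passes through silently
    if isinstance(data, list):
        if len(data) < min_items_to_analyze:
            return False            # array too small to analyze
        if all(isinstance(x, bool) for x in data):
            return False            # bool-only arrays pass through
    if isinstance(data, dict) and not any(
            isinstance(v, list) for v in data.values()):
        return False                # objects without array values
    return True
```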

Edge Cases

  • NaN/Infinity in numeric fields: Filtered out before statistics are computed
  • Nesting depth > 5: Inner arrays not examined for compression
  • Mixed-type arrays with small groups: Groups below min_items_to_analyze are kept as-is
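For example, non-finite values are dropped before any statistics run (a minimal sketch):

```python
import math

# NaN/Infinity are filtered out before statistics are computed.
values = [1.0, float("nan"), 2.0, float("inf"), 3.0]
finite = [v for v in values if math.isfinite(v)]
mean = sum(finite) / len(finite)   # computed over [1.0, 2.0, 3.0]
```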

Safety Gates

All compressors follow the same principle: fail gracefully and return the original content unchanged.

  • Invalid JSON passes through (no error raised)
  • AST parse failure falls back to original or LLMLingua
  • Compression that makes output larger returns the original
  • Missing optional dependencies (tree-sitter, LLMLingua) cause a passthrough with warning log
  • Errors are logged at WARNING level and never propagated to callers
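The pattern behind these gates can be sketched as a wrapper (illustrative; not the actual Headroom internals):

```python
import logging

logger = logging.getLogger("headroom")  # logger name is illustrative

def safe_compress(raw: str, compress) -> str:
    """Fail-graceful wrapper: any error, or an output that is not
    smaller than the input, returns the original content unchanged."""
    try:
        out = compress(raw)
        return out if len(out) < len(raw) else raw
    except Exception as exc:
        logger.warning("compression failed, passing through: %s", exc)
        return raw
```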

One exception

LLMLingua out-of-memory during model loading raises a RuntimeError. All other failures are handled internally without raising.

Adaptive K: How Item Retention Works

SmartCrusher does not use fixed K values. It uses information-theoretic sizing:

  1. Kneedle algorithm on bigram coverage curves finds the point where adding more items stops providing new information
  2. SimHash fingerprinting detects near-duplicate items
  3. zlib validation ensures the subset captures the full set's diversity
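The coverage-knee idea behind step 1 can be sketched as follows. This is a simplified stand-in for Kneedle: pick the point of maximum gap between the bigram-coverage curve and the diagonal; the real implementation is more involved:

```python
# Simplified knee detection on a bigram coverage curve (illustrative).
def bigrams(s: str) -> set:
    return {s[i:i + 2] for i in range(len(s) - 1)}

def knee_k(items: list) -> int:
    seen = set()
    coverage = []
    for it in items:
        seen |= bigrams(it)
        coverage.append(len(seen))     # cumulative unique bigrams
    total = coverage[-1] or 1
    n = len(items)
    # Knee = largest gap between normalized coverage and position:
    # past this point, extra items add little new information.
    best_i = max(range(n),
                 key=lambda i: coverage[i] / total - (i + 1) / n)
    return best_i + 1
```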

The resulting K is split: 30% from array start, 15% from end, 55% for importance-scored items.
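The split works out like this. The 30/15/55 ratios are from this page; the rounding scheme is an assumption:

```python
def split_k(k: int) -> tuple:
    # 30% from the array start, 15% from the end, remainder for
    # importance-scored items (rounding scheme is an assumption).
    head = round(0.30 * k)
    tail = round(0.15 * k)
    return head, tail, k - head - tail

# e.g. split_k(20) → (6, 3, 11)
```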

Safety guarantees (additive, never dropped):

  • Error items (containing "error", "exception", "failed", "critical") -- across ALL array types
  • Numeric anomalies (> 2 standard deviations from mean)
  • String length anomalies (> 2 standard deviations from mean length)
  • Change points (sudden shifts in running values)

These are kept even if they exceed the K budget.
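The variance-based rules can be sketched as follows. The threshold matches the documented variance_threshold default of 2.0; the function itself is illustrative:

```python
import statistics

def anomalies(values: list, threshold: float = 2.0) -> list:
    # Items more than `threshold` standard deviations from the mean
    # are always retained, even beyond the K budget.
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []
    return [v for v in values if abs(v - mean) / sd > threshold]
```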

Configuration Tuning

| Parameter | Default | Effect |
| --- | --- | --- |
| min_items_to_analyze | 5 | Arrays below this pass through |
| min_tokens_to_crush | 200 | Content below this passes through |
| max_items_after_crush | 15 | Upper bound on retained items |
| variance_threshold | 2.0 | Std devs for anomaly detection (lower = more preserved) |
| protect_analysis_context | True | Protect code when the user asks about it |
| protect_recent_code | 4 | Number of most-recent messages whose code is never compressed |
| skip_user_messages | True | Never compress user messages |
| toin_confidence_threshold | 0.3 | Minimum TOIN confidence to apply hints |

Provider Interactions

  • CacheAligner maximizes Anthropic/OpenAI prefix cache hit rates
  • Token counting uses model-specific tokenizers (tiktoken for OpenAI, calibrated estimation for Anthropic)
  • Compression works with all providers -- no provider-specific limitations
  • Compressed content is valid JSON -- downstream tools and parsers work unchanged

TOIN Cold Start

The Tool Output Intelligence Network (TOIN) learns compression patterns from usage. For new tool types:

  • No learned patterns exist -- falls back to statistical heuristics
  • Confidence below toin_confidence_threshold (default 0.3) -- TOIN hints ignored
  • Patterns build up over time as tools are used repeatedly
  • Cross-session learning requires persistence (TelemetryConfig.storage_path)
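Cold-start behavior reduces to a simple decision. The names here are illustrative; the 0.3 threshold is the documented default:

```python
def choose_strategy(has_learned_pattern: bool, confidence: float,
                    toin_confidence_threshold: float = 0.3) -> str:
    # New tool types have no learned patterns; low-confidence hints
    # are ignored. Both cases fall back to statistical heuristics.
    if has_learned_pattern and confidence >= toin_confidence_threshold:
        return "toin-hints"
    return "statistical-heuristics"
```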
