Reversible Compression (CCR)
Compress-Cache-Retrieve architecture that makes compression lossless — the LLM can always get the original data back.
Headroom's CCR (Compress-Cache-Retrieve) architecture makes compression reversible. When content is compressed, the original data is cached locally. If the LLM needs the full data, it retrieves it instantly.
Nothing is ever thrown away
Unlike traditional lossy compression, CCR guarantees that every piece of original data remains accessible. You get 70-90% token savings with zero risk of permanent data loss.
The problem with traditional compression
Traditional compression forces a difficult tradeoff:
- Aggressive compression risks losing data the LLM needs
- Conservative compression misses out on token savings
CCR eliminates this tradeoff entirely. Compress aggressively, retrieve on demand.
Architecture
CCR flows through four phases:
TOOL OUTPUT (1000 items)
-> SmartCrusher compresses to 20 items
-> Original cached with hash=abc123
-> Retrieval tool injected into context
LLM PROCESSING
Option A: LLM solves task with 20 items -> Done (90% savings)
Option B: LLM calls headroom_retrieve(hash=abc123)
-> Response Handler returns full data automatically
Phase 1: Compression Store
When SmartCrusher compresses tool output:
- The original content is stored in an LRU cache
- A hash key is generated for retrieval
- A marker is added to the compressed output:
[1000 items compressed to 20. Retrieve more: hash=abc123]
Phase 2: Tool Injection
Headroom injects a headroom_retrieve tool into the LLM's available tools:
{
"name": "headroom_retrieve",
"description": "Retrieve original uncompressed data from Headroom cache",
"parameters": {
"hash": "The hash key from the compression marker",
"query": "Optional: search within the cached data"
}
}
The LLM sees this tool alongside your application's tools and can call it whenever the compressed data is insufficient.
Phase 3: Response Handler
When the LLM calls headroom_retrieve:
- The Response Handler intercepts the tool call
- Data is retrieved from the local cache (around 1ms)
- The result is added to the conversation
- The API call continues automatically
The client never sees CCR tool calls -- they are handled transparently by Headroom.
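The store-and-retrieve cycle behind Phases 1 and 3 reduces to a hash-keyed LRU cache. A minimal sketch, with illustrative names (the real cache also tracks TTLs and per-strategy metadata):

```python
import hashlib
import json
from collections import OrderedDict

class CompressionStore:
    """Illustrative LRU cache for original payloads -- not the actual Headroom source."""

    def __init__(self, max_entries: int = 1000):
        self.max_entries = max_entries
        self._cache: OrderedDict[str, str] = OrderedDict()

    def store(self, original) -> str:
        """Cache the original payload and return its hash key."""
        blob = json.dumps(original, sort_keys=True)
        key = hashlib.sha256(blob.encode()).hexdigest()[:6]
        self._cache[key] = blob
        self._cache.move_to_end(key)  # mark as most recently used
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict the least-recently used entry
        return key

    def retrieve(self, key: str):
        """Return the original payload, or None if it was evicted."""
        blob = self._cache.get(key)
        if blob is None:
            return None
        self._cache.move_to_end(key)
        return json.loads(blob)
```

When the Response Handler sees a headroom_retrieve call, it looks the hash up in a store like this, appends the result to the conversation as a tool message, and re-issues the API call.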
Phase 4: Context Tracker
Across multiple turns, the Context Tracker maintains awareness of all compressed content:
- Remembers what was compressed in earlier turns
- Analyzes new queries for relevance to compressed content
- Proactively expands relevant data before the LLM asks
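A much-simplified version of that relevance check: keep a keyword summary of each cached entry at compression time, and compare new user queries against it (names and tokenization are illustrative; the real tracker is more sophisticated):

```python
def should_expand(query: str, cached_keywords: dict[str, set[str]]) -> list[str]:
    """Return hash keys whose cached content shares terms with the new query.

    cached_keywords maps hash -> keywords extracted when the content was compressed.
    """
    terms = {w.strip("?.,!").lower() for w in query.split()}
    return [h for h, kws in cached_keywords.items() if terms & kws]
```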
Turn 1: User searches for files
-> 500 files compressed to 15, cached (hash=abc123)
-> LLM answers with 15 files
Turn 5: User asks "What about the auth middleware?"
-> Context Tracker detects "auth" may match cached content
-> Proactively expands compressed data
-> LLM finds auth_middleware.py in the full list
BM25 search within compressed data
The LLM does not have to retrieve everything. It can search within compressed data using the optional query parameter:
{
"name": "headroom_retrieve",
"parameters": {
"hash": "abc123",
"query": "authentication errors"
}
}
This runs a BM25 search over the cached items, returning only the relevant subset instead of the full original payload.
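To make the mechanics concrete, here is a self-contained sketch of classic BM25 ranking over cached items. It uses the standard k1/b formulation; Headroom's actual scoring internals may differ:

```python
import math

def bm25_search(items: list[str], query: str, k1: float = 1.5, b: float = 0.75, top_n: int = 5):
    """Rank cached items against the query with classic BM25 (illustrative sketch)."""
    docs = [item.lower().split() for item in items]
    avg_len = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    scores = []
    for doc in docs:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in docs if term in d)  # document frequency of the term
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            tf = doc.count(term)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [items[i] for i in ranked[:top_n] if scores[i] > 0]
```

Only items that match at least one query term are returned, which is what lets a query like "authentication errors" pull a handful of relevant rows out of a thousand-item cache.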
Retrieving originals
CCR works automatically through the proxy, but you can also retrieve cached data programmatically:
// Illustrative TypeScript sketch -- the exact exports and option names of
// "headroom-ai" may differ; see the package reference for the actual API.
import { compress } from "headroom-ai";
import type { CCRConfig } from "headroom-ai";
// CCR is enabled by default when compressing through the proxy.
const result = await compress(messages, {
  model: "gpt-4o",
});
// Access compressed messages -- CCR markers are embedded automatically
console.log(result.messages);
// CCR configuration options (property names illustrative)
const config: CCRConfig = {
  enabled: true,
  injectTool: true, // Inject headroom_retrieve tool
  addMarkers: true, // Add retrieval markers to compressed output
  learnPatterns: true, // Learn from retrieval patterns
  maxEntries: 1000, // Max cached items
  ttlSeconds: 3600, // Cache TTL (seconds)
};

from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI
client = HeadroomClient(
original_client=OpenAI(),
provider=OpenAIProvider(),
default_mode="optimize",
)
# CCR happens automatically during chat completions.
# The LLM calls headroom_retrieve when it needs more data.
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
)
# CCR is enabled by default. To disable:
# headroom proxy --no-ccr-responses
# To disable proactive expansion:
# headroom proxy --no-ccr-expansion
Message-level CCR
CCR is not limited to tool outputs. When IntelligentContext drops low-importance messages to fit the context budget, those messages are also stored in CCR:
100-message conversation (50K tokens)
-> IntelligentContext scores messages by importance
-> Drops 60 low-scoring messages
-> Dropped messages cached with hash=def456
-> Marker inserted: "60 messages dropped, retrieve: def456"
The marker includes the CCR reference so the LLM can recover earlier context:
[Earlier context compressed: 60 message(s) dropped by importance scoring.
Full content available via ccr_retrieve tool with reference 'def456'.]
When users retrieve dropped messages via CCR, TOIN learns those message patterns are important and scores them higher in future sessions -- improving drop decisions across all users.
CCR-enabled components
| Component | What it compresses | CCR integration |
|---|---|---|
| SmartCrusher | JSON arrays (tool outputs) | Stores original array, marker includes hash |
| ContentRouter | Code, logs, search results, text | Stores original content by strategy |
| IntelligentContext | Messages (conversation turns) | Stores dropped messages, marker includes hash |
Why CCR matters
| Approach | Risk | Savings |
|---|---|---|
| No compression | None | 0% |
| Traditional compression | Data loss | 70-90% |
| CCR compression | None (reversible) | 70-90% |
CCR gives you the savings of aggressive compression with zero risk. The LLM can always retrieve the original data if needed.