Headroom

Reversible Compression (CCR)

Compress-Cache-Retrieve architecture that makes compression lossless — the LLM can always get the original data back.

Headroom's CCR (Compress-Cache-Retrieve) architecture makes compression reversible. When content is compressed, the original data is cached locally. If the LLM needs the full data, it retrieves it instantly.

Nothing is ever thrown away

Unlike traditional lossy compression, CCR guarantees that every piece of original data remains accessible. You get 70-90% token savings with zero risk of permanent data loss.

The problem with traditional compression

Traditional compression forces a difficult tradeoff:

  • Aggressive compression risks losing data the LLM needs
  • Conservative compression misses out on token savings

CCR eliminates this tradeoff entirely. Compress aggressively, retrieve on demand.

Architecture

CCR flows through four phases:

TOOL OUTPUT (1000 items)
  -> SmartCrusher compresses to 20 items
  -> Original cached with hash=abc123
  -> Retrieval tool injected into context

LLM PROCESSING
  Option A: LLM solves task with 20 items -> Done (90% savings)
  Option B: LLM calls headroom_retrieve(hash=abc123)
            -> Response Handler returns full data automatically

Phase 1: Compression Store

When SmartCrusher compresses tool output:

  1. The original content is stored in an LRU cache
  2. A hash key is generated for retrieval
  3. A marker is added to the compressed output:
[1000 items compressed to 20. Retrieve more: hash=abc123]
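The steps above can be sketched in Python. The class name `CompressionStore` and its methods are hypothetical (not Headroom's actual API); the sketch assumes an `OrderedDict`-backed LRU and a truncated SHA-256 digest as the hash key:

```python
import hashlib
from collections import OrderedDict

class CompressionStore:
    """Hypothetical sketch of an LRU store for original tool outputs."""

    def __init__(self, max_items: int = 1000):
        self.max_items = max_items
        self._cache = OrderedDict()

    def put(self, content: str) -> str:
        # A short content hash serves as the retrieval key.
        key = hashlib.sha256(content.encode()).hexdigest()[:6]
        self._cache[key] = content
        self._cache.move_to_end(key)          # mark as most recently used
        if len(self._cache) > self.max_items:
            self._cache.popitem(last=False)   # evict least recently used
        return key

    def get(self, key: str):
        if key in self._cache:
            self._cache.move_to_end(key)      # refresh LRU position
        return self._cache.get(key)

store = CompressionStore()
key = store.put('[{"id": 1}, {"id": 2}]')
marker = f"[1000 items compressed to 20. Retrieve more: hash={key}]"
```

The marker travels with the compressed output, so the LLM always has the key it needs to recover the original.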

Phase 2: Tool Injection

Headroom injects a headroom_retrieve tool into the LLM's available tools:

{
  "name": "headroom_retrieve",
  "description": "Retrieve original uncompressed data from Headroom cache",
  "parameters": {
    "hash": "The hash key from the compression marker",
    "query": "Optional: search within the cached data"
  }
}

The LLM sees this tool alongside your application's tools and can call it whenever the compressed data is insufficient.
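Assuming OpenAI-style function schemas, the injection step can be sketched like this; the `inject_retrieve_tool` helper is illustrative, not Headroom's actual API:

```python
# Hypothetical sketch of injecting headroom_retrieve into the tool list,
# using the OpenAI function-calling schema shape.
HEADROOM_RETRIEVE_TOOL = {
    "type": "function",
    "function": {
        "name": "headroom_retrieve",
        "description": "Retrieve original uncompressed data from Headroom cache",
        "parameters": {
            "type": "object",
            "properties": {
                "hash": {
                    "type": "string",
                    "description": "The hash key from the compression marker",
                },
                "query": {
                    "type": "string",
                    "description": "Optional: search within the cached data",
                },
            },
            "required": ["hash"],
        },
    },
}

def inject_retrieve_tool(tools: list) -> list:
    # Append the retrieval tool unless it is already present.
    names = {t.get("function", {}).get("name") for t in tools}
    if "headroom_retrieve" not in names:
        return tools + [HEADROOM_RETRIEVE_TOOL]
    return tools
```

Making the injection idempotent keeps repeated passes through the proxy from duplicating the tool.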

Phase 3: Response Handler

When the LLM calls headroom_retrieve:

  1. The Response Handler intercepts the tool call
  2. Data is retrieved from the local cache (around 1ms)
  3. The result is added to the conversation
  4. The API call continues automatically

The client never sees CCR tool calls -- they are handled transparently by Headroom.
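A simplified sketch of the interception logic, using hypothetical dict-shaped tool calls (the real handler works on provider response objects): retrieval calls are resolved locally from the cache, while every other tool call passes through to the application untouched.

```python
def handle_tool_calls(tool_calls, store):
    """Hypothetical sketch: resolve headroom_retrieve calls from the
    local cache; forward all other tool calls unchanged."""
    results = []       # tool messages answered locally by Headroom
    passthrough = []   # application tool calls the client must handle
    for call in tool_calls:
        if call["name"] == "headroom_retrieve":
            data = store.get(call["arguments"]["hash"])
            results.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": data if data is not None else "[cache entry expired]",
            })
        else:
            passthrough.append(call)
    return results, passthrough

store = {"abc123": '[{"file": "auth_middleware.py"}]'}
results, passthrough = handle_tool_calls(
    [
        {"id": "t1", "name": "headroom_retrieve", "arguments": {"hash": "abc123"}},
        {"id": "t2", "name": "search_files", "arguments": {"q": "auth"}},
    ],
    store,
)
```

In the real proxy, the locally resolved tool messages are appended to the conversation and the API call is resumed without the client ever seeing them.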

Phase 4: Context Tracker

Across multiple turns, the Context Tracker maintains awareness of all compressed content:

  1. Remembers what was compressed in earlier turns
  2. Analyzes new queries for relevance to compressed content
  3. Proactively expands relevant data before the LLM asks

Turn 1: User searches for files
        -> 500 files compressed to 15, cached (hash=abc123)
        -> LLM answers with 15 files

Turn 5: User asks "What about the auth middleware?"
        -> Context Tracker detects "auth" may match cached content
        -> Proactively expands compressed data
        -> LLM finds auth_middleware.py in the full list
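The relevance detection in step 2 could be sketched as a simple keyword-overlap check; the actual tracker is presumably more sophisticated, and the function and cache-summary names here are hypothetical:

```python
def find_relevant_caches(query: str, tracked: dict) -> list:
    """Hypothetical sketch: flag cached compressions whose summary
    shares at least one keyword with the new user query."""
    query_terms = set(query.lower().split())
    hits = []
    for hash_key, summary in tracked.items():
        summary_terms = set(summary.lower().split())
        if query_terms & summary_terms:   # any keyword overlap
            hits.append(hash_key)
    return hits

# Turn 1 cached a file search; Turn 5 asks about auth middleware.
tracked = {"abc123": "file search results auth middleware routes tests"}
hits = find_relevant_caches("What about the auth middleware?", tracked)
```

Any hash flagged this way can be expanded before the model responds, so the LLM never has to notice the data was compressed.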

BM25 search within compressed data

The LLM does not have to retrieve everything. It can search within compressed data using the optional query parameter:

{
  "name": "headroom_retrieve",
  "parameters": {
    "hash": "abc123",
    "query": "authentication errors"
  }
}

This runs a BM25 search over the cached items, returning only the relevant subset instead of the full original payload.
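A self-contained sketch of BM25 scoring over cached items. This is illustrative, not Headroom's internal implementation, which may use different parameters or tokenization; note that without stemming, "errors" does not match "error":

```python
import math
from collections import Counter

def bm25_search(query: str, items: list, k1: float = 1.5,
                b: float = 0.75, top_n: int = 5) -> list:
    """Minimal BM25 ranking over cached items (illustrative sketch)."""
    docs = [item.lower().split() for item in items]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    # Document frequency per term, counting each doc once.
    df = Counter(t for d in docs for t in set(d))
    scored = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scored.append((score, items[i]))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [item for score, item in scored[:top_n] if score > 0]

items = ["authentication error in login", "database timeout", "auth token expired"]
hits = bm25_search("authentication errors", items)
```

Only the matching subset is returned to the LLM, so a targeted search costs far fewer tokens than retrieving the full cached payload.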

Retrieving originals

CCR works automatically through the proxy, but you can also retrieve cached data programmatically:

import {  } from "headroom-ai";
import type {  } from "headroom-ai";

// CCR is enabled by default when compressing through the proxy.
const  = await (messages, {
  : "gpt-4o",
});

// Access compressed messages — CCR markers are embedded automatically
.(.messages);

// CCR configuration options
const :  = {
  : true,
  : true,             // Inject headroom_retrieve tool
  : true,  // Add retrieval markers to compressed output
  : true,        // Learn from retrieval patterns
  : 1000,        // Max cached items
  : 3600,        // Cache TTL
};
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",
)

# CCR happens automatically during chat completions.
# The LLM calls headroom_retrieve when it needs more data.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)

# CCR is enabled by default. To disable:
# headroom proxy --no-ccr-responses

# To disable proactive expansion:
# headroom proxy --no-ccr-expansion

Message-level CCR

CCR is not limited to tool outputs. When IntelligentContext drops low-importance messages to fit the context budget, those messages are also stored in CCR:

100-message conversation (50K tokens)
  -> IntelligentContext scores messages by importance
  -> Drops 60 low-scoring messages
  -> Dropped messages cached with hash=def456
  -> Marker inserted: "60 messages dropped, retrieve: def456"

The marker includes the CCR reference so the LLM can recover earlier context:

[Earlier context compressed: 60 message(s) dropped by importance scoring.
Full content available via ccr_retrieve tool with reference 'def456'.]
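The drop-and-cache step can be sketched as follows; the function name, the system-role marker message, and the scoring inputs are all hypothetical, standing in for IntelligentContext's actual importance scorer:

```python
import hashlib
import json

def drop_low_importance(messages, scores, keep_n, cache):
    """Hypothetical sketch: drop the lowest-scoring messages, cache
    them under a content hash, and leave a retrieval marker behind."""
    ranked = sorted(range(len(messages)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:keep_n])
    kept = [messages[i] for i in range(len(messages)) if i in keep]
    dropped = [messages[i] for i in range(len(messages)) if i not in keep]
    # Cache the dropped messages under a short content hash.
    ref = hashlib.sha256(json.dumps(dropped).encode()).hexdigest()[:6]
    cache[ref] = dropped
    marker = {
        "role": "system",
        "content": (
            f"[Earlier context compressed: {len(dropped)} message(s) dropped "
            f"by importance scoring. Full content available via ccr_retrieve "
            f"tool with reference '{ref}'.]"
        ),
    }
    return [marker] + kept, ref

cache = {}
messages = [{"role": "user", "content": f"m{i}"} for i in range(5)]
scores = [0.9, 0.1, 0.8, 0.2, 0.7]
pruned, ref = drop_low_importance(messages, scores, keep_n=3, cache=cache)
```

Because the marker carries the reference, the dropped turns remain one tool call away rather than being gone for good.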

When users retrieve dropped messages via CCR, TOIN learns those message patterns are important and scores them higher in future sessions -- improving drop decisions across all users.

CCR-enabled components

Component           What it compresses                  CCR integration
SmartCrusher        JSON arrays (tool outputs)          Stores original array, marker includes hash
ContentRouter       Code, logs, search results, text    Stores original content by strategy
IntelligentContext  Messages (conversation turns)       Stores dropped messages, marker includes hash

Why CCR matters

Approach                 Risk                 Savings
No compression           None                 0%
Traditional compression  Data loss            70-90%
CCR compression          None (reversible)    70-90%

CCR gives you the savings of aggressive compression with zero risk. The LLM can always retrieve the original data if needed.
