# Cache Optimization
Stabilize message prefixes for provider KV cache hits and configure provider-specific caching strategies.
LLM providers cache prompt prefixes to avoid reprocessing identical input on repeated calls. Headroom's CacheAligner stabilizes your message prefixes so these caches actually hit, and then applies provider-specific strategies to maximize savings.
## How CacheAligner works
System prompts often contain dynamic content -- today's date, session IDs, timestamps -- that changes between requests. Even a single character difference at the start of a prompt invalidates the entire provider cache.
CacheAligner solves this by extracting dynamic content and moving it to the end of the message, keeping the prefix stable:
Before:

```text
"You are helpful. Current Date: 2025-04-06"   <- changes daily, no cache hit
```

After:

```text
"You are helpful."                            <- stable prefix, cache hit
"[Context: Current Date: 2025-04-06]"         <- dynamic part moved to tail
```

The prefix stays byte-identical across requests, so the provider's KV cache can reuse previously computed attention states.
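The extraction step can be sketched in a few lines. This is an illustrative helper, not CacheAligner's actual implementation; the pattern list and function name are assumptions:

```python
import re

# Hypothetical sketch of prefix stabilization: pull dynamic fragments out of
# the prompt and append them at the tail, so the prefix stays byte-identical.
DYNAMIC_PATTERNS = [
    r"Current Date: \d{4}-\d{2}-\d{2}",
    r"Session ID: [a-f0-9-]+",
]

def align_prefix(system_prompt: str) -> str:
    """Move dynamic fragments to the end, keeping the prefix stable."""
    extracted = []
    for pattern in DYNAMIC_PATTERNS:
        for match in re.findall(pattern, system_prompt):
            extracted.append(match)
            system_prompt = system_prompt.replace(match, "")
    stable = " ".join(system_prompt.split())  # normalize leftover whitespace
    if extracted:
        stable += "\n[Context: " + "; ".join(extracted) + "]"
    return stable

print(align_prefix("You are helpful. Current Date: 2025-04-06"))
# -> "You are helpful.\n[Context: Current Date: 2025-04-06]"
```

Two requests made on different days now share the same first line, which is exactly what the provider's prefix cache keys on.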
## Provider-specific strategies
Each LLM provider implements caching differently. Headroom applies the optimal strategy for each.
### Anthropic
Anthropic supports explicit cache_control blocks that mark content as cacheable. Cached input tokens cost 90% less than regular input tokens.
Headroom automatically inserts cache_control breakpoints at the right positions in your messages so that stable prefixes (system prompts, early conversation turns) are cached across requests.
| Metric | Value |
|---|---|
| Cache read discount | 90% off input price |
| Cache write cost | 25% premium on first write |
| Cache TTL | 5 minutes (extended on hit) |
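The shape of the marker Headroom inserts can be pictured on a raw Messages API payload. This only constructs the request dict (no call is sent); `add_cache_breakpoint` is an illustrative helper, not Headroom's API:

```python
# Sketch of an Anthropic Messages API payload with an explicit cache
# breakpoint on the stable system prompt (dict construction only).
def add_cache_breakpoint(system_text: str) -> list:
    """Mark the stable system prompt as cacheable."""
    return [
        {
            "type": "text",
            "text": system_text,
            "cache_control": {"type": "ephemeral"},  # Anthropic's cache marker
        }
    ]

request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "system": add_cache_breakpoint("You are helpful."),
    "messages": [{"role": "user", "content": "Hello"}],
}
```

Everything up to and including the marked block is written to the cache on the first request and read at the discounted rate on subsequent ones.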
### OpenAI
OpenAI uses automatic prefix caching -- if consecutive requests share the same message prefix, the provider reuses cached KV states. No explicit API markers are needed, but the prefix must be byte-identical.
CacheAligner ensures your prefixes remain stable by extracting dynamic content, which is the key requirement for OpenAI prefix caching to work.
| Metric | Value |
|---|---|
| Cache read discount | 50% off input price |
| Activation | Automatic (prefix match) |
| Min prefix length | 1024 tokens |
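Because OpenAI's caching is automatic, the way to confirm it is working is the `usage` object returned with each response, which reports cached prompt tokens. A sketch operating on a plain dict in that shape (no API call made; `cached_fraction` is an illustrative helper):

```python
# Measure what fraction of prompt tokens hit OpenAI's prefix cache,
# given a `usage` dict in the shape the Chat Completions API returns.
def cached_fraction(usage: dict) -> float:
    """Fraction of prompt tokens served from the prefix cache."""
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    total = usage.get("prompt_tokens", 0)
    return cached / total if total else 0.0

usage = {"prompt_tokens": 2048, "prompt_tokens_details": {"cached_tokens": 1536}}
print(cached_fraction(usage))  # -> 0.75
```

A fraction near zero on repeated calls usually means the prefix is drifting (or is under the 1024-token minimum), which is the failure mode CacheAligner exists to prevent.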
### Google
Google provides the CachedContent API, which lets you explicitly cache large context (system instructions, documents, tools) and reference it across requests. Cached tokens cost 75% less.
Headroom can manage CachedContent lifecycle automatically, creating and refreshing cached content objects as needed.
| Metric | Value |
|---|---|
| Cache read discount | 75% off input price |
| Mechanism | Explicit CachedContent API objects |
| Min cache size | 32,768 tokens |
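The lifecycle management Headroom automates can be pictured as a small handle manager: create a CachedContent object once, reuse it until its TTL lapses, then refresh. A hypothetical sketch; `create_fn` stands in for the real Google API call, and all names are illustrative:

```python
import time

# Hypothetical sketch of CachedContent lifecycle management: reuse a
# cached-content handle until its TTL expires, then recreate it.
class CachedContentManager:
    def __init__(self, ttl_seconds: float, create_fn):
        self.ttl = ttl_seconds
        self.create_fn = create_fn  # e.g. wraps the Google caches.create call
        self.handle = None
        self.expires_at = 0.0

    def get(self, now=None) -> str:
        now = time.time() if now is None else now
        if self.handle is None or now >= self.expires_at:
            self.handle = self.create_fn()  # (re)create the cached content
            self.expires_at = now + self.ttl
        return self.handle

calls = []
mgr = CachedContentManager(
    ttl_seconds=3600,
    create_fn=lambda: calls.append(1) or f"cachedContents/{len(calls)}",
)
h1 = mgr.get(now=0)     # first call creates the cached content
h2 = mgr.get(now=100)   # within TTL: same handle reused
h3 = mgr.get(now=4000)  # TTL lapsed: refreshed with a new handle
```

The design choice here is the same trade-off Headroom makes: refreshing costs one write, but every request inside the TTL reads the 75%-discounted cached tokens.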
## Configuration
```typescript
// NOTE: the identifier names in this example were lost in extraction and are
// reconstructed to mirror Headroom's Python API below; check the headroom-ai
// typings for the exact names.
import { compress } from "headroom-ai";
import type {
  CacheAlignerConfig,
  CacheOptimizerConfig,
  HeadroomConfig,
} from "headroom-ai";

// CacheAligner: stabilize prefixes for cache hits
const cacheAligner: CacheAlignerConfig = {
  enabled: true,
  dynamicPatterns: [
    "Today is \\w+ \\d+, \\d{4}",
    "Current time: .*",
  ],
  extractDynamicContent: true, // name illustrative
  appendToTail: true,          // name illustrative
};

// CacheOptimizer: provider-level caching
const cacheOptimizer: CacheOptimizerConfig = {
  enabled: true,
  autoDetectProvider: true, // Detect Anthropic/OpenAI/Google automatically
  minPrefixTokens: 1024,
};

// Full configuration
const config: HeadroomConfig = {
  cacheAligner,
  cacheOptimizer,
};

// Compress with cache optimization
const result = await compress(messages, {
  model: "claude-sonnet-4-20250514",
  config,
});
```

```python
from headroom import HeadroomClient, OpenAIProvider, AnthropicProvider, GoogleProvider
from headroom.transforms import CacheAlignerConfig
from openai import OpenAI

# CacheAligner configuration
aligner_config = CacheAlignerConfig(
    enabled=True,
    dynamic_patterns=[
        r"Today is \w+ \d+, \d{4}",
        r"Current time: .*",
        r"Session ID: [a-f0-9-]+",
    ],
)

# Provider-specific cache settings

# OpenAI: prefix caching (automatic, just keep prefixes stable)
client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(enable_prefix_caching=True),
    enable_cache_optimizer=True,
)

# Anthropic: cache_control blocks (90% read discount)
from anthropic import Anthropic

client = HeadroomClient(
    original_client=Anthropic(),
    provider=AnthropicProvider(enable_cache_control=True),
    enable_cache_optimizer=True,
)

# Google: CachedContent API (75% read discount)
client = HeadroomClient(
    original_client=google_client,  # your Google Gen AI client instance
    provider=GoogleProvider(enable_context_caching=True),
    enable_cache_optimizer=True,
)
```

## How savings compound
CacheAligner and provider caching work together with Headroom's compression transforms:
- SmartCrusher reduces token count by 70-90%
- CacheAligner stabilizes prefixes so provider caches hit
- Provider caching discounts the remaining input tokens by 50-90%
For example, with Anthropic:
- 100K input tokens compressed to 20K (80% savings from SmartCrusher)
- 18K of those 20K hit the cache (90% cache read discount)
- Effective cost: 2K full-price tokens + 18K at 10% = 3.8K equivalent tokens
- Total savings: 96.2% compared to the original 100K tokens
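The arithmetic above, as a quick check. Token counts are in full-price-equivalent units, and the one-time 25% cache-write premium is ignored for simplicity:

```python
# Effective cost after compression + caching, in full-price-equivalent tokens.
def effective_tokens(original: int, compressed: int, cached: int,
                     cache_read_price: float = 0.10) -> float:
    """Uncached tokens cost full price; cached tokens cost the read fraction."""
    uncached = compressed - cached
    return uncached + cached * cache_read_price

eff = effective_tokens(original=100_000, compressed=20_000, cached=18_000)
savings = 1 - eff / 100_000
print(eff, savings)  # -> 3800.0  0.962
```

Swapping in OpenAI's 50% read discount (`cache_read_price=0.50`) or Google's 75% (`0.25`) gives the corresponding compounded savings for those providers.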
## Reversible Compression (CCR)
Compress-Cache-Retrieve architecture that makes compression lossless -- the LLM can always get the original data back.
## Context Management
Intelligent importance-based context management that scores messages by learned patterns, with rolling window fallback and output buffer reservation.