Configuration
All configuration options for the Headroom Python and TypeScript SDKs, the proxy server, and per-request overrides.
Headroom can be configured via the SDK constructor, the proxy command line, environment variables, or per-request overrides.
Modes
| Mode | Behavior | Use Case |
|---|---|---|
| audit | Observes and logs, no modifications | Production monitoring, baseline measurement |
| optimize | Applies safe, deterministic transforms | Production optimization |
| simulate | Returns plan without API call | Testing, cost estimation |
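The three modes differ only in whether the request is sent and whether the transform plan is applied. A toy sketch of the dispatch logic (the plan contents and names here are illustrative, not the library's internals):

```python
def dispatch(mode: str, messages: list) -> dict:
    # A hypothetical transform plan; the real planner inspects the messages.
    plan = {"transforms": ["cache_align", "smart_crush"]}
    if mode == "simulate":
        # Return the plan without calling the provider API.
        return {"plan": plan, "sent": False, "modified": False}
    if mode == "optimize":
        # Apply the transforms, then send.
        return {"plan": plan, "sent": True, "modified": True}
    # "audit": send the request unchanged and log the plan for measurement.
    return {"plan": plan, "sent": True, "modified": False}
```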
SDK Configuration
```typescript
import { Headroom } from 'headroom-ai';

// Reads from HEADROOM_BASE_URL and HEADROOM_API_KEY automatically
const headroom = new Headroom();

// Or configure explicitly
const configured = new Headroom({
  baseUrl: 'http://localhost:8787',
  apiKey: 'your-api-key',
  timeoutMs: 30_000,
  debug: true,
  maxRetries: 2,
});
```

```python
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    # Mode: "audit" (observe only) or "optimize" (apply transforms)
    default_mode="optimize",
    # Enable provider-specific cache optimization
    enable_cache_optimizer=True,
    # Enable query-level semantic caching
    enable_semantic_cache=False,
    # Override default context limits per model
    model_context_limits={
        "gpt-4o": 128000,
        "gpt-4o-mini": 128000,
    },
    # Database location (defaults to temp directory)
    # store_url="sqlite:////absolute/path/to/headroom.db",
)
```

Per-Request Overrides
Override configuration for individual requests:
```typescript
import { Headroom } from 'headroom-ai';

const headroom = new Headroom();
const result = await headroom.chat(messages, {
  model: 'gpt-4o',
  contextLimit: 100_000,
  outputBufferTokens: 15_000,
});
```

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    # Override mode for this request
    headroom_mode="audit",
    # Reserve more tokens for output
    headroom_output_buffer_tokens=8000,
    # Keep last N turns (don't compress)
    headroom_keep_turns=5,
    # Skip compression for specific tools
    headroom_tool_profiles={
        "important_tool": {"skip_compression": True}
    },
)
```

SmartCrusher Configuration
Fine-tune JSON compression behavior:
```python
from headroom.transforms import SmartCrusherConfig

config = SmartCrusherConfig(
    # Maximum items to keep after compression
    max_items_after_crush=15,
    # Minimum tokens before applying compression
    min_tokens_to_crush=200,
    # Relevance scoring tier: "bm25" (fast) or "embedding" (accurate)
    relevance_tier="bm25",
    # Always keep items with these field values
    preserve_fields=["error", "warning", "failure"],
)
```

CacheAligner Configuration
Control prefix stabilization for provider cache hit rates:
```python
from headroom.transforms import CacheAlignerConfig

config = CacheAlignerConfig(
    # Enable/disable cache alignment
    enabled=True,
    # Patterns to extract from system prompt
    dynamic_patterns=[
        r"Today is \w+ \d+, \d{4}",
        r"Current time: .*",
    ],
)
```

RollingWindow Configuration
Control context window management when messages exceed model limits:
```python
from headroom.transforms import RollingWindowConfig

config = RollingWindowConfig(
    # Minimum turns to always keep
    min_keep_turns=3,
    # Reserve tokens for output
    output_buffer_tokens=4000,
    # Drop oldest tool outputs first
    prefer_drop_tool_outputs=True,
)
```

IntelligentContext Configuration
Semantic-aware context management with importance scoring:
```python
from headroom.config import IntelligentContextConfig, ScoringWeights

# Customize scoring weights (must sum to 1.0, or will be normalized)
weights = ScoringWeights(
    recency=0.20,              # Newer messages score higher
    semantic_similarity=0.20,  # Similarity to recent context
    toin_importance=0.25,      # TOIN-learned retrieval patterns
    error_indicator=0.15,      # TOIN-learned error field types
    forward_reference=0.15,    # Messages referenced by later messages
    token_density=0.05,        # Information density
)

config = IntelligentContextConfig(
    enabled=True,
    keep_system=True,           # Never drop system messages
    keep_last_turns=2,          # Protect last N user turns
    output_buffer_tokens=4000,  # Reserve for model output
    use_importance_scoring=True,
    scoring_weights=weights,
    toin_integration=True,      # Use TOIN patterns if available
    recency_decay_rate=0.1,     # Exponential decay lambda
    compress_threshold=0.1,     # Try compression first if <10% over budget
)
```

Scoring Weights
Weights are automatically normalized to sum to 1.0:
```python
weights = ScoringWeights(recency=1.0, toin_importance=1.0)
normalized = weights.normalized()
# recency=0.5, toin_importance=0.5, others=0.0
```

Proxy Configuration
Command Line Options
```shell
# --port:     port to listen on
# --host:     host to bind to
# --budget:   daily budget limit in USD
# --log-file: log file path
headroom proxy \
  --port 8787 \
  --host 0.0.0.0 \
  --budget 10.00 \
  --log-file headroom.jsonl
```

Feature Flags
```shell
# Disable optimization (passthrough mode)
headroom proxy --no-optimize

# Disable semantic caching
headroom proxy --no-cache

# Enable LLMLingua ML compression
headroom proxy --llmlingua
headroom proxy --llmlingua --llmlingua-device cuda --llmlingua-rate 0.4
```

Environment Variables
| Variable | Description | Default |
|---|---|---|
| HEADROOM_LOG_LEVEL | Logging level | INFO |
| HEADROOM_STORE_URL | Database URL | temp directory |
| HEADROOM_DEFAULT_MODE | Default mode | optimize |
| HEADROOM_MODEL_LIMITS | Custom model config (JSON string or file path) | -- |
| HEADROOM_BASE_URL | Base URL of the Headroom proxy (TypeScript SDK) | http://localhost:8787 |
| HEADROOM_API_KEY | API key for Headroom Cloud authentication | -- |
| HEADROOM_SAVINGS_PATH | Override persistent savings file location | ~/.headroom/proxy_savings.json |
| HEADROOM_TELEMETRY | Set to off to disable anonymous telemetry | on |
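Reading these variables follows the usual environment-lookup pattern with the defaults from the table. A minimal sketch (the dictionary key names here are illustrative, not the SDK's internal config schema):

```python
import os

def env_config() -> dict:
    # Variable names and defaults are taken from the table above.
    return {
        "log_level": os.getenv("HEADROOM_LOG_LEVEL", "INFO"),
        "default_mode": os.getenv("HEADROOM_DEFAULT_MODE", "optimize"),
        # Any value other than "off" leaves telemetry enabled.
        "telemetry_enabled": os.getenv("HEADROOM_TELEMETRY", "on") != "off",
    }
```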
Custom Model Configuration
Configure context limits and pricing for new or custom models:
```json
{
  "anthropic": {
    "context_limits": {
      "claude-4-opus-20250301": 200000,
      "claude-custom-finetune": 128000
    },
    "pricing": {
      "claude-4-opus-20250301": {
        "input": 15.00,
        "output": 75.00,
        "cached_input": 1.50
      }
    }
  },
  "openai": {
    "context_limits": {
      "gpt-5": 256000,
      "ft:gpt-4o:my-org": 128000
    }
  }
}
```

Save as ~/.headroom/models.json, or set HEADROOM_MODEL_LIMITS to a JSON string or file path.
Settings are resolved in this order (later overrides earlier):
- Built-in defaults
- ~/.headroom/models.json config file
- HEADROOM_MODEL_LIMITS environment variable
- SDK constructor arguments
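Since HEADROOM_MODEL_LIMITS accepts either a JSON string or a file path, the loader needs a small disambiguation step. A sketch of one way to do it (not the library's actual loader):

```python
import json
import os

def load_model_limits(value: str) -> dict:
    # Treat the value as a file path if it points at an existing file,
    # otherwise parse it as inline JSON.
    if os.path.isfile(value):
        with open(value) as f:
            return json.load(f)
    return json.loads(value)
```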
Pattern-Based Inference
Unknown models are automatically inferred from naming patterns:
| Pattern | Inferred Settings |
|---|---|
| *opus* | 200K context, Opus-tier pricing |
| *sonnet* | 200K context, Sonnet-tier pricing |
| *haiku* | 200K context, Haiku-tier pricing |
| gpt-4o* | 128K context, GPT-4o pricing |
| o1*, o3* | 200K context, reasoning model pricing |
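Glob-style matching is enough to implement this kind of inference. A sketch using the context limits from the table, with first match winning (pricing omitted; the fallback default is an assumption):

```python
import fnmatch

CONTEXT_PATTERNS = [
    ("*opus*", 200_000),
    ("*sonnet*", 200_000),
    ("*haiku*", 200_000),
    ("gpt-4o*", 128_000),
    ("o1*", 200_000),
    ("o3*", 200_000),
]

def infer_context_limit(model: str, default: int = 128_000) -> int:
    # Return the limit of the first matching pattern, else the default.
    for pattern, limit in CONTEXT_PATTERNS:
        if fnmatch.fnmatch(model, pattern):
            return limit
    return default
```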
Provider-Specific Settings
```python
from headroom import OpenAIProvider

provider = OpenAIProvider(
    enable_prefix_caching=True,
)
```

```python
from headroom import AnthropicProvider

provider = AnthropicProvider(
    enable_cache_control=True,
)
```

```python
from headroom import GoogleProvider

provider = GoogleProvider(
    enable_context_caching=True,
)
```

Tool Profiles
Skip or customize compression for specific tools:
```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    headroom_tool_profiles={
        "important_tool": {"skip_compression": True},
        "search_tool": {"max_items_after_crush": 25},
    },
)
```

Configuration Precedence
Settings are applied in this order (later overrides earlier):
- Default values
- Environment variables
- SDK constructor arguments
- Per-request overrides
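This precedence rule amounts to a layered dictionary merge where later layers win. A minimal sketch:

```python
def resolve_config(*layers: dict) -> dict:
    # Later layers override earlier ones; None means "not set".
    merged = {}
    for layer in layers:
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged
```

For example, a per-request override beats an environment default, while unset keys fall through to earlier layers.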
Validation
Validate your configuration at startup:
```python
result = client.validate_setup()
if not result["valid"]:
    print("Configuration issues:")
    for issue in result["issues"]:
        print(f"  - {issue}")
```
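The `{"valid": ..., "issues": [...]}` shape lends itself to simple check functions. A toy sketch of the same contract (these particular checks are illustrative, not the SDK's actual validation logic):

```python
VALID_MODES = {"audit", "optimize", "simulate"}

def validate_config(config: dict) -> dict:
    # Collect human-readable issues; valid means no issues found.
    issues = []
    if config.get("default_mode") not in VALID_MODES:
        issues.append(f"default_mode must be one of {sorted(VALID_MODES)}")
    if config.get("output_buffer_tokens", 0) < 0:
        issues.append("output_buffer_tokens must be non-negative")
    return {"valid": not issues, "issues": issues}
```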