Metrics & Monitoring
Headroom exposes metrics for compression performance, cost savings, and system health through the proxy server (including a Prometheus endpoint) and the SDK.
Proxy Endpoints
Stats Endpoint
curl http://localhost:8787/stats

{
  "persistent_savings": {
    "lifetime": {
      "tokens_saved": 12500,
      "compression_savings_usd": 0.04
    }
  },
  "requests": {
    "total": 42,
    "cached": 5,
    "rate_limited": 0,
    "failed": 0
  },
  "tokens": {
    "input": 50000,
    "output": 8000,
    "saved": 12500,
    "savings_percent": 25.0
  },
  "cost": {
    "total_cost_usd": 0.15,
    "total_savings_usd": 0.04
  },
  "cache": {
    "entries": 10,
    "total_hits": 5
  }
}

Persistent savings are stored at ~/.headroom/proxy_savings.json and survive proxy restarts. Override the path with HEADROOM_SAVINGS_PATH.
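The derived fields in this payload can be recomputed from the raw counters, which is useful when feeding the numbers into your own reporting. The following is a minimal sketch that assumes the response shape shown above. `fetch_stats` and `summarize_stats` are illustrative helper names, not part of Headroom, and treating `cached / total` as a cache hit rate is an assumption about how those counters relate.

```python
import json
from urllib.request import urlopen


def fetch_stats(base_url: str = "http://localhost:8787") -> dict:
    """Fetch /stats from a running proxy (illustrative helper, not a Headroom API)."""
    with urlopen(f"{base_url}/stats", timeout=5) as resp:
        return json.load(resp)


def summarize_stats(stats: dict) -> dict:
    """Recompute headline numbers from the raw counters in a /stats payload."""
    tokens = stats["tokens"]
    requests = stats["requests"]
    total = requests["total"]
    return {
        # Share of input tokens removed by compression, as a percentage.
        "savings_percent": 100.0 * tokens["saved"] / tokens["input"] if tokens["input"] else 0.0,
        # Assumption: "cached" counts requests that were served from cache.
        "cache_hit_rate": requests["cached"] / total if total else 0.0,
        "failed": requests["failed"],
    }


# Sample payload trimmed from the response above, so this runs without a proxy.
sample = {
    "requests": {"total": 42, "cached": 5, "rate_limited": 0, "failed": 0},
    "tokens": {"input": 50000, "output": 8000, "saved": 12500, "savings_percent": 25.0},
}
summary = summarize_stats(sample)
print(f"Savings: {summary['savings_percent']:.1f}%")
print(f"Cache hit rate: {summary['cache_hit_rate']:.1%}")
```

With the sample values, the recomputed savings percentage matches the `savings_percent` of 25.0 reported by the proxy. Against a live proxy, pass the result of `fetch_stats()` to `summarize_stats()` instead of the inline sample.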
Historical Savings
curl http://localhost:8787/stats-history

Returns durable compression history with hourly, daily, weekly, and monthly rollups. Supports CSV export:

curl "http://localhost:8787/stats-history?format=csv&series=daily"
curl "http://localhost:8787/stats-history?format=csv&series=monthly"

Prometheus Metrics
curl http://localhost:8787/metrics

# HELP headroom_requests_total Total requests processed
headroom_requests_total{mode="optimize"} 1234
# HELP headroom_tokens_saved_total Total tokens saved
headroom_tokens_saved_total 5678900
# HELP headroom_compression_ratio Compression ratio histogram
headroom_compression_ratio_bucket{le="0.5"} 890
headroom_compression_ratio_bucket{le="0.7"} 1100
headroom_compression_ratio_bucket{le="0.9"} 1200
# HELP headroom_latency_seconds Request latency histogram
headroom_latency_seconds_bucket{le="0.01"} 800
headroom_latency_seconds_bucket{le="0.1"} 1150
# HELP headroom_cache_hits_total Cache hit counter
headroom_cache_hits_total 456

Health Check
curl http://localhost:8787/health

{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 3600,
  "llmlingua_enabled": false
}

SDK Metrics
Proxy Stats
The TypeScript SDK queries the proxy for stats:
import { } from 'headroom-ai';
const = new ();
// Get proxy stats
const = await .proxyStats();
.(`Tokens saved: ${.tokens.saved}`);
.(`Savings: ${.tokens.savings_percent}%`);

Compression Result Metrics
Every compress() call returns metrics:
import { compress } from 'headroom-ai';

const result = await compress(messages, { model: 'gpt-4o' });
console.log(`Tokens: ${result.tokensBefore} -> ${result.tokensAfter}`);
console.log(`Saved: ${result.tokensSaved} (${(result.compressionRatio * 100).toFixed(1)}%)`);
console.log(`Transforms: ${result.transformsApplied.join(', ')}`);

Session Stats
The Python SDK returns quick stats for the current session without a database query:
stats = client.get_stats()
print(f"Mode: {stats['config']['mode']}")
print(f"Tokens saved: {stats['session']['tokens_saved_total']}")
print(f"Avg compression: {stats['session']['compression_ratio_avg']:.1%}")

Returns:
{
    "session": {
        "requests_total": 10,
        "tokens_input_before": 50000,
        "tokens_input_after": 35000,
        "tokens_saved_total": 15000,
        "tokens_output_total": 8000,
        "cache_hits": 3,
        "compression_ratio_avg": 0.70,
    },
    "config": {
        "mode": "optimize",
        "provider": "openai",
        "cache_optimizer_enabled": True,
        "semantic_cache_enabled": False,
    },
    "transforms": {
        "smart_crusher_enabled": True,
        "cache_aligner_enabled": True,
        "rolling_window_enabled": True,
    },
}

Historical Metrics
Query stored metrics from the database:
from datetime import datetime, timedelta

metrics = client.get_metrics(
    start_time=datetime.utcnow() - timedelta(hours=1),
    limit=100,
)
for m in metrics:
    print(f"{m.timestamp}: {m.tokens_input_before} -> {m.tokens_input_after}")

Summary Statistics
Aggregate statistics across all stored metrics:
summary = client.get_summary()
print(f"Total requests: {summary['total_requests']}")
print(f"Total tokens saved: {summary['total_tokens_saved']}")
print(f"Average compression: {summary['avg_compression_ratio']:.1%}")
print(f"Total cost savings: ${summary['total_cost_saved_usd']:.2f}")

Logging
import logging

# INFO level shows compression summaries
logging.basicConfig(level=logging.INFO)

# Alternatively, DEBUG level shows detailed transform decisions.
# basicConfig only takes effect on its first call, so pick one level.
# logging.basicConfig(level=logging.DEBUG)

Example output:
INFO:headroom.transforms.pipeline:Pipeline complete: 45000 -> 4500 tokens (saved 40500, 90.0% reduction)
INFO:headroom.transforms.smart_crusher:SmartCrusher applied top_n strategy: kept 15 of 1000 items
DEBUG:headroom.transforms.smart_crusher:Kept items: [0,1,2,42,77,97,98,99] (errors at 42, warnings at 77)

# Log to file
headroom proxy --log-file headroom.jsonl
# Increase verbosity
headroom proxy --log-level debug

Cost Tracking
Budget Alerts
Set a budget limit in the proxy:
headroom proxy --budget 10.00

When the budget is exceeded, requests return a budget exceeded error, the /stats endpoint shows budget status, and logs indicate the budget state.
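To get a warning before the proxy starts rejecting requests, one option is to compare reported spend against the budget on the client side. This is a minimal sketch under stated assumptions: it takes `cost.total_cost_usd` from the /stats response as the spend figure, which may not be the exact value the proxy checks against `--budget`, and both the `budget_status` helper and the 80% warning threshold are illustrative choices rather than Headroom features.

```python
def budget_status(total_cost_usd: float, budget_usd: float, warn_at: float = 0.8) -> str:
    """Classify spend against a budget as 'ok', 'warning', or 'exceeded'."""
    if budget_usd <= 0:
        raise ValueError("budget_usd must be positive")
    used = total_cost_usd / budget_usd  # fraction of the budget consumed
    if used > 1.0:
        return "exceeded"
    if used >= warn_at:
        return "warning"
    return "ok"


# 0.15 is total_cost_usd from the sample /stats response; 10.00 mirrors --budget 10.00.
print(budget_status(0.15, 10.00))
# Hypothetical spend values to show the other two states.
print(budget_status(8.50, 10.00))
print(budget_status(10.25, 10.00))
```

The three calls print `ok`, `warning`, and `exceeded` in that order.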
Key Metrics to Monitor
| Metric | What It Tells You | Target |
|---|---|---|
| tokens_saved_total | Total cost savings | Higher is better |
| compression_ratio_avg | Efficiency | 0.7 to 0.9 typical |
| cache_hit_rate | Cache effectiveness | >20% is good |
| latency_p99 | Performance impact | <10ms |
| failed_requests | Reliability | 0 |
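The targets in this table can be turned into a simple automated check. The sketch below treats them as alert thresholds, which is an interpretation rather than documented Headroom behavior. The `check_targets` helper is hypothetical, `cache_hit_rate` is assumed to be a fraction between 0 and 1, and the `latency_p99_ms` key is an assumed name that makes the millisecond unit explicit.

```python
def check_targets(metrics: dict) -> list[str]:
    """Return one warning string for each metric outside its target."""
    warnings = []

    ratio = metrics["compression_ratio_avg"]
    if not 0.7 <= ratio <= 0.9:
        warnings.append(f"compression_ratio_avg {ratio:.2f} is outside the typical 0.7-0.9 range")

    hit_rate = metrics["cache_hit_rate"]
    if hit_rate <= 0.20:
        warnings.append(f"cache_hit_rate {hit_rate:.0%} is not above 20%")

    p99 = metrics["latency_p99_ms"]
    if p99 >= 10:
        warnings.append(f"latency_p99 {p99:.1f}ms is not under 10ms")

    failed = metrics["failed_requests"]
    if failed > 0:
        warnings.append(f"failed_requests is {failed}, expected 0")

    return warnings


# Hypothetical snapshot: every value here is made up for illustration.
snapshot = {
    "compression_ratio_avg": 0.95,
    "cache_hit_rate": 0.05,
    "latency_p99_ms": 25.0,
    "failed_requests": 2,
}
for warning in check_targets(snapshot):
    print(warning)
```

An empty list means every metric is within its target. How the snapshot is populated (from /stats, the SDK, or Prometheus) is left open here.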
Grafana Dashboard
Example Prometheus queries for a Grafana dashboard:
| Panel | PromQL |
|---|---|
| Tokens Saved | headroom_tokens_saved_total |
| Compression Ratio (median) | histogram_quantile(0.5, headroom_compression_ratio_bucket) |
| Request Latency (p99) | histogram_quantile(0.99, headroom_latency_seconds_bucket) |
| Cache Hit Rate | headroom_cache_hits_total / (headroom_cache_hits_total + headroom_cache_misses_total) |
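The cache hit rate formula in the last row can also be checked outside Grafana by reading the /metrics text directly. The following is a minimal sketch, not a full parser: `parse_metrics` and `cache_hit_rate` are illustrative helpers, the regular expression handles only plain `name{labels} value` lines (no timestamps, no `}` inside label values), and the value for `headroom_cache_misses_total` is invented because that counter does not appear in the sample output.

```python
import re

# Matches "name value" or "name{labels} value"; other line shapes are ignored.
SAMPLE_LINE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)$")


def parse_metrics(text: str) -> dict:
    """Map 'name{labels}' to a float for each sample line; comment lines are skipped."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[name + (labels or "")] = float(value)
    return samples


def cache_hit_rate(samples: dict) -> float:
    """hits / (hits + misses), mirroring the Cache Hit Rate panel; 0.0 if no traffic."""
    hits = samples.get("headroom_cache_hits_total", 0.0)
    misses = samples.get("headroom_cache_misses_total", 0.0)
    total = hits + misses
    return hits / total if total else 0.0


SAMPLE = """\
# HELP headroom_requests_total Total requests processed
headroom_requests_total{mode="optimize"} 1234
# HELP headroom_tokens_saved_total Total tokens saved
headroom_tokens_saved_total 5678900
# HELP headroom_cache_hits_total Cache hit counter
headroom_cache_hits_total 456
# Hypothetical value: the misses counter is not shown in the sample output.
headroom_cache_misses_total 1824
"""

samples = parse_metrics(SAMPLE)
print(f"Tokens saved: {samples['headroom_tokens_saved_total']:.0f}")
print(f"Cache hit rate: {cache_hit_rate(samples):.1%}")
```

This computes a lifetime ratio from the raw counters. A dashboard would more commonly apply the same formula to rates over a time window.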