Context Management
Intelligent importance-based context management that scores messages by learned patterns, with rolling window fallback and output buffer reservation.
When conversations grow beyond a model's context window, Headroom decides which messages to keep and which to drop. Instead of naively removing the oldest messages, IntelligentContext scores every message by learned importance and drops the least valuable ones first.
IntelligentContext
IntelligentContext is a message-level compressor. It analyzes your conversation, assigns an importance score to each message, and removes low-scoring messages until the conversation fits within the token budget.
Dropped messages are not lost -- they are stored in CCR for on-demand retrieval by the LLM.
100-message conversation (50K tokens) with a 32K budget
-> Score each message by importance
-> Drop 60 lowest-scoring messages
-> Cache dropped messages in CCR (hash=def456)
-> Insert marker: "60 messages dropped, retrieve: def456"
-> Final context: 40 messages within budget
Scoring weights
Each message receives a weighted score from six factors:
| Weight | Default | Description |
|---|---|---|
| recency | 0.20 | Exponential decay from the end of the conversation. Recent messages score higher. |
| semantic_similarity | 0.20 | Embedding cosine similarity to recent context. Messages related to the current topic score higher. |
| toin_importance | 0.25 | TOIN retrieval rate -- messages matching patterns that users frequently retrieve via CCR are scored higher. Learned across all users. |
| error_indicator | 0.15 | Error detection via TOIN field semantics. Messages containing error patterns (learned, not hardcoded) are preserved. |
| forward_reference | 0.15 | Count of later messages that reference this one. Messages that other messages depend on are kept. |
| token_density | 0.05 | Unique tokens divided by total tokens. Dense, information-rich messages score higher than repetitive ones. |
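Conceptually, these factors combine as a weighted sum. The following Python sketch is illustrative only, not Headroom's internal implementation: the helper functions are simplified stand-ins (whitespace splitting instead of real tokenization), and each factor is assumed to be precomputed to a value in [0, 1].

```python
import math

# Default weights from the table above.
WEIGHTS = {
    "recency": 0.20,
    "semantic_similarity": 0.20,
    "toin_importance": 0.25,
    "error_indicator": 0.15,
    "forward_reference": 0.15,
    "token_density": 0.05,
}

def recency_score(index: int, total: int, decay: float = 0.1) -> float:
    """Exponential decay from the end of the conversation: the last
    message scores 1.0 and earlier messages decay toward 0."""
    return math.exp(-decay * (total - 1 - index))

def token_density(text: str) -> float:
    """Unique tokens divided by total tokens (toy whitespace tokenizer)."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def message_score(factors: dict) -> float:
    """Weighted sum of the six factor scores, each expected in [0, 1]."""
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)
```

Because the weights sum to 1.0, a message that maxes out every factor scores exactly 1.0, which makes scores comparable across conversations.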
No hardcoded patterns
Error detection does not rely on keyword matching like "error" or "fail". Instead, it uses TOIN's learned field_semantics.inferred_type to identify error-bearing messages -- this adapts to your specific data patterns across sessions and users.
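A sketch of what type-based detection could look like. The field-semantics map and its labels below are invented for illustration; in practice TOIN learns them from usage rather than shipping a hardcoded list.

```python
# Hypothetical learned field-semantics map (illustrative contents only).
LEARNED_FIELD_SEMANTICS = {
    "status_code": {"inferred_type": "error_code"},
    "stack_trace": {"inferred_type": "error_payload"},
    "items": {"inferred_type": "data_list"},
}

def error_indicator_score(message_fields: dict) -> float:
    """Fraction of a message's fields whose learned type is error-related."""
    if not message_fields:
        return 0.0
    error_fields = [
        name for name in message_fields
        if LEARNED_FIELD_SEMANTICS.get(name, {})
                                   .get("inferred_type", "")
                                   .startswith("error")
    ]
    return len(error_fields) / len(message_fields)
```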
Weights are automatically normalized to sum to 1.0, so you can set relative values without worrying about exact proportions.
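The normalization step itself is simple; a minimal sketch:

```python
def normalize_weights(weights: dict) -> dict:
    """Scale weights so they sum to 1.0, preserving relative proportions."""
    total = sum(weights.values())
    return {name: value / total for name, value in weights.items()}

# Relative values work: 2 : 2 : 1 normalizes to 0.4 : 0.4 : 0.2.
relative = {"recency": 2.0, "semantic_similarity": 2.0, "token_density": 1.0}
normalized = normalize_weights(relative)
```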
Rolling window fallback
If IntelligentContext is disabled or scoring data is unavailable, Headroom falls back to a rolling window strategy:
- Drop the oldest messages first
- Always keep the system prompt
- Always keep the last N user/assistant turns
- Drop tool calls and their responses as atomic pairs (no orphaned tool data)
This provides a safe baseline that works without any learned data.
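The fallback rules above can be sketched in a few lines of Python. This is a toy model, not Headroom's code: OpenAI-style message dicts and a whitespace token counter are assumptions for illustration.

```python
def rolling_window(messages, budget, keep_last_turns=2,
                   tokens=lambda m: len(str(m.get("content", "")).split())):
    """Oldest-first eviction sketch with the three protection rules."""
    # Group an assistant message that issues tool calls together with the
    # tool responses that follow it, so the pair is dropped atomically.
    units, i = [], 0
    while i < len(messages):
        unit = [messages[i]]
        if messages[i].get("tool_calls"):
            while i + 1 < len(messages) and messages[i + 1]["role"] == "tool":
                i += 1
                unit.append(messages[i])
        units.append(unit)
        i += 1
    # Protect system prompts and everything from the last N user turns on.
    protected, turns = [False] * len(units), 0
    for j in range(len(units) - 1, -1, -1):
        role = units[j][0]["role"]
        if role == "system":
            protected[j] = True
        elif turns < keep_last_turns:
            protected[j] = True
            if role == "user":
                turns += 1
    # Drop the oldest unprotected units until the conversation fits.
    total = sum(tokens(m) for u in units for m in u)
    dropped = set()
    for j, unit in enumerate(units):
        if total <= budget:
            break
        if not protected[j]:
            dropped.add(j)
            total -= sum(tokens(m) for m in unit)
    return [m for j, u in enumerate(units) if j not in dropped for m in u]
```

Note how a tool-calling assistant message and its tool responses form one unit, so eviction never leaves orphaned tool data behind.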
Protection rules
Headroom enforces several protections to ensure model output quality:
Output buffer reservation
A configurable number of tokens is reserved for the model's response. The context budget is calculated as:
context_budget = model_context_limit - output_buffer_tokens
This prevents the input from consuming the entire context window and leaving no room for the model to respond.
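Worked numbers, assuming a 128K-token model (e.g. gpt-4o) and a 4,000-token output buffer:

```python
MODEL_CONTEXT_LIMIT = 128_000   # model's total context window (assumed)
OUTPUT_BUFFER_TOKENS = 4_000    # reserved for the model's response

# Everything left over is available for the input messages.
context_budget = MODEL_CONTEXT_LIMIT - OUTPUT_BUFFER_TOKENS  # 124,000 tokens
```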
System message protection
System messages are never dropped. They contain critical instructions, persona definitions, and tool descriptions that the model needs throughout the conversation.
Turn protection
The last N user/assistant turns are always preserved, ensuring the model has immediate conversational context. By default, the last 2 turns are protected.
Configuration
// NOTE: the TypeScript identifiers below were reconstructed to mirror the
// Python configuration fields in the example that follows.
import { compress } from "headroom-ai";
import type {
  ScoringWeights,
  IntelligentContextConfig,
  RollingWindowConfig,
  HeadroomConfig,
} from "headroom-ai";
// Scoring weights (normalized automatically)
const scoringWeights: ScoringWeights = {
  recency: 0.20,
  semanticSimilarity: 0.20,
  toinImportance: 0.25,
  errorIndicator: 0.15,
  forwardReference: 0.15,
  tokenDensity: 0.05,
};
// IntelligentContext configuration
const intelligentContext: IntelligentContextConfig = {
  enabled: true,
  keepSystem: true,
  keepLastTurns: 2,
  outputBufferTokens: 4000,
  useImportanceScoring: true,
  scoringWeights,
  toinIntegration: true,
  recencyDecayRate: 0.1,
  compressThreshold: 0.1,
};
// Rolling window fallback
const rollingWindow: RollingWindowConfig = {
  enabled: true,
  keepSystem: true,
  keepLastTurns: 3,
  outputBufferTokens: 4000,
};
// Full configuration
const config: HeadroomConfig = {
  intelligentContext,
  rollingWindow,
};
const result = await compress(messages, {
  model: "gpt-4o",
  config,
});
console.log(`Compressed: ${result.tokensBefore} -> ${result.tokensAfter}`);

from headroom import HeadroomClient, OpenAIProvider
from headroom.config import IntelligentContextConfig, ScoringWeights
from openai import OpenAI
# Customize scoring weights
weights = ScoringWeights(
recency=0.20,
semantic_similarity=0.20,
toin_importance=0.25,
error_indicator=0.15,
forward_reference=0.15,
token_density=0.05,
)
context_config = IntelligentContextConfig(
enabled=True,
keep_system=True, # Never drop system messages
keep_last_turns=2, # Protect last 2 user turns
output_buffer_tokens=4000, # Reserve for model output
use_importance_scoring=True,
scoring_weights=weights,
toin_integration=True, # Use TOIN patterns
recency_decay_rate=0.1, # Exponential decay lambda
compress_threshold=0.1, # Try compression first if <10% over budget
)
client = HeadroomClient(
original_client=OpenAI(),
provider=OpenAIProvider(),
default_mode="optimize",
)
# Per-request overrides
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
headroom_output_buffer_tokens=8000, # More room for long responses
headroom_keep_turns=5, # Protect last 5 turns
)
How scoring improves over time
IntelligentContext integrates with TOIN (Tool-Output Intelligence Network) to learn from real usage:
- Messages are dropped based on current scores
- Dropped messages are stored in CCR
- If the LLM retrieves a dropped message, TOIN records that pattern
- Future conversations score similar message patterns higher
- Drop accuracy improves across all users, not just within one session
This feedback loop means the system gets smarter the more it is used. Error messages that users frequently need are automatically preserved, while verbose success messages that nobody retrieves are dropped more aggressively.
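One way to picture the loop is as retrieval-rate bookkeeping per message pattern. This toy sketch only illustrates the direction of the feedback; TOIN's real cross-user aggregation is more involved, and the pattern names are invented.

```python
from collections import defaultdict

class RetrievalStats:
    """Toy model of the feedback loop: count how often dropped messages
    matching a pattern are later retrieved via CCR, and expose the
    retrieval rate as an importance signal."""

    def __init__(self):
        self.drops = defaultdict(int)
        self.retrievals = defaultdict(int)

    def record_drop(self, pattern: str) -> None:
        self.drops[pattern] += 1

    def record_retrieval(self, pattern: str) -> None:
        self.retrievals[pattern] += 1

    def toin_importance(self, pattern: str) -> float:
        """Retrieval rate in [0, 1]; never-dropped patterns score 0."""
        if self.drops[pattern] == 0:
            return 0.0
        return min(1.0, self.retrievals[pattern] / self.drops[pattern])

stats = RetrievalStats()
for _ in range(4):
    stats.record_drop("http_error_payload")
for _ in range(3):
    stats.record_retrieval("http_error_payload")
# Frequently retrieved patterns now score high, so future conversations
# drop them less aggressively.
```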