Context Management
Intelligent importance-based context management that scores messages by learned patterns, with rolling window fallback and output buffer reservation.
When conversations grow beyond a model's context window, Headroom decides which messages to keep and which to drop. Instead of naively removing the oldest messages, IntelligentContext scores every message by learned importance and drops the least valuable ones first.
IntelligentContext
IntelligentContext is a message-level compressor. It analyzes your conversation, assigns an importance score to each message, and removes low-scoring messages until the conversation fits within the token budget.
Dropped messages are not lost -- they are stored in CCR for on-demand retrieval by the LLM.
100-message conversation (50K tokens) with a 32K budget
-> Score each message by importance
-> Drop 60 lowest-scoring messages
-> Cache dropped messages in CCR (hash=def456)
-> Insert marker: "60 messages dropped, retrieve: def456"
-> Final context: 40 messages within budget
Scoring weights
Each message receives a weighted score from six factors:
| Weight | Default | Description |
|---|---|---|
| recency | 0.20 | Exponential decay from the end of the conversation. Recent messages score higher. |
| semantic_similarity | 0.20 | Embedding cosine similarity to recent context. Messages related to the current topic score higher. |
| toin_importance | 0.25 | TOIN retrieval rate -- messages matching patterns that users frequently retrieve via CCR are scored higher. Learned across all users. |
| error_indicator | 0.15 | Error detection via TOIN field semantics. Messages containing error patterns (learned, not hardcoded) are preserved. |
| forward_reference | 0.15 | Count of later messages that reference this one. Messages that other messages depend on are kept. |
| token_density | 0.05 | Unique tokens divided by total tokens. Dense, information-rich messages score higher than repetitive ones. |
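Conceptually, these factors combine as a weighted sum. The following Python sketch is illustrative only, not Headroom's internal implementation: the helper functions are simplified stand-ins (whitespace splitting instead of real tokenization), and each factor is assumed to be precomputed to a value in [0, 1].

```python
import math

# Default weights from the table above.
WEIGHTS = {
    "recency": 0.20,
    "semantic_similarity": 0.20,
    "toin_importance": 0.25,
    "error_indicator": 0.15,
    "forward_reference": 0.15,
    "token_density": 0.05,
}

def recency_score(index: int, total: int, decay: float = 0.1) -> float:
    """Exponential decay from the end of the conversation: the last
    message scores 1.0 and earlier messages decay toward 0."""
    return math.exp(-decay * (total - 1 - index))

def token_density(text: str) -> float:
    """Unique tokens divided by total tokens (toy whitespace tokenizer)."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def message_score(factors: dict) -> float:
    """Weighted sum of the six factor scores, each expected in [0, 1]."""
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)
```

Because the weights sum to 1.0, a message that maxes out every factor scores exactly 1.0, which makes scores comparable across conversations.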
No hardcoded patterns
Error detection does not rely on keyword matching like "error" or "fail". Instead, it uses TOIN's learned field_semantics.inferred_type to identify error-bearing messages -- this adapts to your specific data patterns across sessions and users.
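A sketch of what type-based detection could look like. The field-semantics map and its labels below are invented for illustration; in practice TOIN learns them from usage rather than shipping a hardcoded list.

```python
# Hypothetical learned field-semantics map (illustrative contents only).
LEARNED_FIELD_SEMANTICS = {
    "status_code": {"inferred_type": "error_code"},
    "stack_trace": {"inferred_type": "error_payload"},
    "items": {"inferred_type": "data_list"},
}

def error_indicator_score(message_fields: dict) -> float:
    """Fraction of a message's fields whose learned type is error-related."""
    if not message_fields:
        return 0.0
    error_fields = [
        name for name in message_fields
        if LEARNED_FIELD_SEMANTICS.get(name, {})
                                   .get("inferred_type", "")
                                   .startswith("error")
    ]
    return len(error_fields) / len(message_fields)
```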
Weights are automatically normalized to sum to 1.0, so you can set relative values without worrying about exact proportions.
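The normalization step itself is simple; a minimal sketch:

```python
def normalize_weights(weights: dict) -> dict:
    """Scale weights so they sum to 1.0, preserving relative proportions."""
    total = sum(weights.values())
    return {name: value / total for name, value in weights.items()}

# Relative values work: 2 : 2 : 1 normalizes to 0.4 : 0.4 : 0.2.
relative = {"recency": 2.0, "semantic_similarity": 2.0, "token_density": 1.0}
normalized = normalize_weights(relative)
```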
Rolling window fallback
If IntelligentContext is disabled or scoring data is unavailable, Headroom falls back to a rolling window strategy:
- Drop the oldest messages first
- Always keep the system prompt
- Always keep the last N user/assistant turns
- Drop tool calls and their responses as atomic pairs (no orphaned tool data)
This provides a safe baseline that works without any learned data.
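The fallback rules above can be sketched in a few lines of Python. This is a toy model, not Headroom's code: OpenAI-style message dicts and a whitespace token counter are assumptions for illustration.

```python
def rolling_window(messages, budget, keep_last_turns=2,
                   tokens=lambda m: len(str(m.get("content", "")).split())):
    """Oldest-first eviction sketch with the three protection rules."""
    # Group an assistant message that issues tool calls together with the
    # tool responses that follow it, so the pair is dropped atomically.
    units, i = [], 0
    while i < len(messages):
        unit = [messages[i]]
        if messages[i].get("tool_calls"):
            while i + 1 < len(messages) and messages[i + 1]["role"] == "tool":
                i += 1
                unit.append(messages[i])
        units.append(unit)
        i += 1
    # Protect system prompts and everything from the last N user turns on.
    protected, turns = [False] * len(units), 0
    for j in range(len(units) - 1, -1, -1):
        role = units[j][0]["role"]
        if role == "system":
            protected[j] = True
        elif turns < keep_last_turns:
            protected[j] = True
            if role == "user":
                turns += 1
    # Drop the oldest unprotected units until the conversation fits.
    total = sum(tokens(m) for u in units for m in u)
    dropped = set()
    for j, unit in enumerate(units):
        if total <= budget:
            break
        if not protected[j]:
            dropped.add(j)
            total -= sum(tokens(m) for m in unit)
    return [m for j, u in enumerate(units) if j not in dropped for m in u]
```

Note how a tool-calling assistant message and its tool responses form one unit, so eviction never leaves orphaned tool data behind.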
Protection rules
Headroom enforces several protections to ensure model output quality:
Output buffer reservation
A configurable number of tokens is reserved for the model's response. The context budget is calculated as:
context_budget = model_context_limit - output_buffer_tokens
This prevents the input from consuming the entire context window and leaving no room for the model to respond.
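Worked numbers, assuming a 128K-token model (e.g. gpt-4o) and a 4,000-token output buffer:

```python
MODEL_CONTEXT_LIMIT = 128_000   # model's total context window (assumed)
OUTPUT_BUFFER_TOKENS = 4_000    # reserved for the model's response

# Everything left over is available for the input messages.
context_budget = MODEL_CONTEXT_LIMIT - OUTPUT_BUFFER_TOKENS  # 124,000 tokens
```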
System message protection
System messages are never dropped. They contain critical instructions, persona definitions, and tool descriptions that the model needs throughout the conversation.
Turn protection
The last N user/assistant turns are always preserved, ensuring the model has immediate conversational context. By default, the last 2 turns are protected.
Configuration
// NOTE: the TypeScript identifiers below were reconstructed to mirror the
// Python configuration fields in the example that follows.
import { compress } from "headroom-ai";
import type {
  ScoringWeights,
  IntelligentContextConfig,
  RollingWindowConfig,
  HeadroomConfig,
} from "headroom-ai";
// Scoring weights (normalized automatically)
const scoringWeights: ScoringWeights = {
  recency: 0.20,
  semanticSimilarity: 0.20,
  toinImportance: 0.25,
  errorIndicator: 0.15,
  forwardReference: 0.15,
  tokenDensity: 0.05,
};
// IntelligentContext configuration
const intelligentContext: IntelligentContextConfig = {
  enabled: true,
  keepSystem: true,
  keepLastTurns: 2,
  outputBufferTokens: 4000,
  useImportanceScoring: true,
  scoringWeights,
  toinIntegration: true,
  recencyDecayRate: 0.1,
  compressThreshold: 0.1,
};
// Rolling window fallback
const rollingWindow: RollingWindowConfig = {
  enabled: true,
  keepSystem: true,
  keepLastTurns: 3,
  outputBufferTokens: 4000,
};
// Full configuration
const config: HeadroomConfig = {
  intelligentContext,
  rollingWindow,
};
const result = await compress(messages, {
  model: "gpt-4o",
  config,
});
console.log(`Compressed: ${result.tokensBefore} -> ${result.tokensAfter}`);

from headroom import HeadroomClient, OpenAIProvider
from headroom.config import IntelligentContextConfig, ScoringWeights
from openai import OpenAI
# Customize scoring weights
weights = ScoringWeights(
recency=0.20,
semantic_similarity=0.20,
toin_importance=0.25,
error_indicator=0.15,
forward_reference=0.15,
token_density=0.05,
)
context_config = IntelligentContextConfig(
enabled=True,
keep_system=True, # Never drop system messages
keep_last_turns=2, # Protect last 2 user turns
output_buffer_tokens=4000, # Reserve for model output
use_importance_scoring=True,
scoring_weights=weights,
toin_integration=True, # Use TOIN patterns
recency_decay_rate=0.1, # Exponential decay lambda
compress_threshold=0.1, # Try compression first if <10% over budget
)
client = HeadroomClient(
original_client=OpenAI(),
provider=OpenAIProvider(),
default_mode="optimize",
)
# Per-request overrides
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
headroom_output_buffer_tokens=8000, # More room for long responses
headroom_keep_turns=5, # Protect last 5 turns
)
How scoring improves over time
IntelligentContext integrates with TOIN (Tool-Output Intelligence Network) to learn from real usage:
- Messages are dropped based on current scores
- Dropped messages are stored in CCR
- If the LLM retrieves a dropped message, TOIN records that pattern
- Future conversations score similar message patterns higher
- Drop accuracy improves across all users, not just within one session
This feedback loop means the system gets smarter the more it is used. Error messages that users frequently need are automatically preserved, while verbose success messages that nobody retrieves are dropped more aggressively.
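One way to picture the loop is as retrieval-rate bookkeeping per message pattern. This toy sketch only illustrates the direction of the feedback; TOIN's real cross-user aggregation is more involved, and the pattern names are invented.

```python
from collections import defaultdict

class RetrievalStats:
    """Toy model of the feedback loop: count how often dropped messages
    matching a pattern are later retrieved via CCR, and expose the
    retrieval rate as an importance signal."""

    def __init__(self):
        self.drops = defaultdict(int)
        self.retrievals = defaultdict(int)

    def record_drop(self, pattern: str) -> None:
        self.drops[pattern] += 1

    def record_retrieval(self, pattern: str) -> None:
        self.retrievals[pattern] += 1

    def toin_importance(self, pattern: str) -> float:
        """Retrieval rate in [0, 1]; never-dropped patterns score 0."""
        if self.drops[pattern] == 0:
            return 0.0
        return min(1.0, self.retrievals[pattern] / self.drops[pattern])

stats = RetrievalStats()
for _ in range(4):
    stats.record_drop("http_error_payload")
for _ in range(3):
    stats.record_retrieval("http_error_payload")
# Frequently retrieved patterns now score high, so future conversations
# drop them less aggressively.
```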