LangChain

Automatic context compression for LangChain chat models, memory, retrievers, and agents.

Headroom integrates with LangChain to compress context across all LangChain patterns: chat models, memory, retrievers, agents, and streaming.

Installation

pip install "headroom-ai[langchain]"

Quick start

Wrap any chat model in one line:

from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel

llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

# Use exactly like before
response = llm.invoke("Hello!")

# Check savings
print(llm.get_metrics())
# {'tokens_saved': 12500, 'savings_percent': 45.2, 'requests': 50}

Works with any provider:

from langchain_anthropic import ChatAnthropic

llm = HeadroomChatModel(ChatAnthropic(model="claude-sonnet-4-20250514"))
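
Because HeadroomChatModel preserves the wrapped model's interface, it should drop into LCEL chains unchanged. A minimal sketch (prompt text is illustrative):

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"text": "LangChain is a framework for building LLM applications."}))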

Memory integration

HeadroomChatMessageHistory wraps any chat history with automatic compression. Long conversations stay under your token budget:

from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import ChatMessageHistory
from headroom.integrations import HeadroomChatMessageHistory

base_history = ChatMessageHistory()
compressed_history = HeadroomChatMessageHistory(
    base_history,
    compress_threshold_tokens=4000,  # Compress when over 4K tokens
    keep_recent_turns=5,             # Always keep last 5 turns
)

memory = ConversationBufferMemory(chat_memory=compressed_history)

After a few turns, inspect the compression stats:

print(compressed_history.get_compression_stats())
# {'compression_count': 12, 'total_tokens_saved': 28000}
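
ConversationBufferMemory is LangChain's legacy memory interface; assuming HeadroomChatMessageHistory implements the standard BaseChatMessageHistory interface (it wraps any chat history), it also plugs into the newer RunnableWithMessageHistory pattern. A minimal sketch, reusing llm and compressed_history from above:

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])

chain = RunnableWithMessageHistory(
    prompt | llm,
    lambda session_id: compressed_history,  # one shared history; use a per-session store in production
    input_messages_key="input",
    history_messages_key="history",
)

chain.invoke({"input": "Hi!"}, config={"configurable": {"session_id": "demo"}})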

Retriever integration

HeadroomDocumentCompressor filters retrieved documents by relevance. Retrieve many for recall, keep the best for precision:

from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.vectorstores import FAISS
from headroom.integrations import HeadroomDocumentCompressor

# vectorstore: any existing LangChain vector store (e.g. the imported FAISS)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 50})

compressor = HeadroomDocumentCompressor(
    max_documents=10,
    min_relevance=0.3,
    prefer_diverse=True,  # MMR-style diversity
)

retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)

# Retrieves 50 docs, returns best 10
docs = retriever.invoke("What is Python?")
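
Since the compression retriever behaves like any other retriever, it feeds a standard RAG chain directly. A sketch under the usual LCEL conventions (prompt wording and the format_docs helper are illustrative):

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join the top-ranked documents into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What is Python?")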

Agent tool wrapping

wrap_tools_with_headroom compresses tool outputs before they re-enter the agent's context:

import json

from langchain_core.tools import tool
from headroom.integrations import wrap_tools_with_headroom

@tool
def search_database(query: str) -> str:
    """Search the database."""
    return json.dumps({"results": [...], "total": 1000})

wrapped_tools = wrap_tools_with_headroom(
    [search_database],
    min_chars_to_compress=1000,
)

from langchain.agents import AgentExecutor, create_openai_tools_agent

# llm and prompt defined as elsewhere on this page
agent = create_openai_tools_agent(llm, wrapped_tools, prompt)
executor = AgentExecutor(agent=agent, tools=wrapped_tools)
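
Running the executor is unchanged; oversized tool outputs are compressed transparently before re-entering the context (query text illustrative):

result = executor.invoke({"input": "How many users are in the database?"})
print(result["output"])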

Per-tool metrics:

from headroom.integrations import get_tool_metrics

metrics = get_tool_metrics()
print(metrics.get_summary())
# {'total_invocations': 25, 'total_compressions': 18, 'total_chars_saved': 450000}

LangGraph ReAct agent

Both wrappers compose directly with LangGraph's prebuilt ReAct agent:

from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from headroom.integrations import HeadroomChatModel, wrap_tools_with_headroom

llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
tools = wrap_tools_with_headroom([search_web, query_database])  # your existing @tool functions

agent = create_react_agent(llm, tools)
result = agent.invoke({
    "messages": [("user", "Find users who signed up last week")]
})
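
The result is a standard MessagesState dict, so the agent's final answer is the last message:

print(result["messages"][-1].content)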

LangGraph custom graph

Insert a compression node between tools and the agent in a custom StateGraph:

from langgraph.graph import StateGraph, MessagesState, START, END
from headroom.integrations.langchain import create_compress_tool_messages_node

graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)
graph.add_node("tools", tools_node)
graph.add_node("compress", create_compress_tool_messages_node(
    min_tokens_to_compress=100,
))

# Wire: tools -> compress -> agent
graph.add_edge(START, "agent")
graph.add_edge("tools", "compress")
graph.add_edge("compress", "agent")

Streaming and async

Async invocation and streaming are fully supported:

# Async invoke
response = await llm.ainvoke("Hello!")

# Async streaming
async for chunk in llm.astream("Tell me a story"):
    print(chunk.content, end="", flush=True)
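
Assuming the wrapper also exposes the standard synchronous .stream() interface (as LangChain chat models do), sync streaming looks the same:

for chunk in llm.stream("Tell me a story"):
    print(chunk.content, end="", flush=True)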

Custom configuration

Pass a HeadroomConfig to tune compression behavior:

from headroom import HeadroomConfig, HeadroomMode

config = HeadroomConfig(
    default_mode=HeadroomMode.OPTIMIZE,
    smart_crusher_target_ratio=0.3,
)

llm = HeadroomChatModel(
    ChatOpenAI(model="gpt-4o"),
    headroom_config=config,
)
