Headroom

Persistent Memory

Hierarchical, temporal memory for LLM applications. Enable your AI to remember across conversations with intelligent scoping and versioning.

LLMs have two fundamental limitations: context windows overflow with too much history, and every conversation starts from zero. Persistent Memory solves both by extracting key facts, persisting them, and injecting them when relevant.

This is temporal compression -- instead of carrying 10,000 tokens of conversation history, carry 100 tokens of extracted memories.

Quick Start

from openai import OpenAI
from headroom import with_memory

# One line -- that's it
client = with_memory(OpenAI(), user_id="alice")

# Use exactly like normal
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I prefer Python for backend work"}]
)
# Memory extracted INLINE -- zero extra latency

# Later, in a new conversation...
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What language should I use?"}]
)
# Response uses the Python preference from memory

How It Works

The with_memory() wrapper intercepts every chat completion call:

  1. Inject -- Semantic search finds relevant memories and prepends them to the user message
  2. Instruct -- Adds a memory extraction instruction to the system prompt
  3. Call -- Forwards the request to the LLM
  4. Parse -- Extracts the <memory> block from the response
  5. Store -- Saves with embeddings, vector index, and full-text search index
  6. Return -- Cleans the response (strips the memory block before returning)

Memory extraction happens inline as part of the LLM response. No extra API calls, no extra latency.
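
The parse-and-strip steps (4 and 6) can be sketched as a single regex pass over the raw completion text. This is an illustrative sketch, not the library's actual parser, and the exact `<memory>` tag format is an assumption:

```python
import re

# Assumed tag format: the response embeds facts in a <memory>...</memory> block
MEMORY_RE = re.compile(r"<memory>(.*?)</memory>", re.DOTALL)

def parse_and_strip(text):
    """Return (cleaned_text, extracted_memories) from a raw completion.

    Hypothetical sketch of pipeline steps 4 (parse) and 6 (clean).
    """
    memories = [m.strip() for m in MEMORY_RE.findall(text)]
    cleaned = MEMORY_RE.sub("", text).strip()
    return cleaned, memories

raw = "Use Python for the backend.\n<memory>PREFERENCE: prefers Python</memory>"
cleaned, memories = parse_and_strip(raw)
# cleaned  -> "Use Python for the backend."
# memories -> ["PREFERENCE: prefers Python"]
```

The caller only ever sees the cleaned text; the extracted block goes to storage.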

Hierarchical Scoping

Memories exist at four scope levels, from broadest to narrowest:

  Scope     Persists Across             Use Case
  User      All sessions, all time      Long-term preferences, identity
  Session   Current session only        Current task context
  Agent     Current agent in session    Agent-specific context
  Turn      Single turn only            Ephemeral working memory

from openai import OpenAI
from headroom import with_memory

# Session 1: Morning
client1 = with_memory(
    OpenAI(),
    user_id="bob",
    session_id="morning-session",
)
response = client1.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I prefer Go for performance-critical code"}]
)
# Memory stored at USER level (persists across sessions)

# Session 2: Afternoon (different session, same user)
client2 = with_memory(
    OpenAI(),
    user_id="bob",
    session_id="afternoon-session",
)
response = client2.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What language for my new microservice?"}]
)
# Recalls Go preference from morning session

Memory Categories

Memories are categorized for better organization and retrieval:

  Category    Description                             Examples
  PREFERENCE  Likes, dislikes, preferred approaches   "Prefers Python", "Likes dark mode"
  FACT        Identity, role, constraints             "Works at fintech startup", "Senior engineer"
  CONTEXT     Current goals, ongoing tasks            "Migrating to microservices", "Working on auth"
  ENTITY      Information about entities              "Project Apollo uses React", "Team lead is Sarah"
  DECISION    Decisions made                          "Chose PostgreSQL over MySQL"
  INSIGHT     Derived insights                        "User tends to prefer typed languages"
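
The six categories map naturally onto a string-valued enum. This sketch mirrors the table; the member names and values are assumptions for illustration, not the library's definition:

```python
from enum import Enum

class MemoryCategory(str, Enum):
    """Sketch of the category set from the table above (assumed values)."""
    PREFERENCE = "preference"  # likes, dislikes, preferred approaches
    FACT = "fact"              # identity, role, constraints
    CONTEXT = "context"        # current goals, ongoing tasks
    ENTITY = "entity"          # information about entities
    DECISION = "decision"      # decisions made
    INSIGHT = "insight"        # derived insights

# String-valued members compare against plain strings, which is what
# lets call sites pass either the enum or a string like "fact"
assert MemoryCategory.FACT == "fact"
```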

Memory API

The with_memory() wrapper exposes a .memory attribute for direct access:

client = with_memory(OpenAI(), user_id="alice")

# Search memories (semantic)
results = client.memory.search("python preferences", top_k=5)
for memory in results:
    print(f"{memory.content}")

# Add a memory manually
client.memory.add(
    "User is a senior engineer",
    category="fact",
    importance=0.9,
)

# Get all memories for this user
all_memories = client.memory.get_all()

# Clear all memories
client.memory.clear()

# Get stats
stats = client.memory.stats()
print(f"Total memories: {stats['total']}")
print(f"By category: {stats['categories']}")

Temporal Versioning

When facts change, Headroom creates a supersession chain that preserves history:

from headroom.memory import HierarchicalMemory, MemoryCategory, MemoryFilter

memory = await HierarchicalMemory.create()

# Original fact
orig = await memory.add(
    content="User works at Google",
    user_id="alice",
    category=MemoryCategory.FACT,
)

# User changes jobs -- supersede the old memory
new = await memory.supersede(
    old_memory_id=orig.id,
    new_content="User now works at Anthropic",
)

# Query current state (excludes superseded by default)
current = await memory.query(MemoryFilter(
    user_id="alice",
    include_superseded=False,
))
# Returns only "User now works at Anthropic"

# Get the full chain
chain = await memory.get_history(new.id)
# [
#   Memory(content="User works at Google", is_current=False),
#   Memory(content="User now works at Anthropic", is_current=True),
# ]

This gives you an audit trail, the ability to debug why the LLM made certain decisions, and rollback if needed.
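
The core idea behind a supersession chain can be shown with plain dataclasses: old versions are never deleted, only linked forward to their replacement. This is a minimal sketch, not the library's internal schema:

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional

_ids = itertools.count(1)

@dataclass
class Memory:
    content: str
    id: int = field(default_factory=lambda: next(_ids))
    superseded_by: Optional[int] = None  # forward link to the newer version

    @property
    def is_current(self):
        return self.superseded_by is None

store = {}

def add(content):
    m = Memory(content)
    store[m.id] = m
    return m

def supersede(old_id, new_content):
    new = add(new_content)
    store[old_id].superseded_by = new.id  # old record kept, just linked forward
    return new

def history(mem_id):
    """Walk the chain back to its root, then return oldest-to-newest."""
    chain = [store[mem_id]]
    while True:
        prev = next((m for m in store.values() if m.superseded_by == chain[-1].id), None)
        if prev is None:
            break
        chain.append(prev)
    return list(reversed(chain))

orig = add("User works at Google")
new = supersede(orig.id, "User now works at Anthropic")
# orig.is_current -> False, new.is_current -> True
# history(new.id) -> both versions, oldest first
```

Because nothing is deleted, "exclude superseded" is just a filter on `is_current`, and the full chain stays available for auditing.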

Backends

Embedder Backends

from headroom.memory import MemoryConfig, EmbedderBackend

# Local embeddings (recommended -- fast, free, private)
config = MemoryConfig(
    embedder_backend=EmbedderBackend.LOCAL,
    embedder_model="all-MiniLM-L6-v2",
)

# OpenAI embeddings (higher quality, costs money)
config = MemoryConfig(
    embedder_backend=EmbedderBackend.OPENAI,
    openai_api_key="sk-...",
    embedder_model="text-embedding-3-small",
)

# Ollama embeddings (local server, many models)
config = MemoryConfig(
    embedder_backend=EmbedderBackend.OLLAMA,
    ollama_base_url="http://localhost:11434",
    embedder_model="nomic-embed-text",
)

Storage

Storage uses SQLite for CRUD and filtering, HNSW for vector similarity search, and FTS5 for full-text keyword search. All embedded -- no external services required.

config = MemoryConfig(
    db_path="memory.db",
    vector_dimension=384,
    hnsw_ef_construction=200,
    hnsw_m=16,
    hnsw_ef_search=50,
    cache_enabled=True,
    cache_max_size=1000,
)
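
The hybrid retrieval idea (vector similarity via HNSW plus keyword match via FTS5) amounts to blending two scores per document. Here is a toy sketch of that fusion; the weighting and the crude keyword scorer are assumptions, not the library's ranking:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def keyword_score(query, text):
    """Crude FTS stand-in: fraction of query terms present in the text."""
    terms = query.lower().split()
    return sum(t in text.lower() for t in terms) / len(terms)

def hybrid_rank(query, query_vec, docs, alpha=0.7):
    """docs: list of (text, embedding). Blend vector and keyword scores."""
    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text), text)
        for text, vec in docs
    ]
    return [text for _, text in sorted(scored, reverse=True)]

docs = [
    ("User prefers Python for backend work", [0.9, 0.1]),
    ("User likes dark mode", [0.1, 0.9]),
]
ranking = hybrid_rank("python backend", [0.8, 0.2], docs)
# ranking[0] -> "User prefers Python for backend work"
```

In the real store, HNSW replaces the linear cosine scan and FTS5 replaces the substring check, but the score-fusion shape is the same.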

Provider Compatibility

Memory works with any OpenAI-compatible client:

from openai import OpenAI
from headroom import with_memory

# OpenAI
client = with_memory(OpenAI(), user_id="alice")

# Azure OpenAI
client = with_memory(
    OpenAI(base_url="https://your-resource.openai.azure.com/..."),
    user_id="alice",
)

# Groq
from groq import Groq
client = with_memory(Groq(), user_id="alice")

Performance

  Operation          Latency / Cost    Notes
  Memory injection   <50ms             Local embeddings + HNSW search
  Memory extraction  +50-100 tokens    Part of the LLM response (inline)
  Memory storage     <10ms             SQLite + HNSW + FTS5 indexing
  Cache hit          <1ms              LRU cache lookup
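
The sub-millisecond cache-hit figure is what an in-process LRU lookup costs. A minimal sketch with `OrderedDict` (the eviction policy and keying-by-query are assumptions; `max_size` mirrors the `cache_max_size` config field):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache illustrating the kind of lookup behind
    the <1ms cache-hit figure. Details are assumptions."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used

cache = LRUCache(max_size=2)
cache.put("q1", ["mem-a"])
cache.put("q2", ["mem-b"])
cache.get("q1")             # touch q1 so q2 becomes least recent
cache.put("q3", ["mem-c"])  # capacity exceeded: evicts q2
# cache.get("q2") -> None; cache.get("q1") -> ["mem-a"]
```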
