Persistent Memory
Hierarchical, temporal memory for LLM applications. Enable your AI to remember across conversations with intelligent scoping and versioning.
LLMs have two fundamental limitations: context windows overflow with too much history, and every conversation starts from zero. Persistent Memory solves both by extracting key facts, persisting them, and injecting them when relevant.
This is temporal compression -- instead of carrying 10,000 tokens of conversation history, carry 100 tokens of extracted memories.
Quick Start
```python
from openai import OpenAI
from headroom import with_memory

# One line -- that's it
client = with_memory(OpenAI(), user_id="alice")

# Use exactly like normal
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I prefer Python for backend work"}]
)
# Memory extracted INLINE -- zero extra latency

# Later, in a new conversation...
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What language should I use?"}]
)
# Response uses the Python preference from memory
```
How It Works
The with_memory() wrapper intercepts every chat completion call:
- Inject -- Semantic search finds relevant memories and prepends them to the user message
- Instruct -- Adds a memory extraction instruction to the system prompt
- Call -- Forwards the request to the LLM
- Parse -- Extracts the `<memory>` block from the response
- Store -- Saves the memory with embeddings, a vector index entry, and a full-text search index entry
- Return -- Cleans the response (strips the memory block before returning)
Memory extraction happens inline as part of the LLM response. No extra API calls, no extra latency.
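The six steps above can be sketched as a single request handler. This is a simplified illustration of the flow, not Headroom's actual implementation; the `call_llm`, `search`, and `store` callables are hypothetical stand-ins for the wrapped client and memory store:

```python
import re

MEMORY_RE = re.compile(r"<memory>(.*?)</memory>", re.DOTALL)

def handle_chat(call_llm, search, store, messages):
    """Illustrative inject -> instruct -> call -> parse -> store -> return flow."""
    # 1. Inject: prepend relevant memories to the user message
    user_msg = messages[-1]["content"]
    memories = search(user_msg)
    if memories:
        context = "Relevant memories:\n" + "\n".join(memories)
        messages[-1] = {"role": "user", "content": context + "\n\n" + user_msg}

    # 2. Instruct: ask the model to emit a <memory> block
    system = {"role": "system",
              "content": "If the user states a durable fact, append <memory>fact</memory>."}

    # 3. Call the underlying LLM
    raw = call_llm([system] + messages)

    # 4. Parse and 5. Store any extracted memories
    for fact in MEMORY_RE.findall(raw):
        store(fact.strip())

    # 6. Return: strip the memory block before handing the text back
    return MEMORY_RE.sub("", raw).strip()
```

Because extraction rides along in the model's own response, steps 4-6 are pure string processing -- this is where the "no extra API calls" property comes from.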
Hierarchical Scoping
Memories exist at four scope levels, from broadest to narrowest:
| Scope | Persists Across | Use Case |
|---|---|---|
| User | All sessions, all time | Long-term preferences, identity |
| Session | Current session only | Current task context |
| Agent | Current agent in session | Agent-specific context |
| Turn | Single turn only | Ephemeral working memory |
```python
from openai import OpenAI
from headroom import with_memory

# Session 1: Morning
client1 = with_memory(
    OpenAI(),
    user_id="bob",
    session_id="morning-session",
)
response = client1.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I prefer Go for performance-critical code"}]
)
# Memory stored at USER level (persists across sessions)

# Session 2: Afternoon (different session, same user)
client2 = with_memory(
    OpenAI(),
    user_id="bob",
    session_id="afternoon-session",
)
response = client2.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What language for my new microservice?"}]
)
# Recalls Go preference from morning session
```
Memory Categories
Memories are categorized for better organization and retrieval:
| Category | Description | Examples |
|---|---|---|
| PREFERENCE | Likes, dislikes, preferred approaches | "Prefers Python", "Likes dark mode" |
| FACT | Identity, role, constraints | "Works at fintech startup", "Senior engineer" |
| CONTEXT | Current goals, ongoing tasks | "Migrating to microservices", "Working on auth" |
| ENTITY | Information about entities | "Project Apollo uses React", "Team lead is Sarah" |
| DECISION | Decisions made | "Chose PostgreSQL over MySQL" |
| INSIGHT | Derived insights | "User tends to prefer typed languages" |
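In practice the LLM picks the category during extraction, but a toy keyword heuristic makes the intent of each bucket concrete. This is purely illustrative -- it is not Headroom's categorization logic, and the enum here just mirrors the table above:

```python
from enum import Enum

class MemoryCategory(Enum):  # mirrors the categories in the table above
    PREFERENCE = "preference"
    FACT = "fact"
    CONTEXT = "context"
    ENTITY = "entity"
    DECISION = "decision"
    INSIGHT = "insight"

KEYWORDS = {
    MemoryCategory.PREFERENCE: ("prefers", "likes", "dislikes"),
    MemoryCategory.DECISION: ("chose", "decided"),
    MemoryCategory.CONTEXT: ("working on", "migrating"),
}

def guess_category(content: str) -> MemoryCategory:
    """Toy keyword classifier; the real system lets the LLM choose the category."""
    lowered = content.lower()
    for category, words in KEYWORDS.items():
        if any(w in lowered for w in words):
            return category
    return MemoryCategory.FACT  # reasonable default for declarative statements
```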
Memory API
The with_memory() wrapper exposes a .memory attribute for direct access:
```python
client = with_memory(OpenAI(), user_id="alice")

# Search memories (semantic)
results = client.memory.search("python preferences", top_k=5)
for memory in results:
    print(memory.content)

# Add a memory manually
client.memory.add(
    "User is a senior engineer",
    category="fact",
    importance=0.9,
)

# Get all memories for this user
all_memories = client.memory.get_all()

# Clear all memories
client.memory.clear()

# Get stats
stats = client.memory.stats()
print(f"Total memories: {stats['total']}")
print(f"By category: {stats['categories']}")
```
Temporal Versioning
When facts change, Headroom creates a supersession chain that preserves history:
```python
from headroom.memory import HierarchicalMemory, MemoryCategory, MemoryFilter

memory = await HierarchicalMemory.create()

# Original fact
orig = await memory.add(
    content="User works at Google",
    user_id="alice",
    category=MemoryCategory.FACT,
)

# User changes jobs -- supersede the old memory
new = await memory.supersede(
    old_memory_id=orig.id,
    new_content="User now works at Anthropic",
)

# Query current state (excludes superseded by default)
current = await memory.query(MemoryFilter(
    user_id="alice",
    include_superseded=False,
))
# Returns only "User now works at Anthropic"

# Get the full chain
chain = await memory.get_history(new.id)
# [
#   Memory(content="User works at Google", is_current=False),
#   Memory(content="User now works at Anthropic", is_current=True),
# ]
```
This gives you an audit trail, the ability to debug why the LLM made certain decisions, and rollback if needed.
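Conceptually, a supersession chain is just a linked list of versions where only the tail is "current". A toy model of that structure (illustrative only, not Headroom's schema -- field and class names here are assumptions):

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional

_ids = itertools.count(1)

@dataclass
class Memory:
    content: str
    id: int = field(default_factory=lambda: next(_ids))
    superseded_by: Optional[int] = None  # id of the memory that replaced this one

    @property
    def is_current(self) -> bool:
        return self.superseded_by is None

class Store:
    def __init__(self):
        self.memories = {}

    def add(self, content):
        m = Memory(content)
        self.memories[m.id] = m
        return m

    def supersede(self, old_id, new_content):
        # Link old -> new instead of deleting, so history is preserved
        new = self.add(new_content)
        self.memories[old_id].superseded_by = new.id
        return new

    def current(self):
        return [m for m in self.memories.values() if m.is_current]
```

Because nothing is ever deleted, walking the `superseded_by` links forward (or backward) recovers the full history -- the property the audit trail and rollback features rely on.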
Backends
Embedder Backends
```python
from headroom.memory import MemoryConfig, EmbedderBackend

# Local embeddings (recommended -- fast, free, private)
config = MemoryConfig(
    embedder_backend=EmbedderBackend.LOCAL,
    embedder_model="all-MiniLM-L6-v2",
)

# OpenAI embeddings (higher quality, costs money)
config = MemoryConfig(
    embedder_backend=EmbedderBackend.OPENAI,
    openai_api_key="sk-...",
    embedder_model="text-embedding-3-small",
)

# Ollama embeddings (local server, many models)
config = MemoryConfig(
    embedder_backend=EmbedderBackend.OLLAMA,
    ollama_base_url="http://localhost:11434",
    embedder_model="nomic-embed-text",
)
```
Storage
Storage uses SQLite for CRUD and filtering, HNSW for vector similarity search, and FTS5 for full-text keyword search. All embedded -- no external services required.
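To see how the keyword and vector indexes can cooperate, here is a toy hybrid-retrieval sketch. It is not Headroom's internals: the HNSW index is replaced by an exact cosine scan over a dict, and merging the two rankings with reciprocal-rank fusion is an assumption about how such results might be combined:

```python
import math
import sqlite3

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# In-memory stand-ins for the real indexes
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE mem_fts USING fts5(content)")
vectors = {}  # rowid -> embedding (exact-scan stand-in for HNSW)

def add_memory(rowid, content, embedding):
    db.execute("INSERT INTO mem_fts(rowid, content) VALUES (?, ?)", (rowid, content))
    vectors[rowid] = embedding

def hybrid_search(query_text, query_vec, k=3):
    """Merge keyword and vector hits with reciprocal-rank fusion."""
    fts_hits = [row[0] for row in db.execute(
        "SELECT rowid FROM mem_fts WHERE mem_fts MATCH ? ORDER BY rank",
        (query_text,))]
    vec_hits = sorted(vectors, key=lambda r: cosine(vectors[r], query_vec),
                      reverse=True)
    scores = {}
    for ranking in (fts_hits, vec_hits):
        for rank, rowid in enumerate(ranking):
            scores[rowid] = scores.get(rowid, 0.0) + 1.0 / (60 + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A memory that ranks highly in both indexes accumulates score from both rankings, which is why hybrid retrieval tends to beat either index alone.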
```python
config = MemoryConfig(
    db_path="memory.db",
    vector_dimension=384,
    hnsw_ef_construction=200,
    hnsw_m=16,
    hnsw_ef_search=50,
    cache_enabled=True,
    cache_max_size=1000,
)
```
Provider Compatibility
Memory works with any OpenAI-compatible client:
```python
from openai import OpenAI
from headroom import with_memory

# OpenAI
client = with_memory(OpenAI(), user_id="alice")

# Azure OpenAI
client = with_memory(
    OpenAI(base_url="https://your-resource.openai.azure.com/..."),
    user_id="alice",
)

# Groq
from groq import Groq
client = with_memory(Groq(), user_id="alice")
```
Performance
| Operation | Latency | Notes |
|---|---|---|
| Memory injection | <50ms | Local embeddings + HNSW search |
| Memory extraction | +50-100 tokens | Part of LLM response (inline) |
| Memory storage | <10ms | SQLite + HNSW + FTS5 indexing |
| Cache hit | <1ms | LRU cache lookup |
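The sub-millisecond cache-hit figure is what you would expect from an in-process LRU. A minimal sketch of that pattern (illustrative, not Headroom's implementation; `max_size` here mirrors the `cache_max_size` setting above):

```python
from collections import OrderedDict

class LRUCache:
    """Bounded cache: O(1) lookups, evicts the least recently used entry."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the oldest entry
```

Caching, say, query embeddings this way means a repeated search skips both the embedder and the index entirely -- a dict lookup rather than a model call.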