42.3. Memory Systems (Vector DBs for Long-term Recall)
Status: Draft Version: 1.0.0 Tags: #Memory, #VectorDB, #Qdrant, #Rust, #MemGPT Author: MLOps Team
Table of Contents
- The Goldfish Problem
- The Memory Hierarchy: Sensory, Working, Long-Term
- Vector Databases as the Hippocampus
- Rust Implementation: Semantic Memory Module
- Context Paging: The MemGPT Pattern
- Memory Consolidation: Sleep Jobs
- Infrastructure: Scaling Qdrant / Weaviate
- Troubleshooting: Why Does My Agent Forget?
- Future Trends: Neural Turing Machines
- MLOps Interview Questions
- Glossary
- Summary Checklist
The Goldfish Problem
Standard LLMs have amnesia: every request starts from a blank slate. Methods to fix this (a prompt-assembly sketch for the third option follows the list):
- Context Stuffing: Paste the previous chat into the prompt. (Limited by the 8k/32k-token window.)
- Summarization: Summarize the old chat. (Lossy.)
- Vector Retrieval: Retrieve only the relevant past chats. (The Solution.)
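A minimal sketch of how option 3 is wired into a request, assuming the retrieved memories arrive as plain strings (the Qdrant-backed MemoryManager::recall later in this section returns exactly that):

/// Sketch: assemble a prompt from retrieved memories plus the new user message.
/// `memories` would come from a vector search (see MemoryManager::recall below).
fn build_prompt(memories: &[String], user_message: &str) -> String {
    let mut prompt = String::from("Relevant past context:\n");
    for m in memories {
        prompt.push_str("- ");
        prompt.push_str(m);
        prompt.push('\n');
    }
    prompt.push_str("\nCurrent user message:\n");
    prompt.push_str(user_message);
    prompt
}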
The Memory Hierarchy: Sensory, Working, Long-Term
Cognitive Science gives us a blueprint.
| Type | Human | Agent | Capacity |
|---|---|---|---|
| Sensory | 0.5s (Iconic) | Raw Input Buffer | Infinite (Log Stream) |
| Working (STM) | 7 $\pm$ 2 items | Context Window | 128k Tokens |
| Long-Term (LTM) | Lifetime | Vector Database | Petabytes |
The Goal: Move items from STM to LTM before they slide out of the Context Window.
Vector Databases as the Hippocampus
The Hippocampus indexes memories by content, not just time. “Where did I leave my keys?” -> Activates neurons for “Keys”.
Vector Search:
- Query: "Keys". Embedding: [0.1, 0.9, -0.2].
- Search DB: Find the vectors closest (Cosine Similarity, sketched below) to the query.
- Result: "I put them on the table" ([0.12, 0.88, -0.1]).
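A minimal sketch of the similarity math behind that lookup (plain Rust, no external crates):

/// Cosine similarity between two vectors of equal length.
/// 1.0 = identical direction, 0.0 = orthogonal, -1.0 = opposite.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

// cosine_similarity(&[0.1, 0.9, -0.2], &[0.12, 0.88, -0.1]) is ~0.99: a strong match.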
Deep Dive: HNSW (Hierarchical Navigable Small World)
How do we find the closest vector among 1 Billion vectors in 5ms? We can’t scan them all ($O(N)$). HNSW is a graph algorithm ($O(\log N)$).
- Layer 0: A dense graph of all points.
- Layer 1: A sparse graph (a skip list).
- Layer 2: Even sparser.
Search starts at the top layer, zooms in to the right neighborhood, then drops down a layer. Like finding a house via "Continent -> Country -> City -> Street". A sketch of the tunable build parameters follows.
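Qdrant exposes the HNSW build parameters at collection-creation time. A minimal sketch, assuming the field names of recent qdrant-client releases (verify against the version you pin):

use qdrant_client::qdrant::HnswConfigDiff;

// Sketch: HNSW build parameters, passed via `hnsw_config: Some(...)` in the
// CreateCollection call shown later in this section.
fn hnsw_params() -> HnswConfigDiff {
    HnswConfigDiff {
        m: Some(16),             // edges per node per layer: higher = better recall, more RAM
        ef_construct: Some(200), // build-time candidate list: higher = slower build, better graph
        ..Default::default()
    }
}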
Rust Implementation: Semantic Memory Module
We build a persistent memory module using Qdrant (Rust-based Vector DB).
Project Structure
agent-memory/
├── Cargo.toml
└── src/
└── lib.rs
Cargo.toml:
[package]
name = "agent-memory"
version = "0.1.0"
edition = "2021"
[dependencies]
qdrant-client = "1.5"
tokio = { version = "1", features = ["full"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
async-openai = "0.14" # For embedding generation
anyhow = "1.0"
uuid = { version = "1.0", features = ["v4"] }
chrono = "0.4" # Timestamps on stored memories
src/lib.rs:
use qdrant_client::prelude::*;
use qdrant_client::qdrant::{
    vectors_config, CreateCollection, Distance, PointStruct, SearchPoints, VectorParams,
    VectorsConfig,
};
use async_openai::{Client, types::CreateEmbeddingRequestArgs};
use uuid::Uuid;
pub struct MemoryManager {
qdrant: QdrantClient,
openai: Client<async_openai::config::OpenAIConfig>,
collection: String,
}
impl MemoryManager {
/// Initialize the Memory Manager.
/// Connects to Qdrant and creates the collection if missing.
pub async fn new(url: &str, collection: &str) -> Result<Self, anyhow::Error> {
let qdrant = QdrantClient::from_url(url).build()?;
let openai = Client::new();
// Critical: Check if collection exists before writing.
if !qdrant.has_collection(collection.to_string()).await? {
println!("Creating collection: {}", collection);
qdrant.create_collection(&CreateCollection {
collection_name: collection.to_string(),
// Config must match the embedding model dimensionality
vectors_config: Some(VectorsConfig {
config: Some(vectors_config::Config::Params(VectorParams {
size: 1536, // OpenAI Ada-002 dimension
distance: Distance::Cosine.into(),
..Default::default()
})),
}),
..Default::default()
}).await?;
}
Ok(Self {
qdrant,
openai,
collection: collection.to_string()
})
}
/// Add a thought/observation to Long Term Memory
pub async fn remember(&self, text: &str) -> Result<(), anyhow::Error> {
// 1. Generate Embedding
// Cost Alert: This costs money. Batch this in production.
let request = CreateEmbeddingRequestArgs::default()
.model("text-embedding-ada-002")
.input(text)
.build()?;
let response = self.openai.embeddings().create(request).await?;
let vector = response.data[0].embedding.clone();
// 2. Wrap in Qdrant Point
        // Store the original text as payload so we can read it back
        let payload: Payload = serde_json::json!({
            "text": text,
            "timestamp": chrono::Utc::now().to_rfc3339()
        })
        .try_into()
        .map_err(|e| anyhow::anyhow!("invalid payload: {e}"))?;
        let point = PointStruct::new(
            Uuid::new_v4().to_string(), // Random ID
            vector,
            payload,
        );
// 3. Upsert
self.qdrant.upsert_points(
self.collection.clone(),
None,
vec![point],
None,
).await?;
Ok(())
}
/// Retrieve relevant memories
pub async fn recall(&self, query: &str, limit: u64) -> Result<Vec<String>, anyhow::Error> {
// 1. Embed Query
let request = CreateEmbeddingRequestArgs::default()
.model("text-embedding-ada-002")
.input(query)
.build()?;
let response = self.openai.embeddings().create(request).await?;
let vector = response.data[0].embedding.clone();
// 2. Search
let search_result = self.qdrant.search_points(&SearchPoints {
collection_name: self.collection.clone(),
vector: vector,
limit: limit,
with_payload: Some(true.into()),
// Add filtering here if you have Multi-Tenancy!
// filter: Some(Filter::new_must(Condition::matches("user_id", "123"))),
..Default::default()
}).await?;
// 3. Extract Text from Payload
let memories: Vec<String> = search_result.result.into_iter().filter_map(|p| {
// "text" field in payload
p.payload.get("text")?.as_str().map(|s| s.to_string())
}).collect();
Ok(memories)
}
}
Context Paging: The MemGPT Pattern
How do operating systems handle limited RAM? Paging: they swap memory pages to disk. MemGPT does the same for agents.
The Context Window is RAM. The Vector DB is Disk. The Agent has special tools (a sketch of this tool surface follows the list):
- CoreMemory.append(text): Writes to the System Prompt (Pinned RAM).
- ArchivalMemory.search(query): Reads from the Vector DB (Disk).
- ArchivalMemory.insert(text): Writes to the Vector DB (Disk).
The LLM decides what to keep in RAM and what to swap to Disk.
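A minimal sketch of that tool surface on top of the MemoryManager built above. PagedMemory and MemoryTool are illustrative names, not MemGPT's actual API; the core string stands in for the pinned section of the system prompt:

/// Hypothetical MemGPT-style memory tools exposed to the LLM.
pub enum MemoryTool {
    CoreMemoryAppend { text: String },
    ArchivalMemoryInsert { text: String },
    ArchivalMemorySearch { query: String },
}

pub struct PagedMemory {
    pub core: String,           // pinned "RAM": re-sent in every system prompt
    pub archive: MemoryManager, // the Qdrant-backed "disk" from this section
}

impl PagedMemory {
    pub async fn dispatch(&mut self, tool: MemoryTool) -> Result<Vec<String>, anyhow::Error> {
        match tool {
            MemoryTool::CoreMemoryAppend { text } => {
                self.core.push_str(&text);
                self.core.push('\n');
                Ok(vec![])
            }
            MemoryTool::ArchivalMemoryInsert { text } => {
                self.archive.remember(&text).await?;
                Ok(vec![])
            }
            MemoryTool::ArchivalMemorySearch { query } => self.archive.recall(&query, 5).await,
        }
    }
}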
Memory Consolidation: Sleep Jobs
Humans consolidate memories during sleep. Agents need Offline Consolidation Jobs.
The “Dreaming” Pipeline (Cron Job); a code sketch follows the list:
- Fetch all memories from the last 24h.
- Clustering: Group related memories (“User asked about Python”, “User asked about Rust”).
- Summarization: Replace 50 raw logs with 1 summary (“User is a polyglot programmer”).
- Garbage Collection: Delete duplicate or trivial logs (“Hello”, “Ok”).
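A sketch of that nightly job, assuming hypothetical fetch_last_24h, cluster, and summarize helpers (your own scroll query, clustering routine, and LLM summarization call); only the shape of the loop matters here:

/// Hypothetical consolidation job, run nightly (e.g. from a cron-triggered binary).
async fn consolidate(
    memory: &MemoryManager,
    fetch_last_24h: impl Fn() -> Vec<String>,
    cluster: impl Fn(Vec<String>) -> Vec<Vec<String>>,
    summarize: impl Fn(&[String]) -> String,
) -> Result<(), anyhow::Error> {
    let raw = fetch_last_24h();              // 1. fetch yesterday's raw memories
    for group in cluster(raw) {              // 2. group related memories
        let summary = summarize(&group);     // 3. replace N raw logs with 1 summary
        memory.remember(&summary).await?;
        // 4. garbage collection: delete the raw points behind `group`
        //    once the summary is safely stored.
    }
    Ok(())
}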
Infrastructure: Scaling Qdrant / Weaviate
Index building is CPU-intensive; search is latency-sensitive.
Reference Architecture:
- Write Node (Indexer): High CPU. Batches updates. Rebuilds HNSW graphs.
- Read Replicas: High RAM (vectors cached in memory). Serve queries.
- Sharding: Shard by user_id. User A’s memories never mix with User B’s.
# Docker Compose for a minimal Qdrant cluster
version: '3.8'
services:
  qdrant-primary:
    image: qdrant/qdrant:latest
    environment:
      - QDRANT__CLUSTER__ENABLED=true
    command: ./qdrant --uri http://qdrant-primary:6335
  qdrant-node-1:
    image: qdrant/qdrant:latest
    environment:
      - QDRANT__CLUSTER__ENABLED=true
    command: ./qdrant --bootstrap http://qdrant-primary:6335
    depends_on:
      - qdrant-primary
Troubleshooting: Why Does My Agent Forget?
Scenario 1: The Recency Bias
- Symptom: Agent remembers what you said 2 minutes ago, but not 2 days ago.
- Cause: Standard cosine search returns most relevant, not most recent. If “Hello” (today) has low similarity to “Project Specs” (yesterday), it won’t appear.
- Fix: Recency-Weighted Scoring: $Score = CosineSim(q, d) \times Decay(time)$. A re-ranking sketch follows.
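A minimal sketch of that re-ranking step, applied after the cosine search returns its candidates; the half-life constant is an illustrative knob, not a recommended value:

/// Re-rank a retrieved memory by cosine score * exponential time decay.
/// After `half_life_hours`, the memory's weight is halved.
fn recency_weighted(score: f32, age_hours: f32, half_life_hours: f32) -> f32 {
    let decay = (-std::f32::consts::LN_2 * age_hours / half_life_hours).exp();
    score * decay
}

// recency_weighted(0.82, 48.0, 24.0) = 0.82 * 0.25 ~ 0.21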
Scenario 2: Index Fragmentation
- Symptom: Recall speed drops to 500ms.
- Cause: Frequent updates (Insert/Delete) fragment the HNSW graph.
- Fix: Optimize/Vacuum the index nightly.
Scenario 3: The Duplicate Memory
- Symptom: Agent retrieves “My name is Alex” 5 times.
- Cause: You inserted the same memory every time the user mentioned their name.
- Fix: Deduplication. Before insert, query for semantic duplicates (Distance < 0.01). If found, update the timestamp instead of inserting a new point. A sketch follows.
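A sketch of that check as an extra method on the MemoryManager from above; the 0.99 score_threshold is an illustrative cutoff (with cosine distance, Qdrant returns a similarity score, so ~1.0 means near-identical):

impl MemoryManager {
    /// Sketch: only insert if no near-duplicate already exists.
    /// Returns true if a new memory was written.
    pub async fn remember_dedup(&self, text: &str) -> Result<bool, anyhow::Error> {
        // Embed the candidate text.
        let request = CreateEmbeddingRequestArgs::default()
            .model("text-embedding-ada-002")
            .input(text)
            .build()?;
        let response = self.openai.embeddings().create(request).await?;
        let vector = response.data[0].embedding.clone();

        // Look for an existing memory that is semantically (near-)identical.
        let dup = self.qdrant.search_points(&SearchPoints {
            collection_name: self.collection.clone(),
            vector,
            limit: 1,
            score_threshold: Some(0.99), // cosine similarity cutoff
            ..Default::default()
        }).await?;

        if !dup.result.is_empty() {
            return Ok(false); // duplicate found: skip (or update its timestamp instead)
        }
        self.remember(text).await?; // re-embeds; acceptable for a sketch
        Ok(true)
    }
}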
Scenario 4: Cosine Similarity > 1.0?
- Symptom: Metric returns 1.00001.
- Cause: Floating point error or vectors not normalized.
- Fix: Always normalize vectors ($v / \|v\|$) before insertion, as in the helper below.
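A minimal normalization helper to run before insertion:

/// L2-normalize a vector in place: v / ||v||.
/// Afterwards, cosine similarity is just a dot product and stays within [-1, 1]
/// (up to floating-point error).
fn normalize(v: &mut [f32]) {
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}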
Future Trends: Neural Turing Machines
Vector DBs are external memory. NTM / MANN (Memory-Augmented Neural Networks) make the memory differentiable: the network learns how to read and write memory during backprop. Currently research territory (DeepMind), but it may eventually replace manual Vector DB lookups.
MLOps Interview Questions
- Q: What is “HNSW”? A: Hierarchical Navigable Small World. The standard algorithm for Approximate Nearest Neighbor (ANN) search. It’s like a Skip List for high-dimensional vectors.
- Q: Why not just fine-tune the LLM on the user’s data? A: Fine-tuning is slow and expensive. You can’t fine-tune after every chat message. A Vector DB provides instant knowledge updates. (RAG > Fine-Tuning for facts.)
- Q: How do you handle “referential ambiguity”? A: The user says “Delete it.” What is “it”? The Agent needs to query STM (History) to resolve “it” = “file.txt” before retrieving from LTM.
- Q: What is the dimensionality of Ada-002? A: 1536 dimensions.
- Q: How do you secure the Vector DB? A: RLS (Row Level Security), i.e. filtering. Every query MUST carry filter: { user_id: "alex" }. Failing to filter is a massive privacy breach (data leakage between users). A filtered-search sketch follows.
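A sketch of that mandatory filter using the Rust client's Filter::all / Condition::matches helpers (verify the names against the qdrant-client version you pin; scoped_search is an illustrative helper):

use qdrant_client::qdrant::{Condition, Filter, SearchPoints};

// Sketch: every recall query carries a user_id filter so tenants never mix.
fn scoped_search(collection: &str, vector: Vec<f32>, user_id: &str) -> SearchPoints {
    SearchPoints {
        collection_name: collection.to_string(),
        vector,
        limit: 5,
        // Filter::all = all conditions must match (logical AND).
        filter: Some(Filter::all([Condition::matches(
            "user_id",
            user_id.to_string(),
        )])),
        ..Default::default()
    }
}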
Glossary
- HNSW: Graph-based algorithm for approximate nearest-neighbor vector search.
- Embeddings: Numeric vector representations of text that capture its meaning.
- RAG: Retrieval Augmented Generation.
- Semantic Search: Searching by meaning, not keywords.
Summary Checklist
- Filtering: Always filter by session_id or user_id. One user must never see another’s vectors.
- Dimension Check: Ensure the embedding model’s output dimension (1536) matches the DB config. Mismatch = Crash.
- Dedup: Hash content before inserting. Don’t store “Hi” 1000 times.
- Backup: Vector DBs are stateful. Snapshot them to S3 daily.
- Latency: Retrieval should be < 50ms. If > 100ms, check the HNSW build parameters (m, ef_construct).