42.3. Memory Systems (Vector DBs for Long-term Recall)
Status: Draft Version: 1.0.0 Tags: #Memory, #VectorDB, #Qdrant, #Rust, #MemGPT Author: MLOps Team
Table of Contents
- The Goldfish Problem
- The Memory Hierarchy: Sensory, Working, Long-Term
- Vector Databases as the Hippocampus
- Rust Implementation: Semantic Memory Module
- Context Paging: The MemGPT Pattern
- Memory Consolidation: Sleep Jobs
- Infrastructure: Scaling Qdrant / Weaviate
- Troubleshooting: Why Does My Agent Forget?
- Future Trends: Neural Turing Machines
- MLOps Interview Questions
- Glossary
- Summary Checklist
The Goldfish Problem
Standard LLMs have amnesia: every request starts from a blank slate. Methods to fix this (a prompt-assembly sketch for the third option follows the list):
- Context Stuffing: Paste the previous chat into the prompt. (Limited by the 8k/32k-token window.)
- Summarization: Summarize the old chat. (Lossy.)
- Vector Retrieval: Retrieve only the relevant past chats. (The Solution.)
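A minimal sketch of how option 3 is wired into a request, assuming the retrieved memories arrive as plain strings (the Qdrant-backed MemoryManager::recall later in this section returns exactly that):

/// Sketch: assemble a prompt from retrieved memories plus the new user message.
/// `memories` would come from a vector search (see MemoryManager::recall below).
fn build_prompt(memories: &[String], user_message: &str) -> String {
    let mut prompt = String::from("Relevant past context:\n");
    for m in memories {
        prompt.push_str("- ");
        prompt.push_str(m);
        prompt.push('\n');
    }
    prompt.push_str("\nCurrent user message:\n");
    prompt.push_str(user_message);
    prompt
}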
The Memory Hierarchy: Sensory, Working, Long-Term
Cognitive Science gives us a blueprint.
| Type | Human | Agent | Capacity |
|---|---|---|---|
| Sensory | 0.5s (Iconic) | Raw Input Buffer | Infinite (Log Stream) |
| Working (STM) | 7 $\pm$ 2 items | Context Window | 128k Tokens |
| Long-Term (LTM) | Lifetime | Vector Database | Petabytes |
The Goal: Move items from STM to LTM before they slide out of the Context Window.
Vector Databases as the Hippocampus
The Hippocampus indexes memories by content, not just time. “Where did I leave my keys?” -> Activates neurons for “Keys”.
Vector Search:
- Query: "Keys". Embedding: [0.1, 0.9, -0.2].
- Search DB: Find the vectors closest (Cosine Similarity, sketched below) to the query.
- Result: "I put them on the table" ([0.12, 0.88, -0.1]).
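A minimal sketch of the similarity math behind that lookup (plain Rust, no external crates):

/// Cosine similarity between two vectors of equal length.
/// 1.0 = identical direction, 0.0 = orthogonal, -1.0 = opposite.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

// cosine_similarity(&[0.1, 0.9, -0.2], &[0.12, 0.88, -0.1]) is ~0.99: a strong match.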
Deep Dive: HNSW (Hierarchical Navigable Small World)
How do we find the closest vector among 1 Billion vectors in 5ms? We can’t scan them all ($O(N)$). HNSW is a graph algorithm ($O(\log N)$).
- Layer 0: A dense graph of all points.
- Layer 1: A sparse graph (a skip list).
- Layer 2: Even sparser.
Search starts at the top layer, zooms in to the right neighborhood, then drops down a layer. Like finding a house via "Continent -> Country -> City -> Street". A sketch of the tunable build parameters follows.
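Qdrant exposes the HNSW build parameters at collection-creation time. A minimal sketch, assuming the field names of recent qdrant-client releases (verify against the version you pin):

use qdrant_client::qdrant::HnswConfigDiff;

// Sketch: HNSW build parameters, passed via `hnsw_config: Some(...)` in the
// CreateCollection call shown later in this section.
fn hnsw_params() -> HnswConfigDiff {
    HnswConfigDiff {
        m: Some(16),             // edges per node per layer: higher = better recall, more RAM
        ef_construct: Some(200), // build-time candidate list: higher = slower build, better graph
        ..Default::default()
    }
}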
Rust Implementation: Semantic Memory Module
We build a persistent memory module using Qdrant (Rust-based Vector DB).
Project Structure
agent-memory/
├── Cargo.toml
└── src/
└── lib.rs
Cargo.toml:
[package]
name = "agent-memory"
version = "0.1.0"
edition = "2021"
[dependencies]
qdrant-client = "1.5"
tokio = { version = "1", features = ["full"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
async-openai = "0.14" # For embedding generation
anyhow = "1.0"
uuid = { version = "1.0", features = ["v4"] }
chrono = "0.4" # Timestamps on stored memories
src/lib.rs:
use qdrant_client::prelude::*;
use qdrant_client::qdrant::{
    vectors_config, CreateCollection, Distance, PointStruct, SearchPoints, VectorParams,
    VectorsConfig,
};
use async_openai::{Client, types::CreateEmbeddingRequestArgs};
use uuid::Uuid;
pub struct MemoryManager {
qdrant: QdrantClient,
openai: Client<async_openai::config::OpenAIConfig>,
collection: String,
}
impl MemoryManager {
/// Initialize the Memory Manager.
/// Connects to Qdrant and creates the collection if missing.
pub async fn new(url: &str, collection: &str) -> Result<Self, anyhow::Error> {
let qdrant = QdrantClient::from_url(url).build()?;
let openai = Client::new();
// Critical: Check if collection exists before writing.
if !qdrant.has_collection(collection.to_string()).await? {
println!("Creating collection: {}", collection);
qdrant.create_collection(&CreateCollection {
collection_name: collection.to_string(),
// Config must match the embedding model dimensionality
vectors_config: Some(VectorsConfig {
config: Some(vectors_config::Config::Params(VectorParams {
size: 1536, // OpenAI Ada-002 dimension
distance: Distance::Cosine.into(),
..Default::default()
})),
}),
..Default::default()
}).await?;
}
Ok(Self {
qdrant,
openai,
collection: collection.to_string()
})
}
/// Add a thought/observation to Long Term Memory
pub async fn remember(&self, text: &str) -> Result<(), anyhow::Error> {
// 1. Generate Embedding
// Cost Alert: This costs money. Batch this in production.
let request = CreateEmbeddingRequestArgs::default()
.model("text-embedding-ada-002")
.input(text)
.build()?;
let response = self.openai.embeddings().create(request).await?;
let vector = response.data[0].embedding.clone();
// 2. Wrap in Qdrant Point
        // Store the original text as payload so we can read it back
        let payload: Payload = serde_json::json!({
            "text": text,
            "timestamp": chrono::Utc::now().to_rfc3339()
        })
        .try_into()
        .map_err(|e| anyhow::anyhow!("invalid payload: {e}"))?;
        let point = PointStruct::new(
            Uuid::new_v4().to_string(), // Random ID
            vector,
            payload,
        );
// 3. Upsert
self.qdrant.upsert_points(
self.collection.clone(),
None,
vec![point],
None,
).await?;
Ok(())
}
/// Retrieve relevant memories
pub async fn recall(&self, query: &str, limit: u64) -> Result<Vec<String>, anyhow::Error> {
// 1. Embed Query
let request = CreateEmbeddingRequestArgs::default()
.model("text-embedding-ada-002")
.input(query)
.build()?;
let response = self.openai.embeddings().create(request).await?;
let vector = response.data[0].embedding.clone();
// 2. Search
let search_result = self.qdrant.search_points(&SearchPoints {
collection_name: self.collection.clone(),
vector: vector,
limit: limit,
with_payload: Some(true.into()),
// Add filtering here if you have Multi-Tenancy!
// filter: Some(Filter::new_must(Condition::matches("user_id", "123"))),
..Default::default()
}).await?;
// 3. Extract Text from Payload
let memories: Vec<String> = search_result.result.into_iter().filter_map(|p| {
// "text" field in payload
p.payload.get("text")?.as_str().map(|s| s.to_string())
}).collect();
Ok(memories)
}
}
Context Paging: The MemGPT Pattern
How do operating systems handle limited RAM? Paging: they swap memory pages to disk. MemGPT does the same for agents.
The Context Window is RAM. The Vector DB is Disk. The Agent has special tools (a sketch of this tool surface follows the list):
- CoreMemory.append(text): Writes to the System Prompt (Pinned RAM).
- ArchivalMemory.search(query): Reads from the Vector DB (Disk).
- ArchivalMemory.insert(text): Writes to the Vector DB (Disk).
The LLM decides what to keep in RAM and what to swap to Disk.
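A minimal sketch of that tool surface on top of the MemoryManager built above. PagedMemory and MemoryTool are illustrative names, not MemGPT's actual API; the core string stands in for the pinned section of the system prompt:

/// Hypothetical MemGPT-style memory tools exposed to the LLM.
pub enum MemoryTool {
    CoreMemoryAppend { text: String },
    ArchivalMemoryInsert { text: String },
    ArchivalMemorySearch { query: String },
}

pub struct PagedMemory {
    pub core: String,           // pinned "RAM": re-sent in every system prompt
    pub archive: MemoryManager, // the Qdrant-backed "disk" from this section
}

impl PagedMemory {
    pub async fn dispatch(&mut self, tool: MemoryTool) -> Result<Vec<String>, anyhow::Error> {
        match tool {
            MemoryTool::CoreMemoryAppend { text } => {
                self.core.push_str(&text);
                self.core.push('\n');
                Ok(vec![])
            }
            MemoryTool::ArchivalMemoryInsert { text } => {
                self.archive.remember(&text).await?;
                Ok(vec![])
            }
            MemoryTool::ArchivalMemorySearch { query } => self.archive.recall(&query, 5).await,
        }
    }
}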
Memory Consolidation: Sleep Jobs
Humans consolidate memories during sleep. Agents need Offline Consolidation Jobs.
The “Dreaming” Pipeline (Cron Job); a code sketch follows the list:
- Fetch all memories from the last 24h.
- Clustering: Group related memories (“User asked about Python”, “User asked about Rust”).
- Summarization: Replace 50 raw logs with 1 summary (“User is a polyglot programmer”).
- Garbage Collection: Delete duplicate or trivial logs (“Hello”, “Ok”).
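A sketch of that nightly job, assuming hypothetical fetch_last_24h, cluster, and summarize helpers (your own scroll query, clustering routine, and LLM summarization call); only the shape of the loop matters here:

/// Hypothetical consolidation job, run nightly (e.g. from a cron-triggered binary).
async fn consolidate(
    memory: &MemoryManager,
    fetch_last_24h: impl Fn() -> Vec<String>,
    cluster: impl Fn(Vec<String>) -> Vec<Vec<String>>,
    summarize: impl Fn(&[String]) -> String,
) -> Result<(), anyhow::Error> {
    let raw = fetch_last_24h();              // 1. fetch yesterday's raw memories
    for group in cluster(raw) {              // 2. group related memories
        let summary = summarize(&group);     // 3. replace N raw logs with 1 summary
        memory.remember(&summary).await?;
        // 4. garbage collection: delete the raw points behind `group`
        //    once the summary is safely stored.
    }
    Ok(())
}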
Infrastructure: Scaling Qdrant / Weaviate
Index building is CPU-intensive; search is latency-sensitive.
Reference Architecture:
- Write Node (Indexer): High CPU. Batches updates. Rebuilds HNSW graphs.
- Read Replicas: High RAM (vectors cached in memory). Serve queries.
- Sharding: Shard by user_id. User A’s memories never mix with User B’s.
# Docker Compose for a minimal Qdrant cluster
version: '3.8'
services:
  qdrant-primary:
    image: qdrant/qdrant:latest
    environment:
      - QDRANT__CLUSTER__ENABLED=true
    command: ./qdrant --uri http://qdrant-primary:6335
  qdrant-node-1:
    image: qdrant/qdrant:latest
    environment:
      - QDRANT__CLUSTER__ENABLED=true
    command: ./qdrant --bootstrap http://qdrant-primary:6335
    depends_on:
      - qdrant-primary
Troubleshooting: Why Does My Agent Forget?
Scenario 1: The Recency Bias
- Symptom: Agent remembers what you said 2 minutes ago, but not 2 days ago.
- Cause: Standard cosine search returns most relevant, not most recent. If “Hello” (today) has low similarity to “Project Specs” (yesterday), it won’t appear.
- Fix: Recency-Weighted Scoring: $Score = CosineSim(q, d) \times Decay(time)$. A re-ranking sketch follows.
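A minimal sketch of that re-ranking step, applied after the cosine search returns its candidates; the half-life constant is an illustrative knob, not a recommended value:

/// Re-rank a retrieved memory by cosine score * exponential time decay.
/// After `half_life_hours`, the memory's weight is halved.
fn recency_weighted(score: f32, age_hours: f32, half_life_hours: f32) -> f32 {
    let decay = (-std::f32::consts::LN_2 * age_hours / half_life_hours).exp();
    score * decay
}

// recency_weighted(0.82, 48.0, 24.0) = 0.82 * 0.25 ~ 0.21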
Scenario 2: Index Fragmentation
- Symptom: Recall speed drops to 500ms.
- Cause: Frequent updates (Insert/Delete) fragment the HNSW graph.
- Fix: Optimize/Vacuum the index nightly.
Scenario 3: The Duplicate Memory
- Symptom: Agent retrieves “My name is Alex” 5 times.
- Cause: You inserted the same memory every time the user mentioned their name.
- Fix: Deduplication. Before insert, query for semantic duplicates (Distance < 0.01). If found, update the timestamp instead of inserting a new point. A sketch follows.
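A sketch of that check as an extra method on the MemoryManager from above; the 0.99 score_threshold is an illustrative cutoff (with cosine distance, Qdrant returns a similarity score, so ~1.0 means near-identical):

impl MemoryManager {
    /// Sketch: only insert if no near-duplicate already exists.
    /// Returns true if a new memory was written.
    pub async fn remember_dedup(&self, text: &str) -> Result<bool, anyhow::Error> {
        // Embed the candidate text.
        let request = CreateEmbeddingRequestArgs::default()
            .model("text-embedding-ada-002")
            .input(text)
            .build()?;
        let response = self.openai.embeddings().create(request).await?;
        let vector = response.data[0].embedding.clone();

        // Look for an existing memory that is semantically (near-)identical.
        let dup = self.qdrant.search_points(&SearchPoints {
            collection_name: self.collection.clone(),
            vector,
            limit: 1,
            score_threshold: Some(0.99), // cosine similarity cutoff
            ..Default::default()
        }).await?;

        if !dup.result.is_empty() {
            return Ok(false); // duplicate found: skip (or update its timestamp instead)
        }
        self.remember(text).await?; // re-embeds; acceptable for a sketch
        Ok(true)
    }
}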
Scenario 4: Cosine Similarity > 1.0?
- Symptom: Metric returns 1.00001.
- Cause: Floating point error or vectors not normalized.
- Fix: Always normalize vectors ($v / \|v\|$) before insertion, as in the helper below.
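A minimal normalization helper to run before insertion:

/// L2-normalize a vector in place: v / ||v||.
/// Afterwards, cosine similarity is just a dot product and stays within [-1, 1]
/// (up to floating-point error).
fn normalize(v: &mut [f32]) {
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}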
Future Trends: Neural Turing Machines
Vector DBs are external memory. NTM / MANN (Memory-Augmented Neural Networks) make the memory differentiable: the network learns how to read and write memory during backprop. Currently research territory (DeepMind), but it may eventually replace manual Vector DB lookups.
MLOps Interview Questions
- Q: What is “HNSW”? A: Hierarchical Navigable Small World. The standard algorithm for Approximate Nearest Neighbor (ANN) search. It’s like a Skip List for high-dimensional vectors.
- Q: Why not just fine-tune the LLM on the user’s data? A: Fine-tuning is slow and expensive. You can’t fine-tune after every chat message. A Vector DB provides instant knowledge updates. (RAG > Fine-Tuning for facts.)
- Q: How do you handle “referential ambiguity”? A: The user says “Delete it.” What is “it”? The Agent needs to query STM (History) to resolve “it” = “file.txt” before retrieving from LTM.
- Q: What is the dimensionality of Ada-002? A: 1536 dimensions.
- Q: How do you secure the Vector DB? A: RLS (Row Level Security), i.e. filtering. Every query MUST carry filter: { user_id: "alex" }. Failing to filter is a massive privacy breach (data leakage between users). A filtered-search sketch follows.
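A sketch of that mandatory filter using the Rust client's Filter::all / Condition::matches helpers (verify the names against the qdrant-client version you pin; scoped_search is an illustrative helper):

use qdrant_client::qdrant::{Condition, Filter, SearchPoints};

// Sketch: every recall query carries a user_id filter so tenants never mix.
fn scoped_search(collection: &str, vector: Vec<f32>, user_id: &str) -> SearchPoints {
    SearchPoints {
        collection_name: collection.to_string(),
        vector,
        limit: 5,
        // Filter::all = all conditions must match (logical AND).
        filter: Some(Filter::all([Condition::matches(
            "user_id",
            user_id.to_string(),
        )])),
        ..Default::default()
    }
}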
Glossary
- HNSW: Graph-based algorithm for approximate nearest-neighbor vector search.
- Embeddings: Numeric vector representations of text that capture its meaning.
- RAG: Retrieval Augmented Generation.
- Semantic Search: Searching by meaning, not keywords.
Summary Checklist
- Filtering: Always filter by session_id or user_id. One user must never see another’s vectors.
- Dimension Check: Ensure the embedding model’s output dimension (1536) matches the DB config. Mismatch = Crash.
- Dedup: Hash content before inserting. Don’t store “Hi” 1000 times.
- Backup: Vector DBs are stateful. Snapshot them to S3 daily.
- Latency: Retrieval should be < 50ms. If > 100ms, check the HNSW build parameters (m, ef_construct).