22.2 Context Management Across Boundaries

“Intelligence is the ability to maintain context over time.”

In a single chat session, context is easy: just append the message to the list. In a multi-model, multi-agent system, context is hard.

  • Fragmentation: Agent A has the user’s name. Agent B has the user’s credit card.
  • Drift: The conversation topic shifts, but the vector search is stuck on the old topic.
  • Overflow: 128k tokens is a lot, until you dump a 50MB log file into it.

This section details the Context Architecture required to maintain high-fidelity conversations across distributed models.


22.2.1. The “Lost in the Middle” Phenomenon

Before we discuss storage, we must discuss Recall. LLMs are not databases. Research (Liu et al., 2023) shows that as context grows:

  1. Beginning: High Recall (Primacy Bias).
  2. Middle: Low Recall (The “Lost” Zone).
  3. End: High Recall (Recency Bias).

Implication for MLOps: Simply “stuffing” the context window is an Anti-Pattern. You must Optimize the context before sending it. A 4k prompt with relevant info outperforms a 100k prompt with noise.


22.2.2. Architecture: Context Tiering

We classify context into 3 Tiers based on Lifecycle and Latency.

| Tier | Name | Storage | Persistence | Latency | Example |
|------|------|---------|-------------|---------|---------|
| L1 | Hot Context | In-Memory / Redis | Session-Scoped | < 5ms | “The user just said ‘Yes’.” |
| L2 | Warm Context | Vector DB / DynamoDB | User-Scoped | < 100ms | “User prefers Python over Java.” |
| L3 | Cold Context | S3 / Data Lake | Global | > 500ms | “User’s billing history from 2022.” |

The Architecture Diagram:

graph TD
    User -->|Message| Orchestrator
    
    subgraph "Context Assembly"
        Orchestrator -->|Read| L1(Redis: Hot)
        Orchestrator -->|Query| L2(Pinecone: Warm)
        Orchestrator -->|Search| L3(S3: Cold)
    end
    
    L1 --> LLM
    L2 --> LLM
    L3 --> LLM
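
The diagram implies a fan-out: the Orchestrator should hit all three tiers in parallel rather than sequentially. A minimal sketch, assuming hypothetical redis_l1, vector_l2, and s3_l3 client objects that wrap your real integrations:

import asyncio

async def assemble_tiers(session_id: str, user_id: str, query: str) -> dict:
    # Fan out to all three tiers at once; the slowest tier (L3) dominates latency.
    hot, warm, cold = await asyncio.gather(
        redis_l1.get_history(session_id),        # L1: last few turns, < 5ms
        vector_l2.search(user_id, query, k=3),   # L2: user-scoped facts, < 100ms
        s3_l3.lookup(user_id, query),            # L3: archival records, > 500ms
        return_exceptions=True,                  # a failed tier must not kill the request
    )
    def ok(result, fallback):
        return fallback if isinstance(result, Exception) else result
    return {"hot": ok(hot, []), "warm": ok(warm, []), "cold": ok(cold, [])}

The return_exceptions flag matters: a slow or unavailable cold tier should degrade the answer, not the availability of the chatbot.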

22.2.3. Memory Pattern 1: The Rolling Window (FIFO)

The simplest form of memory. Logic: Keep the last N interactions. Pros: Cheap, fast, ensures Recency. Cons: Forgets the beginning (Instruction drift).

class RollingWindowMemory:
    def __init__(self, k=5):
        self.history = []
        self.k = k

    def add(self, user, ai):
        self.history.append({"role": "user", "content": user})
        self.history.append({"role": "assistant", "content": ai})
        
        # Prune
        if len(self.history) > self.k * 2:
            self.history = self.history[-self.k * 2:]
            
    def get_context(self):
        return self.history

Production Tip: Never prune the System Prompt. Use [System Prompt] + [Rolling Window].
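
A minimal way to enforce this, building on the RollingWindowMemory above (the system prompt content is a placeholder):

SYSTEM_PROMPT = {"role": "system", "content": "You are a helpful assistant."}

class PinnedWindowMemory(RollingWindowMemory):
    def get_context(self):
        # The system prompt is prepended on every read and is never pruned,
        # because pruning only ever touches self.history.
        return [SYSTEM_PROMPT] + self.history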


22.2.4. Memory Pattern 2: The Conversational Summary

As the conversation gets long, we don’t drop tokens; we Compress them.

Logic: Every 5 turns, run a background LLM call to summarize the new turns and append to a “Running Summary”.

The Prompt:

Current Summary:
The user is asking about AWS EC2 pricing. They are interested in Spot Instances.

New Lines:
User: What about availability?
AI: Spot instances can be reclaimed with 2 min warning.

New Summary:
The user is asking about AWS EC2 pricing, specifically Spot Instances. The AI clarified that Spot instances have a 2-minute reclamation warning.

Implementation:

async def update_summary(current_summary, new_lines):
    prompt = f"Current: {current_summary}\nNew: {new_lines}\nUpdate the summary."
    return await small_llm.predict(prompt)

Pros: Infinite “duration” of memory. Cons: Loss of specific details (names, numbers). Hybrid Approach: Use Summary (for long term) + Rolling Window (for last 2 turns).
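
A sketch of the hybrid approach, reusing update_summary from above (the "system" framing of the running summary is an assumption; adapt it to your prompt format):

class HybridMemory:
    """Sketch: running summary for the long tail + verbatim last 2 turns."""
    def __init__(self, summarize_every=5):
        self.summary = ""
        self.recent = []              # list of {"role", "content"} dicts
        self.turns_since_summary = 0
        self.summarize_every = summarize_every

    async def add(self, user, ai):
        self.recent.append({"role": "user", "content": user})
        self.recent.append({"role": "assistant", "content": ai})
        self.turns_since_summary += 1
        if self.turns_since_summary >= self.summarize_every:
            # Compress everything except the last 2 turns into the summary.
            old, self.recent = self.recent[:-4], self.recent[-4:]
            old_text = "\n".join(f"{m['role']}: {m['content']}" for m in old)
            self.summary = await update_summary(self.summary, old_text)
            self.turns_since_summary = 0

    def get_context(self):
        return [{"role": "system", "content": f"Conversation so far: {self.summary}"}] + self.recent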


22.2.5. Memory Pattern 3: Vector Memory (RAG for Chat)

Store every interaction in a Vector Database. Retrieve top-k relevant past interactions based on the current query.

Scenario:

  • Turn 1: “I own a cat named Luna.”
  • … (100 turns about coding) …
  • Turn 102: “What should I feed my pet?”

Rolling Window: Forgotten. Summary: Might have been compressed to “User has a pet.” Vector Memory:

  • Query: “feed pet”
  • Search: Finds “I own a cat named Luna.”
  • Context: “User has a cat named Luna.”
  • Answer: “Since you have a cat, try wet food.”

Implementation:

import time
import uuid

import chromadb

class VectorMemory:
    def __init__(self):
        self.client = chromadb.Client()
        self.collection = self.client.create_collection("chat_history")
        
    def add(self, text):
        self.collection.add(
            documents=[text],
            metadatas=[{"timestamp": time.time()}],
            ids=[str(uuid.uuid4())]
        )
        
    def query(self, text):
        results = self.collection.query(
            query_texts=[text],
            n_results=3
        )
        return results['documents'][0]

Warning: Vectors capture Semantic Similarity, not Time. If user asks “What is my current plan?”, Vector DB might return “Plan A” (from yesterday) and “Plan B” (from today). You must use Timestamp Filtering or Recency Weighting.
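
Two mitigations, sketched against the Chroma collection above: a hard timestamp filter, and a soft recency-weighted re-rank (the 0.1 weight and the half-life are illustrative, not tuned values):

import time

def query_recent(collection, text, n=3, max_age_s=86400):
    # Hard filter: only consider memories written in the last 24 hours.
    return collection.query(
        query_texts=[text],
        n_results=n,
        where={"timestamp": {"$gt": time.time() - max_age_s}},
    )["documents"][0]

def query_recency_weighted(collection, text, n=3, half_life_s=7 * 86400):
    # Soft approach: over-fetch, then penalize older memories before keeping top-n.
    res = collection.query(query_texts=[text], n_results=n * 4,
                           include=["documents", "metadatas", "distances"])
    now = time.time()
    scored = sorted(
        (dist + 0.1 * (now - meta["timestamp"]) / half_life_s, doc)
        for doc, meta, dist in zip(res["documents"][0], res["metadatas"][0], res["distances"][0])
    )
    return [doc for _, doc in scored[:n]]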


22.2.6. Deep Dive: Redis as a Context Store

In production, local Python lists die with the pod. Redis is the de-facto standard for L1/L2 memory. Use Redis Lists for History and Redis JSON for Profile.

import redis
import json

r = redis.Redis(host='localhost', port=6379, db=0)

def save_turn(session_id, user_msg, ai_msg):
    # Atomic Push
    pipe = r.pipeline()
    pipe.rpush(f"hist:{session_id}", json.dumps({"role": "user", "content": user_msg}))
    pipe.rpush(f"hist:{session_id}", json.dumps({"role": "assistant", "content": ai_msg}))
    
    # TTL Management (Expire after 24h)
    pipe.expire(f"hist:{session_id}", 86400)
    pipe.execute()

def load_context(session_id, limit=10):
    # Fetch last N items
    items = r.lrange(f"hist:{session_id}", -limit, -1)
    return [json.loads(i) for i in items]

Compression at Rest: Redis costs RAM. Storing 1M sessions × 4k tokens × ~50 bytes per token ≈ 200GB. Optimization: compress with zlib before writing to Redis: zlib.compress(json.dumps(...).encode())
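
A sketch of compression at rest, storing the history as one zlib-compressed blob (trades a few milliseconds of CPU per request for a several-fold reduction in RAM on typical JSON chat logs; the key prefix is arbitrary):

import json
import zlib

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def save_context_blob(session_id, history: list):
    blob = zlib.compress(json.dumps(history).encode("utf-8"))
    r.set(f"ctxz:{session_id}", blob, ex=86400)   # same 24h TTL as above

def load_context_blob(session_id) -> list:
    blob = r.get(f"ctxz:{session_id}")
    return json.loads(zlib.decompress(blob).decode("utf-8")) if blob else []

The trade-off versus the list-per-turn layout: you lose the ability to LRANGE just the last few turns, so use this for sessions that are read whole.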


22.2.7. The “Context Broker” Pattern

Don’t let every agent talk to Redis directly. Create a Context Microservice.

API Definition:

  • POST /context/{session_id}/append
  • GET /context/{session_id}?tokens=4000 (Smart Fetch)
  • POST /context/{session_id}/summarize (Trigger background compression)

Smart Fetch Logic: The Broker decides what to return to fit the token limit.

  1. Always return System Prompt (500 tokens).
  2. Return User Profile (Hot Facts) (200 tokens).
  3. Fill remaining space with Rolling Window (Recent History).
  4. If space remains, inject Vector Search results.

This centralized logic prevents “Context Overflow” errors in the agents.


22.2.9. Advanced Pattern: GraphRAG (Knowledge Graph Memory)

Vector databases are great for “fuzzy matching”, but terrible for Reasoning.

  • User: “Who is Alex’s manager?”
  • Vector DB: Returns documents containing “Alex” and “Manager”.
  • Graph DB: Traverses (Alex)-[:REPORTS_TO]->(Manager).

The Graph Memory Architecture: We extract Entities and Relationships from the conversation and store them in Neo4j.

Extraction Logic

PLAN_PROMPT = """
Extract entities and relations from this text.
Output JSON:
[{"head": "Alex", "relation": "HAS_ROLE", "tail": "Engineer"}]
"""

def update_graph(text):
    triples = llm.predict(PLAN_PROMPT, text)
    for t in triples:
        # One statement keeps `a` and `b` bound when the edge is merged;
        # node names go in as parameters to avoid Cypher injection.
        neo4j.run(
            f"MERGE (a:Person {{name: $head}}) "
            f"MERGE (b:Role {{name: $tail}}) "
            f"MERGE (a)-[:{t['relation']}]->(b)",
            head=t["head"], tail=t["tail"],
        )

Retrieval Logic (GraphRAG)

When the user asks a question, we don’t just search vectors. We Traverse.

  1. Extract entities from Query: “Who manages Alex?” -> Alex.
  2. Lookup Alex in Graph.
  3. Expand 1-hop radius. Alex -> HAS_ROLE -> Engineer, Alex -> REPORTS_TO -> Sarah.
  4. Inject these facts into the Context Window.

Upside: Perfect factual consistency. Downside: High write latency (Graph updates are slow).
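
A sketch of the 1-hop expansion described above, assuming the same neo4j handle used in the extraction snippet:

def graph_facts(entity_name: str, limit: int = 20) -> list:
    # Expand a 1-hop neighbourhood around the entity and render each edge
    # as a plain-text fact ready to inject into the context window.
    records = neo4j.run(
        "MATCH (a {name: $name})-[r]-(b) "
        "RETURN a.name AS head, type(r) AS rel, b.name AS tail LIMIT $limit",
        name=entity_name, limit=limit,
    )
    return [f"{rec['head']} {rec['rel']} {rec['tail']}" for rec in records]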


22.2.10. Optimization: Context Caching (KV Cache)

Sending the same 10k tokens of “System Prompt + Company Policies” on every request is wasteful.

  • Cost: You pay for input tokens every time.
  • Latency: The GPU has to re-compute the Key-Value (KV) cache for the prefix.

The Solution: Prompt Caching (e.g., Anthropic system block caching). By marking a block as “ephemeral”, the provider keeps the KV cache warm for 5 minutes.

Calculation of Savings

| Component | Tokens | Hits/Min | Cost (No Cache) | Cost (With Cache) |
|-----------|--------|----------|-----------------|-------------------|
| System Prompt | 5,000 | 100 | $1.50 | $0.15 (Read 1x) |
| User History | 2,000 | 100 | $0.60 | $0.60 (Unique) |
| Total | 7,000 | 100 | $2.10 | $0.75 |

Savings: ~65%.

Implementation Strategy

Structure your prompt so the Static part is always at the top. Any dynamic content (User Name, Current Time) must be moved below the cached block, or you break the cache hash.

  • Bad: System: You are helpful. Current Time: 12:00. (Breaks the cache every minute.)
  • Good: System: You are helpful. (Cache boundary.) Then: User: Current Time is 12:00.
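
A hedged sketch of what this looks like with Anthropic-style cache-control blocks (field names follow the provider's prompt-caching documentation at the time of writing, so verify against current docs; STATIC_POLICY, now, and user_question are placeholders):

import anthropic

client = anthropic.Anthropic()

STATIC_POLICY = "You are a helpful support agent. <several thousand tokens of company policy>"
now, user_question = "12:00", "How do I reset my password?"

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=[{
        "type": "text",
        "text": STATIC_POLICY,                   # static prefix: byte-identical on every call
        "cache_control": {"type": "ephemeral"},  # provider keeps its KV cache warm
    }],
    messages=[
        # Anything dynamic lives below the cached block, so the prefix hash stays stable.
        {"role": "user", "content": f"Current time is {now}. {user_question}"},
    ],
)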


22.2.11. Compression Algorithms: LLMLingua

When you absolutely must fit 20k tokens into a 4k window. LLMLingua (Microsoft) uses a small model (Llama-2-7b) to calculate the Perplexity of each token in the context. It drops tokens with low perplexity (predictable/redundant tokens) and keeps high-perplexity ones (information dense).

from llmlingua import PromptCompressor

compressor = PromptCompressor()
original_context = "..." # 10,000 tokens
compressed = compressor.compress_prompt(
    original_context,
    instruction="Summarize this",
    question="What is the revenue?",
    target_token=2000
)

# Result is "broken English" but highly information dense
# "Revenue Q3 5M. Growth 10%."

Trade-off:

  • Pros: Fits huge context.
  • Cons: The compressed text is hard for humans to debug.
  • Use Case: RAG over financial documents.

22.2.12. Security: PII Redaction in Memory

Your memory system is a Toxic Waste Dump of PII. Emails, Phone Numbers, Credit Cards. If you store them raw in Redis/VectorDB, you violate GDPR/SOC2.

The Redaction Pipeline:

  1. Ingest: User sends message.
  2. Scan: Run Microsoft Presidio (NER model).
  3. Redact: Replace alex@google.com with <EMAIL_1>.
  4. Store: Save the redacted version to Memory.
  5. Map: Store the mapping <EMAIL_1> -> alex@google.com in a specialized Vault (short TTL).

De-Anonymization (at Inference): When the LLM generates “Please email <EMAIL_1>”, the Broker intercepts and swaps it back to the real email only at the wire level (HTTPS response). The LLM never “sees” the real email in its weights.

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Call me at 212-555-1234"
results = analyzer.analyze(text=text, entities=["PHONE_NUMBER"], language='en')
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)

print(anonymized.text) 
# "Call me at <PHONE_NUMBER>"

22.2.13. Deep Dive: Multi-Tenant Vector Isolation

A common outage: “I searched for ‘my contract’ and saw another user’s contract.” If you put all users in one Vector Index, even the nearest neighbor (k=1) can belong to another user.

The Filter Pattern (Weak Isolation):

results = collection.query(
    query_texts=["contract"],
    where={"user_id": "user_123"} # Filtering at query time
)

Risk: If the developer forgets the where clause, data leaks.

The Namespace Pattern (Strong Isolation): Most Vector DBs (Pinecone, Qdrant) support Namespaces.

  • Namespace: user_123
  • Namespace: user_456

The Query API requires a namespace; you literally cannot search “globally”. Recommendation: use Namespaces for B2B SaaS (one tenant per namespace). For B2C (1M+ users), Namespaces might be too expensive (depending on the DB); fall back to Partition Keys.
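
A Pinecone-style sketch of strong isolation (the namespace parameter follows Pinecone's query API; other vector DBs expose collections or tenant IDs under different names):

def search_tenant_docs(index, tenant_id: str, query_vector: list, k: int = 3):
    # The namespace is part of the call signature, so a "global" search
    # cannot be written by accident.
    return index.query(
        vector=query_vector,
        top_k=k,
        namespace=f"tenant_{tenant_id}",
        include_metadata=True,
    )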


22.2.14. Case Study: The “Infinite” Memory Agent (MemGPT Pattern)

How do you chat with an AI for a year? MemGPT (Packer et al., 2023) treats Context like an OS (Operating System).

  • LLM Context Window = RAM (Fast, expensive, volatile).
  • Vector DB / SQL = Hard Drive (Slow, huge, persistent).

The Paging Mechanism: The OS (Agent) must explicitly “Page In” and “Page Out” data. It exposes a set of memory-management functions to the model.

The System Prompt:

You have limited memory.
Function `core_memory_replace(key, value)`: Updates your core personality.
Function `archival_memory_insert(text)`: Saves a fact to long-term storage.
Function `archival_memory_search(query)`: Retrieves facts.

Current Core Memory:
- Name: Alex
- Goal: Learn MLOps

Conversation Flow:

  1. User: “My favorite color is blue.”

  2. Agent Thought: “This is a new fact. I should save it.”

  3. Agent Action: archival_memory_insert("User's favorite color is blue").

  4. Agent Reply: “Noted.”

  5. (6 months later) User: “What should I wear?”

  6. Agent Action: archival_memory_search("favorite color").

  7. Agent Reply: “Wear something blue.”

Key Takeaway for MLOps: You are not just serving a model; you are serving a Virtual OS. You need observability on “Memory I/O” operations.
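
A minimal sketch of the paging tools, reusing the VectorMemory class from 22.2.5 as the “hard drive” (the function names mirror the system prompt above; the dispatch through the model's function-calling layer is omitted):

class MemGPTStyleMemory:
    def __init__(self, archive: "VectorMemory"):
        self.core = {"Name": "Alex", "Goal": "Learn MLOps"}   # pinned into every prompt ("RAM")
        self.archive = archive                                 # long-term store, paged in on demand

    def core_memory_replace(self, key, value):
        self.core[key] = value

    def archival_memory_insert(self, text):
        self.archive.add(text)

    def archival_memory_search(self, query):
        return self.archive.query(query)

    def render_core(self) -> str:
        # Rendered into the system prompt on every request.
        return "\n".join(f"- {k}: {v}" for k, v in self.core.items())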


22.2.15. Code Pattern: The Context Broker Microservice

Stop importing langchain in your backend API. Centralize context logic in a dedicated service.

The API Specification (FastAPI)

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ContextRequest(BaseModel):
    user_id: str
    query: str
    max_tokens: int = 4000

@app.post("/assemble")
def assemble_context(req: ContextRequest):
    # 1. Fetch Parallel (Async)
    # - User Profile (DynamoDB)
    # - Recent History (Redis)
    # - Relevant Docs (Pinecone)
    
    profile, history, docs = fetch_parallel(req.user_id, req.query)
    
    # 2. Token Budgeting
    budget = req.max_tokens - 500  # reserve 500 tokens for the System Prompt
    
    # A. Profile (Critical)
    budget -= count_tokens(profile)
    
    # B. History (Recency)
    # Take as much history as possible, leaving 1000 for Docs
    history_budget = max(0, budget - 1000)
    trimmed_history = trim_history(history, history_budget)
    
    # C. Docs (Relevance)
    docs_budget = budget - count_tokens(trimmed_history)
    trimmed_docs = select_best_docs(docs, docs_budget)
    
    return {
        "system_prompt": "...",
        "profile": profile,
        "history": trimmed_history,
        "knowledge": trimmed_docs,
        "debug_info": {
            "tokens_used": req.max_tokens - docs_budget,
            "docs_dropped": len(docs) - len(trimmed_docs)
        }
    }

Why this is crucial:

  • Consistency: Every agent gets the same context logic.
  • Auditability: You can log exactly what context was fed to the model (The debug_info).
  • Optimization: You can tune the ranking algorithm in one place.

22.2.16. Anti-Pattern: The “Session Leak”

Scenario:

  1. You use a global variable history = [] in your Python server.
  2. Request A comes in. history.append(A).
  3. Request B comes in (different user). history.append(B).
  4. Response to B includes A’s data.

The Fix: Stateless Services. Never use global variables for state. Always fetch state from Redis using session_id as the key. Local variables only.
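
A sketch of the stateless shape, reusing save_turn and load_context from 22.2.6 (the llm.generate call is a placeholder for your model client):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    session_id: str
    message: str

@app.post("/chat")
def chat(req: ChatRequest):
    # All state is fetched from Redis by session_id; nothing survives in process memory.
    history = load_context(req.session_id, limit=10)
    reply = llm.generate(history + [{"role": "user", "content": req.message}])
    save_turn(req.session_id, req.message, reply)
    return {"reply": reply}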


22.2.17. Benchmarking RAG Latency

Retrieving context takes time. Is GraphRAG worth the wait?

| Method | Retrieval Latency (P99) | Accuracy (Recall@5) | Use Case |
|--------|-------------------------|---------------------|----------|
| Redis (Last N) | 5ms | 10% | Chit-chat |
| Vector (Dense) | 100ms | 60% | Q&A |
| Hybrid (Sparse+Dense) | 150ms | 70% | Domain Search |
| Graph Traversal | 800ms | 90% | Complex Reasoning |
| Agentic Search (Google) | 3000ms | 95% | Current Events |

Ops decision: set a Time Budget. “We have 500ms for Context Assembly.” This rules out Agentic Search and complex Graph traversals for real-time chat.
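
One way to enforce that budget, sketched with per-source timeouts (the retriever coroutines and the specific millisecond splits are illustrative):

import asyncio

async def with_deadline(coro, timeout_s: float, fallback):
    # Any retriever that blows its share of the budget is dropped, not awaited.
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback

async def assemble_within_500ms(user_id: str, query: str):
    history, docs = await asyncio.gather(
        with_deadline(fetch_redis_history(user_id), 0.050, []),       # 50ms slice
        with_deadline(fetch_vector_docs(user_id, query), 0.400, []),  # 400ms slice
    )
    return history, docs   # graph / agentic search stay out of the real-time hot path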


22.2.18. Does Long Context Kill RAG?

Google Gemini 1.5 Pro offers a 1M-token window (with research demos up to 10M). Does this kill RAG? No.

  1. Latency: Processing a 1M-token prompt takes on the order of 60 seconds before the first token appears (Time to First Token).
  2. Cost: Inputting 10 books ($50) for every question is ruinously expensive.
  3. Accuracy: “Lost in the Middle” still exists, just at a larger scale.

The Hybrid Future:

  • Use RAG to find the relevant 100k tokens.
  • Use Long Context to reason over those 100k tokens.

RAG becomes “Coarse Grain” filtering. Long Context becomes “Fine Grain” reasoning.

22.2.19. Reference: The Universal Context Schema

Standardize how you pass context between services.

{
  "$schema": "http://mlops-book.com/schemas/context-v1.json",
  "meta": {
    "session_id": "sess_123",
    "user_id": "u_999",
    "timestamp": 1698000000,
    "strategy": "hybrid"
  },
  "token_budget": {
    "limit": 4096,
    "used": 3500,
    "remaining": 596
  },
  "layers": [
    {
      "name": "system_instructions",
      "priority": "critical",
      "content": "You are a helpful assistant...",
      "tokens": 500,
      "source": "config_v2"
    },
    {
      "name": "user_profile",
      "priority": "high",
      "content": "User is a Premium subscriber. Location: NY.",
      "tokens": 150,
      "source": "dynamodb"
    },
    {
      "name": "long_term_memory",
      "priority": "medium",
      "content": "User previously asked about: Python, AWS.",
      "tokens": 300,
      "source": "vector_db"
    },
    {
      "name": "conversation_history",
      "priority": "low",
      "content": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"}
      ],
      "tokens": 50,
      "source": "redis"
    }
  ],
  "dropped_items": [
    {
      "reason": "budget_exceeded",
      "source": "vector_db_result_4",
      "tokens": 400
    }
  ]
}

Ops Tip: Log this object to S3 for every request. If a user complains “The AI forgot my name”, you can check dropped_items.
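
A sketch of the audit write with boto3 (the bucket name and key layout are illustrative; partitioning by date keeps later Athena queries cheap):

import json
import time

import boto3

s3 = boto3.client("s3")

def log_context_snapshot(context_obj: dict, bucket: str = "context-audit-logs"):
    meta = context_obj["meta"]
    key = (f"context/{time.strftime('%Y/%m/%d')}/"
           f"{meta['session_id']}-{meta['timestamp']}.json")
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(context_obj).encode("utf-8"))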


22.2.20. Deep Dive: The Physics of Attention (Why Context is Expensive)

Why can’t we just have infinite context? It’s not just RAM. It’s Compute. Attention is $O(N^2)$. If you double context length, compute cost quadruples.

The Matrix Math: For every token generated, the model must attend to every previous token.

| Context Length | Attention Operations (∝ N²) | Relative Slowdown |
|----------------|------------------------------|-------------------|
| 4k | $1.6 \times 10^7$ | 1x |
| 32k | $1.0 \times 10^9$ | 64x |
| 128k | $1.6 \times 10^{10}$ | 1024x |

Flash Attention (Dao et al.) reduces the memory traffic, but the quadratic compute remains. “Context Stuffing” is computationally irresponsible. Only retrieve what you need.


22.2.21. Design Pattern: The Semantic Router Cache

Combine Routing + Caching to save context lookups.

Logic:

  1. User: “How do I reset my password?”
  2. Embed query -> [0.1, 0.9, ...]
  3. Check Semantic Cache (Redis VSS).
    • If similar query found (“Change password?”), return cached response.
    • Optimization: You don’t even need to fetch the Context Profile/History if the answer is generic.
  4. If not found: Fetch Context -> LLM -> Cache Response.

def robust_entry_point(query, user_id):
    # 1. Fast Path (No Context Needed)
    if semantic_cache.hit(query):
        return semantic_cache.get(query)

    # 2. Slow Path (Context Needed)
    context = context_broker.assemble(user_id, query)
    response = llm.generate(context, query)
    
    # 3. Cache Decision
    if is_generic_answer(response):
        semantic_cache.set(query, response)
        
    return response

22.2.22. Anti-Pattern: The Token Hoarder

Scenario: “I’ll just put the entire 50-page PDF in the context, just in case.”

Consequences:

  1. Distraction: The model attends to irrelevant footnotes instead of the user’s question.
  2. Cost: $0.01 per request becomes $0.50 per request.
  3. Latency: TTFT jumps from 500ms to 5s.

The Fix: Chunking. Split the PDF into 500-token chunks. Retrieve top-3 chunks. Context size: 1500 tokens. Result: Faster, cheaper, more accurate.
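
A sketch of the chunking step (whitespace splitting is a rough token proxy; swap in a real tokenizer such as tiktoken for exact budgets):

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    # Sliding window over words; the overlap keeps sentences that straddle
    # a boundary retrievable from at least one chunk.
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]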


22.2.23. The Context Manifesto

  1. Context is a Resource, not a Right. Budget it like money.
  2. Flat Recall is a Lie. The middle is lost. Structure context carefully.
  3. Static First. Put cached system prompts at the top.
  4. Metadata Matters. Inject timestamps and source IDs.
  5. Forget Gracefully. Summarize old turns; don’t just truncate them.

22.2.24. Deep Dive: KV Cache Eviction Policies

When the GPU memory fills up, which KV blocks do you evict? This is the “LRU vs LFU” problem of LLMs.

Strategies:

  1. FIFO (First In First Out): Drop the oldest turns. Bad for “First Instruction”.
  2. H2O (Heavy Hitters Oracle): Keep tokens that have high Attention Scores.
    • If a token (like “Not”) has high attention mass, keep it even if it’s old.
  3. StreamingLLM: Keep the “Attention Sink” (first 4 tokens) + Rolling Window.
    • Surprisingly, keeping the first 4 tokens stabilizes the attention mechanism.

Production Setting: Most Inference Servers (vLLM, TGI) handle this automatically with PagedAttention. Your job is just to monitor gpu_cache_usage_percent.


22.2.25. Implementation: Session Replay for Debugging

“Why did the bot say that?” To answer this, you need Time Travel. You need to see the context exactly as it was at T=10:00.

Event Sourcing Architecture: Don’t just store the current state. Store the Delta.

CREATE TABLE context_events (
    event_id UUID,
    session_id UUID,
    timestamp TIMESTAMP,
    event_type VARCHAR, -- 'APPEND', 'PRUNE', 'SUMMARIZE'
    payload JSONB
);

Replay Logic:

def replay_context(session_id, target_time):
    events = fetch_events(session_id, end_time=target_time)
    state = []
    
    for event in events:
        if event.type == 'APPEND':
            state.append(event.payload)
        elif event.type == 'PRUNE':
            state = state[-event.payload['keep']:]
    
    return state

This allows you to reproduce “Hallucinations due to Context Pruning”.


22.2.26. Code Pattern: The PII Guard Library

Don’t rely on the LLM to redact itself. It will fail. Use a regex-heavy Python class before storage.

import re

class PIIGuard:
    def __init__(self):
        self.patterns = {
            "EMAIL": r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+",
            "SSN": r"\d{3}-\d{2}-\d{4}",
            "CREDIT_CARD": r"\d{4}-\d{4}-\d{4}-\d{4}"
        }
        
    def scrub(self, text):
        redaction_map = {}
        scrubbed_text = text
        
        for p_type, regex in self.patterns.items():
            matches = re.finditer(regex, text)
            for i, m in enumerate(matches):
                val = m.group()
                placeholder = f"<{p_type}_{i}>"
                scrubbed_text = scrubbed_text.replace(val, placeholder)
                redaction_map[placeholder] = val
                
        return scrubbed_text, redaction_map

    def restore(self, text, redaction_map):
        for placeholder, val in redaction_map.items():
            text = text.replace(placeholder, val)
        return text

Unit Test: Input: “My email is alex@gmail.com”. Stored: “My email is <EMAIL_0>”. LLM Output: “Sending to <EMAIL_0>”. Restored: “Sending to alex@gmail.com”. Zero Leakage.
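
That round trip is cheap to pin down in a unit test (pytest-style):

def test_pii_guard_round_trip():
    guard = PIIGuard()
    scrubbed, mapping = guard.scrub("My email is alex@gmail.com")
    assert scrubbed == "My email is <EMAIL_0>"        # nothing raw reaches storage
    assert "alex@gmail.com" not in scrubbed

    # The model only ever sees and emits the placeholder...
    llm_output = "Sending to <EMAIL_0>"
    # ...and the real value is swapped back at the edge.
    assert guard.restore(llm_output, mapping) == "Sending to alex@gmail.com"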


22.2.27. Reference: Atomic Context Updates (Redis Lua)

When two requests for the same session (or two users sharing one bot context) arrive in parallel, and each does a read-modify-write (read the history, append locally, write it back), you get Race Conditions.

  • Request A reads the context.
  • Request B reads the context.
  • Request A writes its appended copy back.
  • Request B writes its appended copy back, overwriting A’s message.

Solution: Redis Lua Scripting (Atomic).

-- append_context.lua
local key = KEYS[1]
local new_msg = ARGV[1]
local max_len = tonumber(ARGV[2])

-- Append
redis.call("RPUSH", key, new_msg)

-- Check Length
local current_len = redis.call("LLEN", key)

-- Trim if needed (FIFO)
if current_len > max_len then
    redis.call("LPOP", key)
end

return current_len

Python Call:

script = r.register_script(lua_code)
script(keys=["hist:sess_1"], args=[json.dumps(msg), 10])

This guarantees consistency even at 1000 requests/sec.


22.2.28. Case Study: The Healthcare “Long Context” Bot

The Challenge: A Hospital wants a chatbot for doctors to query patient history.

  • Patient history = 500 PDF pages (Charts, Labs, Notes).
  • Privacy = HIPAA (No data leaks).
  • Accuracy = Life or Death (No hallucinations).

The Architecture:

  1. Ingest (The Shredder):

    • PDFs are OCR’d.
    • PII Scrubbing: Patient names are replaced with PATIENT_ID; doctor names with DOCTOR_ID.
    • Storage: Original PDF in Vault (S3 Standard-IA). Scrubbed Text in Vector DB.
  2. Context Assembly (The Hybrid Fetch):

    • Query: “Has the patient ever taken Beta Blockers?”
    • Vector Search: Finds “Metoprolol prescribed 2022”.
    • Graph Search: (Metoprolol)-[:IS_A]->(Beta Blocker).
    • Context Window: 8k tokens.
  3. The “Safety Sandwich”:

    • Pre-Prompt: “You are a medical assistant. Only use the provided context. If unsure, say ‘I don’t know’.”
    • Context: [The retrieved labs]
    • Post-Prompt: “Check your answer against the context. List sources.”
  4. Audit Trail:

    • Every retrieval is logged: “Dr. Smith accessed Lab Report 456 via query ‘Beta Blockers’.”
    • This log is immutable (S3 Object Lock).

Result:

  • Hallucinations dropped from 15% to 1%.
  • Doctors save 20 minutes of preparation per patient.

22.2.8. Summary Checklist

To manage context effectively:

  • Tier Your Storage: Redis for fast access, Vector DB for recall, S3 for logs.
  • Don’t Overstuff: Respect the “Lost in the Middle” phenomenon.
  • Summarize in Background: Don’t make the user wait for summarization.
  • Use a Broker: Centralize context assembly logic.
  • Handle Privacy: PII in context must be redacted or encrypted (Redis does not encrypt by default).
  • Use GraphRAG: For entity-heavy domains (Legal/Medical).
  • Cache Prefixes: Optimize the System Prompt order to leverage KV caching.
  • Budget Tokens: Implement strict token budgeting in a middleware layer.
  • Monitor Leaks: Ensure session isolation in multi-tenant environments.
  • Use Lua: For atomic updates to shared context.
  • Replay Events: Store context deltas for debugging.
  • Audit Retrieval: Log exactly which documents were used for an answer.

22.2.29. Glossary of Terms

  • Context Window: The maximum number of tokens a model can process (e.g., 128k).
  • FIFO Buffer: First-In-First-Out memory (Rolling Window).
  • RAG (Retrieval Augmented Generation): Boosting context with external data.
  • GraphRAG: Boosting context with Knowledge Graph traversals.
  • Session Leak: Accidentally sharing context between two users.
  • Lost in the Middle: The tendency of LLMs to ignore information in the middle of a long prompt.
  • Token Budgeting: A hard limit on how many tokens each component (Profile, History, Docs) can consume.
  • KV Cache: Key-Value cache in the GPU, used to speed up generation by not re-computing the prefix.
  • Ephemeral Context: Context that lives only for the duration of the request (L1).

22.2.30. Anti-Pattern: The Recency Bias Trap

Scenario: You only feed the model the last 5 turns. User: “I agree.” Model: “Great.” User: “Let’s do it.” Model: “Do what?”

Cause: The “Goal” (defined 20 turns ago) fell out of the Rolling Window. Fix: The Goal must be pinned to the System Prompt layer, not the History layer. It must persist even if the chit-chat history is pruned.


22.2.31. Final Thought: Context as Capital

In the AI Economy, Proprietary Context is your moat. Everyone has GPT-4. Only you have the user’s purchase history, preference graph, and past conversations. Manage this asset with the same rigor you manage your database. Zero leaks. Fast access. High fidelity.