16.3 Caching: Semantic Caching for LLMs and Beyond
16.3.1 Introduction: The Economics of Inference Caching
The fastest inference is the one you don’t have to run. The cheapest GPU is the one you don’t have to provision. Caching is the ultimate optimization—it turns $0.01 inference costs into $0.0001 cache lookups, a 100x reduction.
The ROI of Caching
Consider a customer support chatbot serving 1 million queries per month:
Without Caching:
- Model: GPT-4 class (via API or self-hosted)
- Cost: $0.03 per 1k tokens (input) + $0.06 per 1k tokens (output)
- Average query: 100 input tokens, 200 output tokens
- Monthly cost: 1M × ($0.003 + $0.012) = $15,000
With 60% Cache Hit Rate:
- Cache hits: 600K × $0.00001 (Redis lookup) = $6
- Cache misses: 400K × $0.015 = $6,000
- Total: $6,006 (60% reduction)
With 80% Cache Hit Rate (achievable for FAQs):
- Cache hits: 800K × $0.00001 = $8
- Cache misses: 200K × $0.015 = $3,000
- Total: $3,008 (80% reduction)
The ROI is astronomical, especially for conversational AI where users ask variations of the same questions.
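As a sanity check, the break-even arithmetic above folds into a small cost model. The sketch below uses the illustrative per-request prices from the example ($0.015 per LLM call, $0.00001 per cache lookup), not published rates:
def monthly_cost(requests: int, hit_rate: float,
                 llm_cost_per_request: float = 0.015,
                 cache_cost_per_request: float = 0.00001) -> float:
    """Expected monthly spend for a given cache hit rate."""
    hits = requests * hit_rate
    misses = requests * (1 - hit_rate)
    return hits * cache_cost_per_request + misses * llm_cost_per_request

baseline = monthly_cost(1_000_000, hit_rate=0.0)
for hr in (0.6, 0.8):
    cost = monthly_cost(1_000_000, hit_rate=hr)
    print(f"hit rate {hr:.0%}: ${cost:,.0f} ({1 - cost / baseline:.0%} saved)")
# hit rate 60%: $6,006 (60% saved)
# hit rate 80%: $3,008 (80% saved)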
16.3.2 Caching Paradigms: Exact vs. Semantic
Traditional web caching is exact match: cache GET /api/users/123 and serve identical responses for identical URLs. ML inference requires a paradigm shift.
The Problem with Exact Match for NLP
Query 1: "How do I reset my password?"
Query 2: "How can I reset my password?"
Query 3: "password reset instructions"
These are semantically identical but lexically different. Exact match caching treats them as three separate queries, wasting 2 LLM calls.
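To make the mismatch concrete: hashing the raw text yields three unrelated cache keys, so an exact-match cache can never connect them.
import hashlib

queries = [
    "How do I reset my password?",
    "How can I reset my password?",
    "password reset instructions",
]
for q in queries:
    # Three semantically identical queries -> three unrelated exact-match keys
    print(hashlib.sha256(q.encode()).hexdigest()[:12], q)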
Semantic Caching
Approach: Embed the query into a vector space, then search for semantically similar cached queries.
graph LR
Query[User Query:<br/>"How to reset password?"]
Embed[Embedding Model<br/>all-MiniLM-L6-v2]
Vector[Vector: [0.12, -0.45, ...]]
VectorDB[(Vector DB<br/>Redis/Qdrant)]
Query-->Embed
Embed-->Vector
Vector-->|Similarity Search|VectorDB
VectorDB-->|Sim > 0.95?|Decision{Hit?}
Decision-->|Yes|CachedResponse[Return Cached]
Decision-->|No|LLM[Call LLM]
LLM-->|Store|VectorDB
Algorithm:
1. Embed the incoming query using a fast local model (e.g., sentence-transformers/all-MiniLM-L6-v2).
2. Search the vector database for the top-k most similar cached queries.
3. Threshold: if max_similarity > 0.95, return the cached response.
4. Miss: call the LLM and store the (query_embedding, response) pair.
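Before reaching for a dedicated library, the flow can be sketched in a few dozen lines with sentence-transformers and an in-memory index. The SemanticCache class and the 0.95 threshold below are illustrative, not a production design:
import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticCache:
    """Minimal in-memory semantic cache; a real system would use Redis/Qdrant."""
    def __init__(self, threshold: float = 0.95):
        self.model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
        self.threshold = threshold
        self.embeddings = []  # unit-norm query vectors
        self.responses = []   # parallel list of cached responses

    def _embed(self, text: str) -> np.ndarray:
        vec = self.model.encode(text)
        return vec / np.linalg.norm(vec)  # normalize so dot product = cosine similarity

    def get(self, query: str):
        if not self.embeddings:
            return None
        q = self._embed(query)
        sims = np.stack(self.embeddings) @ q  # cosine similarity against all cached queries
        best = int(np.argmax(sims))
        if sims[best] >= self.threshold:
            return self.responses[best]  # semantic hit
        return None  # miss

    def put(self, query: str, response: str):
        self.embeddings.append(self._embed(query))
        self.responses.append(response)

sem_cache = SemanticCache()
sem_cache.put("How do I reset my password?", "Go to Settings > Security > Reset Password.")
print(sem_cache.get("How can I reset my password?"))  # likely a hit
print(sem_cache.get("What is the refund policy?"))    # miss -> None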
16.3.3 Implementation: GPTCache
GPTCache is a widely adopted open-source library for semantic caching.
Installation
pip install gptcache
pip install gptcache[onnx] # For local embedding
pip install redis
Basic Setup
from gptcache import cache, Config
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
# 1. Configure embedding model (runs locally)
onnx_embedding = Onnx() # Uses all-MiniLM-L6-v2 by default
# 2. Configure vector store (Redis)
redis_vector = VectorBase(
"redis",
host="localhost",
port=6379,
dimension=onnx_embedding.dimension, # 384 for MiniLM
collection="llm_cache"
)
# 3. Configure metadata store (SQLite for development, Postgres for production)
data_manager = get_data_manager(
data_base=CacheBase("sqlite"), # Stores query text and response text
vector_base=redis_vector
)
# 4. Initialize cache
cache.init(
    embedding_func=onnx_embedding.to_embeddings,  # default pre-processing extracts the prompt text
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    # Tuning: require 95% similarity for a hit
    config=Config(similarity_threshold=0.95)
)
cache.set_openai_key()  # reads OPENAI_API_KEY from the environment
# 5. Use OpenAI adapter (caching layer)
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "user", "content": "How do I reset my password?"}
]
)
print(response.choices[0].message.content)
print(f"Cache hit: {response.get('gptcache', False)}")
On the second call with a similar query:
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "user", "content": "password reset instructions"}
]
)
# This will be a cache hit if similarity > 0.95
print(f"Cache hit: {response.get('gptcache', True)}") # Likely True
Advanced Configuration
Production Setup (PostgreSQL + Redis):
from gptcache.manager import get_data_manager, CacheBase, VectorBase
import os
data_manager = get_data_manager(
    data_base=CacheBase(
        "postgresql",
        sql_url=os.environ['DATABASE_URL']  # postgres://user:pass@host:5432/dbname
    ),
    vector_base=VectorBase(
        "redis",
        host=os.environ['REDIS_HOST'],
        port=6379,
        password=os.environ.get('REDIS_PASSWORD'),
        dimension=384,
        collection="llm_cache_prod",
        top_k=5  # consider the top 5 similar queries per lookup
    ),
    # Capacity tuning
    max_size=1000000,  # max cache entries
    eviction="LRU"     # Least Recently Used eviction
)
cache.init(
    embedding_func=onnx_embedding.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    config=Config(similarity_threshold=0.95)
)
16.3.4 Architecture: Multi-Tier Caching
For high-scale systems (millions of users), a single Redis instance isn’t enough. Implement a tiered strategy.
The L1/L2/L3 Pattern
graph TD
Request[User Request]
L1[L1: In-Memory LRU<br/>Python functools.lru_cache<br/>Latency: 0.001ms]
L2[L2: Redis Cluster<br/>Distributed Cache<br/>Latency: 5-20ms]
L3[L3: S3/GCS<br/>Large Artifacts<br/>Latency: 200ms]
LLM[LLM API<br/>Latency: 2000ms]
Request-->|Check|L1
L1-->|Miss|L2
L2-->|Miss|L3
L3-->|Miss|LLM
LLM-->|Store|L3
L3-->|Store|L2
L2-->|Store|L1
Implementation:
from collections import OrderedDict
import hashlib

import boto3
import redis

# L1: in-process LRU cache (per container/pod).
# functools.lru_cache memoizes a function's own return value, so it cannot be
# populated from outside; a small OrderedDict-based LRU does the job instead.
L1_MAX_SIZE = 1000
l1_cache = OrderedDict()

def l1_get(query_hash):
    if query_hash in l1_cache:
        l1_cache.move_to_end(query_hash)  # mark as recently used
        return l1_cache[query_hash]
    return None

def l1_put(query_hash, response):
    l1_cache[query_hash] = response
    l1_cache.move_to_end(query_hash)
    if len(l1_cache) > L1_MAX_SIZE:
        l1_cache.popitem(last=False)  # evict the least recently used entry

# L2: Redis-backed vector index; L3: S3 for large artifacts
redis_client = redis.Redis(host='redis-cluster', port=6379)
s3_client = boto3.client('s3')

def get_cached_response(query: str, embedding_func) -> str:
    # Deterministic key for the exact query text
    query_hash = hashlib.sha256(query.encode()).hexdigest()

    # L1 check (in-process, microseconds)
    result = l1_get(query_hash)
    if result is not None:
        print("L1 HIT")
        return result

    # L2 check (semantic vector search; vector_db is the store configured earlier)
    embedding = embedding_func(query)
    similar_queries = vector_db.search(embedding, top_k=1)
    if similar_queries and similar_queries[0]['score'] > 0.95:
        print("L2 HIT")
        cached_response = similar_queries[0]['response']
        l1_put(query_hash, cached_response)  # populate L1
        return cached_response

    # L3 check (for large responses, e.g., generated images)
    s3_key = f"responses/{query_hash}"
    try:
        obj = s3_client.get_object(Bucket='llm-cache', Key=s3_key)
        response = obj['Body'].read().decode()
        print("L3 HIT")
        return response
    except s3_client.exceptions.NoSuchKey:
        pass

    # Cache miss: call the LLM
    print("CACHE MISS")
    response = call_llm(query)

    # Store in all tiers
    l1_put(query_hash, response)
    vector_db.insert(embedding, response)
    s3_client.put_object(Bucket='llm-cache', Key=s3_key, Body=response)
    return response
16.3.5 Exact Match Caching for Deterministic Workloads
For non-LLM workloads (image generation, video processing), exact match caching is sufficient and simpler.
Use Case: Stable Diffusion Image Generation
If a user requests:
prompt="A sunset on Mars"
seed=42
steps=50
guidance_scale=7.5
The output is deterministic (given the same hardware/drivers). Re-generating it is wasteful.
Implementation with Redis:
import hashlib
import json
import redis
redis_client = redis.Redis(host='localhost', port=6379)
def cache_key(prompt: str, seed: int, steps: int, guidance_scale: float) -> str:
"""
Generate a deterministic cache key.
"""
payload = {
"prompt": prompt,
"seed": seed,
"steps": steps,
"guidance_scale": guidance_scale
}
# Sort keys to ensure {"a":1,"b":2} == {"b":2,"a":1}
canonical_json = json.dumps(payload, sort_keys=True)
return hashlib.sha256(canonical_json.encode()).hexdigest()
def generate_image(prompt: str, seed: int = 42, steps: int = 50, guidance_scale: float = 7.5):
key = cache_key(prompt, seed, steps, guidance_scale)
# Check cache
cached_image = redis_client.get(key)
if cached_image:
print("CACHE HIT")
return cached_image # Returns bytes (PNG)
# Cache miss: generate
print("CACHE MISS - Generating...")
image_bytes = run_stable_diffusion(prompt, seed, steps, guidance_scale)
# Store with 7-day TTL
redis_client.setex(key, 604800, image_bytes)
return image_bytes
# Usage
image1 = generate_image("A sunset on Mars", seed=42) # MISS
image2 = generate_image("A sunset on Mars", seed=42) # HIT (instant)
Cache Eviction Policies
Redis supports multiple eviction policies:
- noeviction: Return error when max memory is reached (not recommended)
- allkeys-lru: Evict least recently used keys (most common)
- volatile-lru: Evict least recently used keys with TTL set
- allkeys-lfu: Evict least frequently used keys (better for hot/cold data)
Configuration (redis.conf):
maxmemory 10gb
maxmemory-policy allkeys-lru
16.3.6 Cache Invalidation: The Hard Problem
“There are only two hard things in Computer Science: cache invalidation and naming things.” – Phil Karlton
Problem 1: Model Updates
You deploy model-v2 which generates different responses. Cached responses from model-v1 are now stale.
Solution: Version Namespacing
def cache_key(query: str, model_version: str) -> str:
payload = {
"query": query,
"model_version": model_version
}
return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
# When calling the model
response = get_cached_response(query, model_version="v2.3.1")
When you deploy v2.3.2, the cache key changes, so old responses aren’t served.
Trade-off: You lose the cache on every deployment. For frequently updated models, this defeats the purpose.
Alternative: Dual Write
During a migration period:
- Read from both the v1 and v2 caches.
- Write to the v2 cache only.
- After 7 days (a typical cache TTL), all v1 entries expire naturally.
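A minimal sketch of the dual-read, single-write pattern, reusing the version-namespaced cache_key helper above and assuming the redis_client and call_llm helpers used elsewhere in this section (the version strings are illustrative):
def get_response_dual_read(query: str) -> str:
    # Read: prefer the new namespace, fall back to the old one during migration
    for version in ("v2.3.2", "v2.3.1"):
        cached = redis_client.get(cache_key(query, model_version=version))
        if cached:
            return cached.decode()

    # Miss: call the model and write only to the new namespace (7-day TTL)
    response = call_llm(query)
    redis_client.setex(cache_key(query, model_version="v2.3.2"), 7 * 86400, response)
    return response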
Problem 2: Fact Freshness (RAG Systems)
A RAG (Retrieval-Augmented Generation) system answers questions based on a knowledge base.
Scenario:
- User asks: “What is our Q3 revenue?”
- Document financial-report-q3.pdf is indexed.
- LLM response is cached.
- Document is updated (revised earnings).
- Cached response is now stale.
Solution 1: TTL (Time To Live)
Set a short TTL on cache entries for time-sensitive topics.
redis_client.setex(
    key,
    86400,  # TTL: 24 hours
    response
)
Solution 2: Document-Based Invalidation
Tag cache entries with the document IDs they reference.
# When caching
query_hash = hashlib.sha256("What is our Q3 revenue?".encode()).hexdigest()
cache_entry = {
    "query": "What is our Q3 revenue?",
    "response": "Our Q3 revenue was $100M",
    # Redis hash values must be scalars, so serialize the list
    "document_ids": json.dumps(["financial-report-q3.pdf"])
}
redis_client.hset(f"cache:{query_hash}", mapping=cache_entry)
redis_client.sadd("doc_index:financial-report-q3.pdf", query_hash)
# When document is updated
def invalidate_document(document_id: str):
# Find all cache entries referencing this document
query_hashes = redis_client.smembers(f"doc_index:{document_id}")
# Delete them
for qh in query_hashes:
redis_client.delete(f"cache:{qh.decode()}")
# Clear the index
redis_client.delete(f"doc_index:{document_id}")
16.3.7 Monitoring Cache Performance
Key Metrics
- Hit Rate: hit_rate = cache_hits / (cache_hits + cache_misses). Target: > 60% for general chatbots, > 80% for FAQ bots.
- Latency Reduction: avg_latency_with_cache = (hit_rate × cache_latency) + ((1 - hit_rate) × llm_latency). Example:
  - Cache latency: 10ms
  - LLM latency: 2000ms
  - Hit rate: 70%
  - avg_latency = (0.7 × 10) + (0.3 × 2000) = 7 + 600 = 607ms, vs. 2000ms without a cache (3.3x faster)
- Cost Savings: monthly_savings = (cache_hits × llm_cost_per_request) - (cache_hits × cache_cost_per_request)
Instrumentation
import time
from prometheus_client import Counter, Histogram
cache_hits = Counter('cache_hits_total', 'Total cache hits')
cache_misses = Counter('cache_misses_total', 'Total cache misses')
cache_latency = Histogram('cache_lookup_latency_seconds', 'Cache lookup latency')
llm_latency = Histogram('llm_call_latency_seconds', 'LLM call latency')
def get_response(query: str):
start = time.time()
# Check cache
cached = redis_client.get(query)
if cached:
cache_hits.inc()
cache_latency.observe(time.time() - start)
return cached.decode()
cache_misses.inc()
# Call LLM
llm_start = time.time()
response = call_llm(query)
llm_latency.observe(time.time() - llm_start)
# Store in cache
redis_client.setex(query, 3600, response)
return response
Grafana Dashboard Queries:
# Hit rate
rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))
# Average latency
(rate(cache_lookup_latency_seconds_sum[5m]) + rate(llm_call_latency_seconds_sum[5m])) /
(rate(cache_lookup_latency_seconds_count[5m]) + rate(llm_call_latency_seconds_count[5m]))
16.3.8 Advanced: Proactive Caching
Instead of waiting for a cache miss, predict what users will ask and pre-warm the cache.
Use Case: Documentation Chatbot
Analyze historical queries:
Top 10 queries:
1. "How do I install the SDK?" (452 hits)
2. "What is the API rate limit?" (389 hits)
3. "How to authenticate?" (301 hits)
...
Pre-warm strategy:
import schedule
def prewarm_cache():
"""
Run nightly to refresh top queries.
"""
top_queries = get_top_queries_from_analytics(limit=100)
for query in top_queries:
# Check if cached
embedding = embed(query)
cached = vector_db.search(embedding, top_k=1)
if not cached or cached[0]['score'] < 0.95:
# Generate and store
response = call_llm(query)
vector_db.insert(embedding, response)
print(f"Pre-warmed: {query}")
# Schedule for 2 AM daily
schedule.every().day.at("02:00").do(prewarm_cache)
16.3.9 Security Considerations
Cache Poisoning
An attacker could pollute the cache with malicious responses.
Attack:
- Attacker submits: “What is the admin password?”
- Cache stores: “The admin password is hunter2”
- Legitimate user asks the same question → gets the poisoned response.
Mitigation:
- Input Validation: Reject queries with suspicious patterns.
- Rate Limiting: Limit cache writes per user/IP.
- TTL: Short TTL limits the damage window.
- Audit Logging: Log all cache writes with user context.
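As one concrete mitigation, cache writes can be rate-limited per user with a Redis counter. The allow_cache_write helper and the 20-writes-per-hour budget below are illustrative, and the snippet assumes the redis_client, query_hash, and response variables from the surrounding examples:
def allow_cache_write(user_id: str, limit: int = 20, window_seconds: int = 3600) -> bool:
    """Permit at most `limit` cache writes per user per time window."""
    budget_key = f"cache_write_budget:{user_id}"
    count = redis_client.incr(budget_key)  # atomic increment
    if count == 1:
        redis_client.expire(budget_key, window_seconds)  # start the window on the first write
    return count <= limit

# Only persist the LLM response if the user is within budget
if allow_cache_write(user_id):
    redis_client.setex(query_hash, 3600, response)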
PII (Personally Identifiable Information)
Cached responses may contain sensitive data.
Example:
Query: "What is my account balance?"
Response: "Your account balance is $5,234.12" (cached)
If cache is shared across users, this leaks data!
Solution: User-Scoped Caching
def cache_key(query: str, user_id: str) -> str:
payload = {"query": query, "user_id": user_id}
return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
This ensures User A’s cached response is never served to User B.
16.3.10 Case Study: Hugging Face’s Inference API
Hugging Face serves millions of inference requests daily. Their caching strategy:
- Model-Level Caching: For identical inputs to the same model, serve cached outputs.
- Embedding Similarity: For text-generation tasks, use semantic similarity (threshold: 0.98).
- Regional Caches: Deploy Redis clusters in us-east-1, eu-west-1, ap-southeast-1 for low latency.
- Tiered Storage: Hot cache (Redis, 1M entries) → Warm cache (S3, 100M entries).
Results:
- 73% hit rate on average.
- P50 latency reduced from 1200ms to 45ms.
- Estimated $500k/month savings in compute costs.
16.3.11 Conclusion
Caching is the highest-ROI optimization in ML inference. It requires upfront engineering effort—embedding models, vector databases, invalidation logic—but the returns are extraordinary:
- 10-100x cost reduction for high-traffic systems.
- 10-50x latency improvement for cache hits.
- Scalability: Serve 10x more users without adding GPU capacity.
Best Practices:
- Start with exact match for deterministic workloads.
- Graduate to semantic caching for NLP/LLMs.
- Instrument everything: Hit rate, latency, cost savings.
- Plan for invalidation from day one.
- Security: User-scoped keys, rate limiting, audit logs.
Master caching, and you’ll build the fastest, cheapest inference systems on the planet.