Chapter 30.1: Vector Databases at Scale
“The hardest problem in computer science is no longer cache invalidation or naming things—it’s finding the one relevant paragraph in a billion documents in under 50 milliseconds.” — Architecture Note from a FAANG Search Team
30.1.1. The New Database Primitive
In the era of Generative AI, the Vector Database has emerged as a core component of the infrastructure stack, sitting alongside the Relational DB (OLTP), the Data Warehouse (OLAP), and the Key-Value Store (Caching). It is the long-term memory of the LLM.
The Role of the Vector Store in RAG
Retrieval Augmented Generation (RAG) relies on the premise that you can find relevant context for a query. This requires:
- Embedding: Converting text/images/audio into high-dimensional vectors.
- Indexing: Organizing those vectors for fast similarity search.
- Retrieval: Finding the approximate nearest neighbors (ANN) of a query vector.
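Before any specialized index enters the picture, this whole pipeline fits in a few lines of NumPy. The sketch below is a hypothetical, brute-force version of embed → index → retrieve; the embed() function is a stand-in for whatever embedding model you actually use. It is the exact baseline that every ANN index approximates.

import numpy as np

# Stand-in for a real embedding model: returns (n, d) L2-normalized vectors.
def embed(texts):
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 384))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

documents = ["Q3 revenue grew 12%", "Fire safety procedures", "Board meeting minutes"]
doc_vectors = embed(documents)               # "Indexing" = just keeping the matrix around

def retrieve(query: str, k: int = 2):
    q = embed([query])[0]
    scores = doc_vectors @ q                 # cosine similarity (vectors are normalized)
    top_k = np.argsort(-scores)[:k]          # exact nearest neighbors, O(N) per query
    return [(documents[i], float(scores[i])) for i in top_k]

print(retrieve("quarterly results"))

Everything that follows in this chapter exists because the matrix product above stops being affordable at hundreds of millions of rows.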
Taxonomy of Vector Stores
Not all vector stores are created equal. We see three distinct architectural patterns in the wild:
1. The Embedded Library (In-Process)
The database runs inside your application process.
- Examples: Chroma, LanceDB, FAISS (raw).
- Pros: Zero network latency, simple deployment (just a pip install).
- Cons: Scales only as far as the local disk/RAM; harder to share across multiple writer services.
- Use Case: Local development, single-node apps, “Chat with my PDF” tools.
2. The Native Vector Database (Standalone)
A dedicated distributed system built from scratch for vectors.
- Examples: Weaviate, Qdrant, Pinecone, Milvus.
- Pros: Purpose-built for high-scale, advanced filtering, hybrid search features.
- Cons: Another distributed system to manage (or buy).
- Use Case: Production RAG at scale, real-time recommendation systems.
3. The Vector-Enabled General Purpose DB
Adding vector capabilities to existing SQL/NoSQL stores.
- Examples: pgvector (Postgres), AWS OpenSearch, MongoDB Atlas, Redis.
- Pros: “Boring technology,” leverage existing backups/security/compliance, no new infrastructure.
- Cons: Often slower than native vector DBs at massive scale (billion+ vectors); vector search is a second-class citizen.
- Use Case: Enterprise apps where data gravity is in Postgres, medium-scale datasets (<100M vectors).
30.1.2. Architecture: AWS OpenSearch Serverless (Vector Engine)
AWS OpenSearch (formerly Amazon Elasticsearch Service) offers a serverless “Vector Engine” mode that decouples compute from storage, providing a cloud-native experience.
Key Characteristics
- Decoupled Architecture: Storage lives in S3; compute runs as effectively stateless OpenSearch Compute Units (OCUs), split between indexing and search.
- Algorithm: Uses NMSLIB (Non-Metric Space Library) implementing HNSW (Hierarchical Navigable Small World) graphs.
- Scale: Supports billions of vectors.
- Serverless: Auto-scaling of OCUs based on traffic.
Infrastructure as Code (Terraform)
Deploying a production-ready Serverless Vector Collection requires handling encryption, network policies, and data access policies.
# -----------------------------------------------------------------------------
# AWS OpenSearch Serverless: Vector Engine
# -----------------------------------------------------------------------------
resource "aws_opensearchserverless_collection" "rag_memory" {
  name        = "rag-prod-memory"
  type        = "VECTORSEARCH" # The critical flag
  description = "Long-term memory for GenAI Platform"

  depends_on = [
    aws_opensearchserverless_security_policy.encryption
  ]
}
# 1. Encryption Policy (KMS)
resource "aws_opensearchserverless_security_policy" "encryption" {
  name        = "rag-encryption-policy"
  type        = "encryption"
  description = "Encryption at rest for RAG contents"

  policy = jsonencode({
    Rules = [
      {
        ResourceType = "collection"
        Resource     = ["collection/rag-prod-memory"]
      }
    ]
    AWSOwnedKey = true # Or specify your own KMS key ARN
  })
}
# 2. Network Policy (VPC vs Public)
resource "aws_opensearchserverless_security_policy" "network" {
  name        = "rag-network-policy"
  type        = "network"
  description = "Allow access from VPC and VPN"

  policy = jsonencode([
    {
      Rules = [
        {
          ResourceType = "collection"
          Resource     = ["collection/rag-prod-memory"]
        },
        {
          ResourceType = "dashboard"
          Resource     = ["collection/rag-prod-memory"]
        }
      ]
      AllowFromPublic = false
      SourceVPCEs = [
        aws_opensearchserverless_vpc_endpoint.main.id
      ]
    }
  ])
}
# 3. Data Access Policy (IAM)
resource "aws_opensearchserverless_access_policy" "data_access" {
  name        = "rag-data-access"
  type        = "data"
  description = "Allow RAG Lambda and SageMaker roles to read/write"

  policy = jsonencode([
    {
      Rules = [
        {
          ResourceType = "collection"
          Resource     = ["collection/rag-prod-memory"]
          Permission = [
            "aoss:CreateCollectionItems",
            "aoss:DeleteCollectionItems",
            "aoss:UpdateCollectionItems",
            "aoss:DescribeCollectionItems"
          ]
        },
        {
          ResourceType = "index"
          Resource     = ["index/rag-prod-memory/*"]
          Permission = [
            "aoss:CreateIndex",
            "aoss:DeleteIndex",
            "aoss:UpdateIndex",
            "aoss:DescribeIndex",
            "aoss:ReadDocument",
            "aoss:WriteDocument"
          ]
        }
      ]
      Principal = [
        aws_iam_role.rag_inference_lambda.arn,
        aws_iam_role.indexing_batch_job.arn,
        data.aws_caller_identity.current.arn # Admin access
      ]
    }
  ])
}
# VPC Endpoint for private access
resource "aws_opensearchserverless_vpc_endpoint" "main" {
  name       = "rag-vpce"
  vpc_id     = var.vpc_id
  subnet_ids = var.private_subnet_ids

  security_group_ids = [
    aws_security_group.opensearch_client_sg.id
  ]
}
Creating the Index (Python)
Once infrastructure is up, you define the index mapping.
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
import boto3

# Auth: sign requests with SigV4 against the "aoss" (OpenSearch Serverless) service
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, 'us-east-1', 'aoss')

# Client
client = OpenSearch(
    hosts=[{'host': 'Use-The-Collection-Endpoint.us-east-1.aoss.amazonaws.com', 'port': 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

# Define Index
index_name = "corp-knowledge-base-v1"
index_body = {
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 100  # Tradeoff: Recall vs Latency
        }
    },
    "mappings": {
        "properties": {
            "vector_embedding": {
                "type": "knn_vector",
                "dimension": 1536,  # E.g., for OpenAI text-embedding-3-small
                "method": {
                    "name": "hnsw",
                    "engine": "nmslib",
                    "space_type": "cosinesimil",  # Cosine similarity is standard for embeddings
                    "parameters": {
                        "ef_construction": 128,
                        "m": 24  # Max connections per node
                    }
                }
            },
            "text_content": {"type": "text"},  # For keyword search (hybrid)
            "metadata": {
                "properties": {
                    "source": {"type": "keyword"},
                    "created_at": {"type": "date"},
                    "access_level": {"type": "keyword"}
                }
            }
        }
    }
}

if not client.indices.exists(index=index_name):
    client.indices.create(index=index_name, body=index_body)
    print(f"Index {index_name} created.")
30.1.3. Architecture: GCP Vertex AI Vector Search
Google’s offering (formerly Matching Engine) is built on ScaNN (Scalable Nearest Neighbors), a Google Research algorithm (also available as an open-source library) that often outperforms HNSW and IVF-Flat in ANN benchmarks.
Key Characteristics
- High Throughput: Capable of extremely high QPS (Queries Per Second).
- Recall/Performance: ScaNN uses anisotropic vector quantization which respects the dot product geometry better than standard K-means quantization.
- Architecture: Separate control plane (Index) and data plane (IndexEndpoint).
Infrastructure as Code (Terraform)
# -----------------------------------------------------------------------------
# GCP Vertex AI Vector Search
# -----------------------------------------------------------------------------
resource "google_storage_bucket" "vector_bucket" {
  name     = "gcp-ml-vector-store-${var.project_id}"
  location = "US"
}

# 1. The Index (Logical Definition)
# Note: You generally create indexes via API/SDK in standard MLOps
# because they are immutable/versioned artifacts, but here is the TF resource.
resource "google_vertex_ai_index" "main_index" {
  display_name = "production-knowledge-base"
  description  = "Main RAG index using ScaNN"
  region       = "us-central1"

  metadata {
    contents_delta_uri = "gs://${google_storage_bucket.vector_bucket.name}/indexes/v1"

    config {
      dimensions                  = 768 # E.g., for Gecko embeddings
      approximate_neighbors_count = 150
      distance_measure_type       = "DOT_PRODUCT_DISTANCE"

      algorithm_config {
        tree_ah_config {
          leaf_node_embedding_count    = 500
          leaf_nodes_to_search_percent = 7
        }
      }
    }
  }

  index_update_method = "STREAM_UPDATE" # Enable real-time updates
}
# 2. The Index Endpoint (Serving Infrastructure)
resource "google_vertex_ai_index_endpoint" "main_endpoint" {
  display_name = "rag-endpoint-public"
  region       = "us-central1"
  network      = "projects/${var.project_number}/global/networks/${var.vpc_network}"
}
# 3. Deployment (Deploy Index to Endpoint)
resource "google_vertex_ai_index_endpoint_deployed_index" "deployment" {
  depends_on = [google_vertex_ai_index.main_index]

  index_endpoint    = google_vertex_ai_index_endpoint.main_endpoint.id
  index             = google_vertex_ai_index.main_index.id
  deployed_index_id = "deployed_v1"
  display_name      = "production-v1"

  dedicated_resources {
    min_replica_count = 2
    max_replica_count = 10

    machine_spec {
      machine_type = "e2-standard-16"
    }
  }
}
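Querying the deployed index from application code goes through the google-cloud-aiplatform SDK. A minimal sketch, assuming a 768-dimensional query_embedding produced elsewhere; the project and endpoint resource names are placeholders, and find_neighbors targets public/PSC endpoints (VPC-peered private endpoints use the lower-level match() call instead).

from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="us-central1")

# Attach to the already-deployed endpoint (resource name is a placeholder)
endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name="projects/123456789/locations/us-central1/indexEndpoints/987654321"
)

# query_embedding: list[float] of length 768, produced by your embedding model
response = endpoint.find_neighbors(
    deployed_index_id="deployed_v1",   # matches the Terraform deployed_index_id above
    queries=[query_embedding],
    num_neighbors=10,
)

for neighbor in response[0]:
    print(neighbor.id, neighbor.distance)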
ScaNN vs. HNSW
Why choose Vertex/ScaNN?
- HNSW: Graph-based. Great per-query latency. Memory intensive (graph structure). Random access patterns (bad for disk).
- ScaNN: Quantization-based + Tree search. Higher compression. Google hardware optimization.
30.1.4. RDS pgvector: The “Just Use Postgres” Option
For many teams, introducing a new database (OpenSearch or Weaviate) is operational overhead they don’t want. pgvector is an extension for PostgreSQL that enables vector similarity search.
Why pgvector?
- Transactional: ACID compliance for your vectors.
- Joins: Join standard SQL columns with vector search results in one query.
- Familiarity: It’s just Postgres.
Infrastructure (Terraform)
resource "aws_db_instance" "postgres" {
identifier = "rag-postgres-db"
engine = "postgres"
engine_version = "15.3"
instance_class = "db.r6g.xlarge" # Memory optimized for vectors
allocated_storage = 100
# Ensure you install the extension
# Note: You'll typically do this in a migration script, not Terraform
}
SQL Implementation
-- 1. Enable Extension
CREATE EXTENSION IF NOT EXISTS vector;

-- 2. Create Table
CREATE TABLE documents (
    id bigserial PRIMARY KEY,
    content text,
    metadata jsonb,
    embedding vector(1536) -- OpenAI dimension
);

-- 3. Create HNSW Index (Vital for performance!)
-- ivfflat is simpler but hnsw is generally preferred for recall/performance
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- 4. Query (KNN)
SELECT content, metadata, 1 - (embedding <=> '[...vector...]') AS similarity
FROM documents
ORDER BY embedding <=> '[...vector...]' -- <=> is the cosine distance operator
LIMIT 5;

-- 5. Hybrid Query (SQL + Vector)
SELECT content
FROM documents
WHERE metadata->>'category' = 'finance' -- SQL filter
ORDER BY embedding <=> '[...vector...]'
LIMIT 5;
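From application code this is plain SQL over any Postgres driver. A minimal sketch using psycopg2, assuming the documents table above and an embed() helper that returns a Python list of 1536 floats; the connection parameters are placeholders.

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="rag",
                        user="rag_app", password="change-me")

def search(query_text: str, category: str, k: int = 5):
    # pgvector accepts the '[1.0,2.0,...]' text representation cast to ::vector
    query_vec = "[" + ",".join(str(x) for x in embed(query_text)) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, 1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            WHERE metadata->>'category' = %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (query_vec, category, query_vec, k),
        )
        return cur.fetchall()

print(search("quarterly revenue growth", "finance"))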
30.1.5. Deep Dive: Indexing Algorithms and Tuning
The choice of index algorithm dictates the “Recall vs. Latency vs. Memory” triangle. Understanding the internals of these algorithms is mandatory for tuning production systems.
1. Inverted File Index (IVF-Flat)
IVF allows you to speed up search by clustering the vector space and only searching a subset.
- Mechanism:
  - Training: Run K-Means on a sample of the data to find $C$ centroids (where `nlist` = $C$).
  - Indexing: Assign every vector in the dataset to its nearest centroid.
  - Querying: Find the closest `nprobe` centroids to the query vector; search only the vectors in those buckets.
- Parameters (see the FAISS sketch below):
  - `nlist`: Number of clusters. Recommendation: $4 \times \sqrt{N}$ (where $N$ is the total number of vectors).
  - `nprobe`: Number of buckets to search.
    - `nprobe = 1`: Fast, low recall (only the single closest bucket is searched).
    - `nprobe = nlist`: Slow, perfect recall (brute force).
    - Sweet spot: typically 1-5% of `nlist`.
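A minimal FAISS sketch of IVF-Flat, included to make the nlist/nprobe trade-off concrete; the data is random and the sizes are illustrative, not a benchmark.

import faiss
import numpy as np

d, n = 768, 100_000
xb = np.random.random((n, d)).astype("float32")   # database vectors
xq = np.random.random((5, d)).astype("float32")   # query vectors

nlist = int(4 * np.sqrt(n))                       # rule of thumb from above (~1265 clusters)
quantizer = faiss.IndexFlatL2(d)                  # coarse quantizer used to assign centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(xb)                                   # K-Means to learn the nlist centroids
index.add(xb)                                     # assign each vector to its nearest centroid

index.nprobe = 16                                 # ~1% of nlist: the recall/latency knob
distances, ids = index.search(xq, 10)
print(ids[0])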
2. Product Quantization (PQ) with IVF (IVF-PQ)
IVF reduces the search scope, but PQ reduces the memory footprint.
- Mechanism:
- Split the high-dimensional vector (e.g., 1024 dims) into $M$ sub-vectors (e.g., 8 sub-vectors of 128 dims).
- Run K-means on each subspace to create a codebook.
- Replace the float32 values with the centroid ID (usually 1 byte).
- Result: Massive compression (e.g., 32x to 64x).
- Trade-off: PQ introduces loss. Distances are approximated. You might miss the true nearest neighbor because the vector was compressed.
- Refinement: Often used with a “Re-ranking” step where you load the full float32 vectors for just the top-k candidates to correct the order.
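The same FAISS setup with product quantization added. This sketch (again random data, illustrative sizes) compresses each 768-dim float32 vector to 96 one-byte codes (96 sub-vectors of 8 bits, a 32x reduction) and then applies the re-ranking refinement described above.

import faiss
import numpy as np

d, n = 768, 100_000
xb = np.random.random((n, d)).astype("float32")
xq = np.random.random((5, d)).astype("float32")

nlist, m, nbits = 1024, 96, 8                     # 96 sub-vectors x 8 bits = 96 bytes/vector (vs 3072)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)                                   # learns both the IVF centroids and the PQ codebooks
index.add(xb)
index.nprobe = 16

# PQ distances are approximate: over-fetch, then re-rank candidates with exact distances
distances, ids = index.search(xq, 100)            # top-100 by compressed (approximate) distance
exact = np.linalg.norm(xb[ids[0]] - xq[0], axis=1)
reranked = ids[0][np.argsort(exact)][:10]         # final top-10 after full-precision re-ranking
print(reranked)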
3. Hierarchical Navigable Small World (HNSW)
HNSW is the industry standard for in-memory vector search because it offers logarithmic complexity $O(\log N)$ with high recall.
- Graph Structure:
- It’s a multi-layered graph (a Skip List for graphs).
- Layer 0: Contains all data points (dense).
- Layer K: Contains a sparse subset of points serving as “expressways”.
- Search Process:
- Enter at the top layer.
- Greedily traverse to the nearest neighbor in that layer.
- “Descend” to the next layer down, using that node as the entry point.
- Repeat until Layer 0.
- Tuning `M` (max connections per node):
  - Controls memory usage and recall.
  - Range: 4 to 64.
  - Higher `M` = better recall and robustness against “islands” in the graph, but higher RAM usage per vector.
- Tuning `ef_construction`:
  - Size of the dynamic candidate list during index build.
  - Higher = better-quality graph (fewer disconnected components), but significantly slower indexing.
  - Rule of thumb: `ef_construction` $\approx 2 \times M$.
- Tuning `ef_search`:
  - Size of the candidate list during a query.
  - Higher = better recall, higher latency.
  - Dynamic tuning: you can change `ef_search` at runtime without rebuilding the index. This is your knob for “high precision mode” vs. “high speed mode” (see the sketch below).
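A minimal FAISS HNSW sketch showing the three knobs: efConstruction must be set before add(), while efSearch can be changed per query workload without touching the index. Random data, illustrative sizes.

import faiss
import numpy as np

d, n = 768, 100_000
xb = np.random.random((n, d)).astype("float32")
xq = np.random.random((5, d)).astype("float32")

M = 24                                      # max connections per node (memory vs. recall)
index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = 2 * M           # build-time candidate list (~2x M rule of thumb)
index.add(xb)                               # no train() step: HNSW builds the graph incrementally

index.hnsw.efSearch = 32                    # "high speed mode"
_, fast_ids = index.search(xq, 10)

index.hnsw.efSearch = 256                   # "high precision mode" -- same index, no rebuild
_, precise_ids = index.search(xq, 10)
print(fast_ids[0], precise_ids[0])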
4. DiskANN (Vamana Graph)
As vector datasets grow to 1 billion+ (e.g., embedding every paragraph of a corporate SharePoint history), RAM becomes the bottleneck. HNSW requires all nodes in memory.
DiskANN solves this by leveraging modern NVMe SSD speeds.
- Vamana Graph: A graph structure designed to minimize the number of hops (disk reads) to find a neighbor.
- Mechanism:
- Keep a compressed representation (PQ) in RAM for fast navigation.
- Keep full vectors on NVMe SSD.
- During search, use RAM to narrow down candidates.
- Fetch full vectors from disk only for final distance verification.
- Cost: Store 1B vectors on $200 of SSD instead of $5000 of RAM.
30.1.6. Capacity Planning and Sizing Guide
Sizing a vector cluster is more complex than a standard DB because vectors are computationally heavy (distance calculations) and memory heavy.
1. Storage Calculation
Vectors are dense float arrays. $$ Size_{GB} = \frac{N \times D \times 4}{1024^3} $$
- $N$: Number of vectors.
- $D$: Dimensions.
- $4$: Bytes per float32.
Overhead:
- HNSW: Adds overhead for storing graph edges. Add ~10-20% for links.
- Metadata: Don’t forget the JSON metadata stored with vectors! Often larger than the vector itself.
Example:
- 100M vectors.
- OpenAI `text-embedding-3-small` (1536 dims).
- 1 KB of metadata per doc.
- Vector size: $100,000,000 \times 1536 \times 4 \text{ bytes} \approx 614 \text{ GB}$ (about 572 GiB).
- Metadata size: $100,000,000 \times 1 \text{ KB} \approx 100 \text{ GB}$.
- Index overhead (HNSW): ~100 GB.
- Total: ~814 GB of RAM (if using HNSW) or disk (if using DiskANN).
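The arithmetic above is worth scripting so it can be rerun per corpus. A small helper implementing the same estimate, assuming a flat 15% index overhead in line with the 10-20% figure above (decimal GB for simplicity):

def estimate_memory_gb(num_vectors: int, dims: int, metadata_bytes: int = 1024,
                       index_overhead: float = 0.15, bytes_per_float: int = 4) -> dict:
    """Rough memory/storage estimate for a vector index, in decimal GB."""
    vector_gb = num_vectors * dims * bytes_per_float / 1e9
    metadata_gb = num_vectors * metadata_bytes / 1e9
    overhead_gb = vector_gb * index_overhead   # ~10-20% for HNSW graph links (see above)
    return {
        "vectors_gb": round(vector_gb, 1),
        "metadata_gb": round(metadata_gb, 1),
        "index_overhead_gb": round(overhead_gb, 1),
        "total_gb": round(vector_gb + metadata_gb + overhead_gb, 1),
    }

# The 100M x 1536-dim example from above
print(estimate_memory_gb(num_vectors=100_000_000, dims=1536))
# -> roughly {'vectors_gb': 614.4, 'metadata_gb': 102.4, 'index_overhead_gb': 92.2, 'total_gb': 809.0}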
2. Compute Calculation (QPS)
QPS depends on ef_search and CPU cores.
- Recall vs Latency Curve:
- For 95% Recall, you might get 1000 QPS.
- For 99% Recall, you might drop to 200 QPS.
- Sharding:
- Vector search is easily parallelizable.
- Throughput Sharding: Replicate the entire index to multiple nodes. Load balance queries.
- Data Sharding: Split the index into 4 parts. Query all 4 in parallel, merge results (Map-Reduce). Necessary when index > RAM.
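Data sharding is conceptually just scatter-gather: every shard answers the same top-k query over its slice, and a coordinator merges by score. A minimal, backend-agnostic sketch; the shards list and their search() method are hypothetical stand-ins for per-shard clients.

import heapq
from concurrent.futures import ThreadPoolExecutor

def search_all_shards(shards, query_vector, k: int = 10):
    """Fan the query out to every shard in parallel, then merge the partial top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        # each shard.search(...) is assumed to return [(doc_id, score), ...] sorted by score desc
        partial_results = pool.map(lambda s: s.search(query_vector, k), shards)

    # Merge: global top-k across all shards (higher score = more similar)
    merged = heapq.nlargest(k, (hit for hits in partial_results for hit in hits),
                            key=lambda hit: hit[1])
    return merged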
30.1.7. Production Challenges & Anti-Patterns
1. The “Delete” Problem
HNSW graphs are hard to modify. Deleting a node leaves a “hole” in the graph connectivity.
- Standard Implementation: “Soft Delete” (mark as deleted).
- Consequence: Over time, the graph quality degrades, and the “deleted” nodes still consume RAM and are processed during search (just filtered out at the end).
- Fix: Periodic “Force Merge” or “Re-index” operations are required to clean up garbage. Treat vector indexes as ephemeral artifacts that are rebuilt nightly/weekly.
2. The “Update” Problem
Updating a vector (re-embedding a document) is effectively a Delete + Insert.
- Impact: High write churn kills read latency in HNSW.
- Architecture: Separate read/write paths (Lambda architecture):
  - Batch layer: Rebuild the full index every night.
  - Speed layer: A small in-memory index for today’s data.
  - Query: Search both, merge results.
3. Dimensionality Curse
Higher dimensions = Better semantic capture? Not always.
- Going from 768 (BERT) to 1536 (OpenAI) doubles memory and halves speed.
- MRL (Matryoshka Representation Learning): See Chapter 30.2. Use dynamically shortened embeddings to save cost.
30.1.8. Security: Infrastructure as Code for Multi-Tenant Vector Stores
If you are building a RAG platform for multiple internal teams (HR, Engineering, Legal), you must segregate data.
Strategy 1: Index-per-Tenant
- Pros: Hard isolation. Easy to delete tenant data.
- Cons: Resource waste (overhead per index).
Strategy 2: Filter-based Segregation
All vectors in one big index, with a tenant_id field.
- Pros: Efficient resource usage.
- Cons: One bug in filter logic leaks Legal data to Engineering.
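Filter-based segregation stands or falls with the filter being applied on every single query, never left to the caller. A minimal sketch of the idea using the Qdrant client (collection and payload field names are illustrative); the same pattern applies to any store that supports pre-filtered ANN.

from qdrant_client import QdrantClient, models

client = QdrantClient("localhost")

def tenant_search(tenant_id: str, query_vector, limit: int = 10):
    """Every query is forced through a server-side tenant_id filter -- never optional."""
    return client.search(
        collection_name="rag_documents",
        query_vector=query_vector,
        query_filter=models.Filter(
            must=[models.FieldCondition(key="tenant_id",
                                        match=models.MatchValue(value=tenant_id))]
        ),
        limit=limit,
    )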
Terraform for Secure OpenSearch
Implementing granular IAM for index-level access.
# IAM Policy for restricting access to specific indices
resource "aws_iam_policy" "hr_only_policy" {
  name        = "rag-hr-data-access"
  description = "Access only HR indices"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["aoss:APIAccessAll"]
        Resource = ["arn:aws:aoss:us-east-1:123456789012:collection/rag-prod"]
        Condition = {
          "StringLike" = { # StringLike, not StringEquals: the value contains a wildcard
            "aoss:index" = "hr-*"
          }
        }
      }
    ]
  })
}
30.1.9. Benchmarking Framework
Never trust vendor benchmarks (“1 Million QPS!”). Run your own with your specific data distribution and vector dimension.
VectorDBBench
A popular open-source tool for comparing vector DBs.
pip install vectordb-bench
# Run a standard benchmark
vectordb-bench run \
  --db opensearch \
  --dataset gist-960-euclidean \
  --test_cases performance \
  --output_dir ./results
Key Metrics to Measure
- QPS at 99% Recall: The only metric that matters. High QPS at 50% recall is useless.
- P99 Latency: RAG is a chain; high tail latency breaks the UX.
- Indexing Speed: How long to ingest 10M docs? (Critical for disaster recovery).
- TCO per Million Vectors: Hardware cost + license cost.
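Recall is measured against exact (brute-force) search on a held-out sample of queries. A small helper, assuming ann_ids are the ID matrices returned by the system under test and exact_ids come from a flat/brute-force index over the same data:

import numpy as np

def recall_at_k(ann_ids: np.ndarray, exact_ids: np.ndarray, k: int = 10) -> float:
    """Fraction of the true top-k neighbors the ANN index also returned, averaged over queries."""
    hits = 0
    for approx, exact in zip(ann_ids[:, :k], exact_ids[:, :k]):
        hits += len(set(approx) & set(exact))
    return hits / (len(ann_ids) * k)

# Example usage against a brute-force baseline built over the same vectors:
# exact_ids = flat_index.search(queries, k)[1]
# ann_ids   = hnsw_index.search(queries, k)[1]
# print(f"recall@10 = {recall_at_k(ann_ids, exact_ids, k=10):.3f}")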
30.1.10. Detailed Comparison Matrix
| Feature | AWS OpenSearch Serverless | Vertex AI Vector Search | pgvector (RDS) | Pinecone (Serverless) |
|---|---|---|---|---|
| Core Algo | HNSW (NMSLIB) | ScaNN | HNSW / IVFFlat | Proprietary Graph |
| Engine | Lucene-based | Google Research | Postgres Extension | Proprietary |
| Storage Tier | S3 (decoupled) | GCS | EBS (coupled) | S3 (decoupled) |
| Upsert Speed | Moderate (~seconds) | Fast (streaming) | Fast (transactional) | Fast |
| Cold Start | Yes (OCU spinup) | No (Always on) | No | Yes |
| Hybrid Search | Native (Keyword+Vector) | Limited (mostly vector) | Native (SQL+Vector) | Native (Sparse-Dense) |
| Metadata Filter | Efficient | Efficient | Very Efficient | Efficient |
| Cost Model | Per OCU-hour | Per Node-hour | Instance Size | Usage-based |
Decision Guide
- Choose AWS OpenSearch if: You are already deep in AWS, need FIPS compliance, and want “Serverless” scaling.
- Choose Vertex AI if: You have massive scale (>100M), strict latency budgets (<10ms), and Google-level recall needs.
- Choose pgvector if: You have <10M vectors, need ACID transactions, want to keep stack simple (one DB).
- Choose Pinecone if: You want zero infrastructure management and best-in-class developer experience.
30.1.12. Integration with Feature Stores
In a mature MLOps stack, the Vector Database does not live in isolation. It typically acts as a “candidate generator” that feeds a more complex ranking system powered by a Feature Store.
The “Retrieve -> Enrich -> Rank” Pattern
- Retrieve (Vector DB): Get top 100 items suitable for the user (based on embedding similarity).
- Enrich (Feature Store): Fetch real-time features for those 100 items (e.g., “click_count_last_hour”, “stock_status”, “price”).
- Rank (XGBoost/LLM): Re-score the items based on the fresh feature data.
Why not store everything in the Vector DB?
Vector DBs are eventually consistent and optimized for immutable data. They are terrible at high-velocity updates (like “view count”).
- Vector DB: Stores Description embedding (Static).
- Feature Store (Redis/Feast): Stores Price, Inventory, Popularity (Dynamic).
Code: Feast + Qdrant Integration
from feast import FeatureStore
from qdrant_client import QdrantClient

# 1. Retrieve Candidates (Vector DB)
q_client = QdrantClient("localhost")
hits = q_client.search(
    collection_name="products",
    query_vector=user_embedding,   # embedding of the current user/query, computed upstream
    limit=100
)
product_ids = [hit.payload['product_id'] for hit in hits]

# 2. Enrich (Feast)
store = FeatureStore(repo_path=".")
feature_vector = store.get_online_features(
    features=[
        "product_stats:view_count_1h",
        "product_stats:conversion_rate_24h",
        "product_stock:is_available"
    ],
    entity_rows=[{"product_id": pid} for pid in product_ids]
).to_dict()

# 3. Rank (Custom Logic)
ranked_products = []
for pid, views, conv, avail in zip(
    product_ids,
    feature_vector["view_count_1h"],
    feature_vector["conversion_rate_24h"],
    feature_vector["is_available"],
):
    if not avail:
        continue  # Filter out-of-stock items
    score = (views * 0.1) + (conv * 50)  # Simple heuristic; replace with an XGBoost/LLM re-ranker
    ranked_products.append((pid, score))

ranked_products.sort(key=lambda x: x[1], reverse=True)
30.1.13. Multimodal RAG: Beyond Text
RAG is no longer just for text. Multimodal RAG allows searching across images, audio, and video using models like CLIP (Contrastive Language-Image Pre-Training).
Architecture
- Embedding Model: CLIP (OpenAI) or SigLIP (Google). Maps Image and Text to the same vector space.
- Storage:
  - Vector DB: Stores the embedding.
  - Object Store (S3): Stores the actual JPEG/PNG.
  - Metadata: Stores the S3 URI (s3://bucket/photo.jpg).
CLIP Search Implementation
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# 1. Indexing an Image
image = Image.open("dog.jpg")
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
# Normalize so that dot product == cosine similarity in the shared space
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
vector_db.add(id="dog_1", vector=image_features.detach().numpy())  # vector_db: your client of choice

# 2. Querying with Text ("Find me images of dogs")
text_inputs = processor(text=["a photo of a dog"], return_tensors="pt")
text_features = model.get_text_features(**text_inputs)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
results = vector_db.search(text_features.detach().numpy())

# 3. Querying with Image ("Find images like this one")
# Just call get_image_features() on the query image and search.
Challenges
- Storage Cost: Images are large. Don’t store Base64 blobs in the Vector DB payload. It kills performance.
- Latency: CLIP inference is heavier than BERT.
30.1.14. Compliance: GDPR and “Right to be Forgotten”
Vector Databases are databases. They contain PII. You must be able to delete data from them.
The “Deletion” Nightmare
As discussed in the Indexing section, HNSW graphs hate deletions. However, GDPR Article 17 grants a “Right to Erasure” that must be honored without undue delay (in practice, within about a month).
Strategies
- Partitioning by User: If you have a B2C app, create a separate index (or partition) per user. Deleting a user = dropping the partition.
  - Feasible for: “Chat with my Data” apps.
  - Impossible for: application-wide search.
- Crypto-Shredding:
  - Encrypt the metadata payload with a per-user key.
  - Store the key in a separate KMS.
  - To “delete” the user, destroy the key. The data is now garbage.
  - Note: This doesn’t remove the vector from the graph, so the document may still appear in search results (as a generic blob), but the content is unreadable.
- The “Blacklist” Filter (sketched below):
  - Maintain a Redis set of `deleted_doc_ids`.
  - Apply this as a mandatory exclusion filter on every query.
  - Rebuild the index monthly to permanently purge the data.
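The blacklist filter is easy to get wrong if it is applied in application code after the fact (post-filtering shrinks your result set). A minimal sketch of the pattern, assuming a Redis set named deleted_doc_ids and an OpenSearch-style query body; the key point is that the exclusion is part of the query sent to the engine, not a post-hoc filter.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def build_query(query_embedding, k: int = 5):
    """k-NN query with a mandatory exclusion of soft-deleted documents."""
    deleted_ids = list(r.smembers("deleted_doc_ids"))   # the GDPR blacklist
    return {
        "size": k,
        "query": {
            "bool": {
                "must": [
                    {"knn": {"vector_embedding": {"vector": query_embedding, "k": k}}}
                ],
                "must_not": [
                    {"ids": {"values": deleted_ids}}     # excluded inside the engine, on every query
                ],
            }
        },
    }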
30.1.15. Cost Analysis: Build vs. Buy
The most common question from leadership: “Why does this cost $5,000/month?”
Scenario: 50 Million Vectors (Enterprise Scale)
- Dimensions: 1536 (OpenAI).
- Traffic: 100 QPS.
Option A: Managed (Pinecone / Weaviate Cloud)
- Pricing: Usage-based (Storage + Read Units + Write Units).
- Storage: ~$1000/month (for pod-based systems).
- Compute: Usage based.
- Total: ~$1,500 - $3,000 / month.
- Ops Effort: Near Zero.
Option B: Self-Hosted (AWS OpenSearch Managed)
- Data Nodes: 3x r6g.2xlarge (64 GB RAM each).
  - Cost: $0.26 * 24 * 30 * 3 ≈ $562.
- Master Nodes: 3x m6g.large.
  - Cost: $0.08 * 24 * 30 * 3 ≈ $173.
- Storage (EBS): 1 TB gp3.
  - Cost: ~$100.
- Total: ~$835 / month.
- Ops Effort: Medium (upgrades, resizing, dashboards).
Option C: DIY (EC2 + Qdrant/Milvus)
- Nodes: 3x Spot Instances (r6g.2xlarge).
  - Cost: ~$200 / month (Spot pricing).
- Total: ~$200-$300 / month.
- Ops Effort: High (Kubernetes, HA, Spot interruption handling).
Verdict: Unless you are Pinterest or Uber, Managed or Cloud Native (OpenSearch) is usually the right answer. The engineering time spent fixing a corrupted HNSW graph on a Saturday is worth more than the $1000 savings.
30.1.17. Case Study: Migrating to Billion-Scale RAG at “FinTechCorp”
Scaling from a POC (1 million docs) to Enterprise Search (1 billion docs) breaks almost every assumption you made in the beginning.
The Challenge
FinTechCorp had 20 years of PDF financial reports.
- Volume: 500 Million pages.
- Current Stack: Elasticsearch (Keyword).
- Goal: “Chat with your Documents” for 5,000 analysts.
Phase 1: The POC (Chroma)
- Setup: Single Python server running ChromaDB.
- Result: Great success on 100k docs. Analysts loved the semantic search.
- Failure: When they loaded 10M docs, the server crashed with OOM (Out of Memory). HNSW requires RAM.
Phase 2: The Scale-Out (OpenSearch + DiskANN)
- Decision: They couldn’t afford 5TB of RAM to hold the HNSW graph.
- Move: Switched to OpenSearch Service with `nmslib`.
- Optimization:
  - Quantization: Used byte-quantized (int8) vectors (4x memory reduction vs. float32) at the cost of slight precision loss.
  - Sharding: Split the index into 20 shards across 6 data nodes.
Phase 3: The Ingestion Bottleneck
- Problem: Re-indexing took 3 weeks.
- Fix: Built a Spark job on EMR to generate embeddings in parallel (1000 node cluster) and bulk-load into OpenSearch.
Outcome
- Latency: 120ms (P99).
- Recall: 96% compared to brute force.
- Cost: $4,500/month (Managed instances + Storage).
30.1.18. War Story: The “NaN” Embedding Disaster
“Production is down. Search is returning random results. It thinks ‘Apple’ is similar to ‘Microscope’.”
The Incident
On a Tuesday afternoon, the accuracy of the RAG system plummeted to zero. Users searching for “Quarterly Results” got documents about “Fire Safety Procedures.”
The Investigation
- Logs: No errors. 200 OK everywhere.
- Debug: We inspected the vectors and found that 0.1% of them contained NaN (Not a Number) values.
- Root Cause: The embedding model (BERT) had a bug where certain Unicode characters (emoji + Zalgo text) caused a division by zero in the LayerNorm layer.
- Propagation: Because HNSW relies on distance calculations, one NaN in the graph “poisoned” the distance metrics of its neighbors during the index build, effectively corrupting the entire graph structure.
The Fix
- Validation: Added a schema check in the ingestion pipeline: `assert not np.isnan(vector).any()`.
- Sanitization: Stripped non-printable characters before embedding.
- Rebuild: Had to rebuild the entire 50M-vector index from scratch (took 24 hours).
Lesson: Never trust the output of a neural network. Always validate mathematical properties (Norm length, NaN checks) before indexing.
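A small validation gate of the kind described above, to run between the embedding model and the indexer (the dimension and thresholds are illustrative):

import numpy as np

def validate_embedding(vector: np.ndarray, expected_dim: int = 1536) -> np.ndarray:
    """Reject mathematically broken embeddings before they ever reach the index."""
    vector = np.asarray(vector, dtype=np.float32)
    assert vector.shape == (expected_dim,), f"wrong dimension: {vector.shape}"
    assert not np.isnan(vector).any(), "NaN detected in embedding"
    assert not np.isinf(vector).any(), "Inf detected in embedding"
    norm = np.linalg.norm(vector)
    assert norm > 1e-6, "zero-norm embedding (model likely failed silently)"
    return vector / norm   # normalize so cosine/dot-product behave as expected downstream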
30.1.19. Interview Questions
If you are interviewing for an MLOps role focusing on Search/RAG, expect these questions.
Q1: What is the difference between HNSW and IVF?
- Answer: HNSW is a graph-based algorithm. It allows logarithmic traversal but consumes high memory because it stores edges. It generally has better recall. IVF is a clustering-based algorithm. It partitions the space into Voronoi cells. It is faster to train and uses less memory (especially with PQ), but recall can suffer at partition boundaries.
Q2: How do you handle metadata filtering in Vector Search?
- Answer: Explain the difference between Post-filtering (bad recall) and Pre-filtering (slow). Mention “Filtered ANN” where the index traversal skips nodes that don’t match the bitmask.
Q3: What is the “Curse of Dimensionality” in vector search?
- Answer: As dimensions increase, the distance between the nearest and farthest points becomes negligible, making “similarity” meaningless. Also, computational cost scales linearly with $D$. Dimensionality reduction (PCA or Matryoshka) helps.
Q4: How would you scale a vector DB to 100 Billion vectors?
- Answer: RAM is the bottleneck. I would use:
  - Disk-based indexing (DiskANN/Vamana) to store vectors on NVMe.
  - Product Quantization (PQ) to compress vectors by up to 64x.
  - Sharding: horizontal scaling across hundreds of nodes.
  - Tiered storage: hot data in RAM/HNSW, cold data in S3/Faiss-Flat.
30.1.20. Summary
The vector database is the hippocampus of the AI application.
- Don’t over-engineer: Start with `pgvector` or `Chroma` for prototypes.
- Plan for scale: Move to OpenSearch or Vertex when you hit 10M vectors.
- Tune your HNSW: Default settings are rarely optimal. Use the formulas above.
- Capacity Plan: Vectors are RAM-hungry. Calculate costs early.
- Monitor Recall: Latency is easy to measure; recall degradation is silent. Periodically test against a brute-force ground truth.
- Respect Compliance: Have a “Delete” button that actually works.
- Validate Inputs: Beware of `NaN` vectors!