Chapter 30.1: Vector Databases at Scale
“The hardest problem in computer science is no longer cache invalidation or naming things—it’s finding the one relevant paragraph in a billion documents in under 50 milliseconds.” — Architecture Note from a FAANG Search Team
30.1.1. The New Database Primitive
In the era of Generative AI, the Vector Database has emerged as a core component of the infrastructure stack, sitting alongside the Relational DB (OLTP), the Data Warehouse (OLAP), and the Key-Value Store (Caching). It is the long-term memory of the LLM.
The Role of the Vector Store in RAG
Retrieval Augmented Generation (RAG) relies on the premise that you can find relevant context for a query. This requires:
- Embedding: Converting text/images/audio into high-dimensional vectors.
- Indexing: Organizing those vectors for fast similarity search.
- Retrieval: Finding the approximate nearest neighbors (ANN) of a query vector.
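Before any specialized index enters the picture, this whole pipeline fits in a few lines of NumPy. The sketch below is a hypothetical, brute-force version of embed → index → retrieve; the embed() function is a stand-in for whatever embedding model you actually use. It is the exact baseline that every ANN index approximates.

import numpy as np

# Stand-in for a real embedding model: returns (n, d) L2-normalized vectors.
def embed(texts):
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 384))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

documents = ["Q3 revenue grew 12%", "Fire safety procedures", "Board meeting minutes"]
doc_vectors = embed(documents)               # "Indexing" = just keeping the matrix around

def retrieve(query: str, k: int = 2):
    q = embed([query])[0]
    scores = doc_vectors @ q                 # cosine similarity (vectors are normalized)
    top_k = np.argsort(-scores)[:k]          # exact nearest neighbors, O(N) per query
    return [(documents[i], float(scores[i])) for i in top_k]

print(retrieve("quarterly results"))

Everything that follows in this chapter exists because the matrix product above stops being affordable at hundreds of millions of rows.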
Taxonomy of Vector Stores
Not all vector stores are created equal. We see three distinct architectural patterns in the wild:
1. The Embedded Library (In-Process)
The database runs inside your application process.
- Examples: Chroma, LanceDB, FAISS (raw).
- Pros: Zero network latency, simple deployment (just a pip install).
- Cons: Scales only as far as the local disk/RAM; harder to share across multiple writer services.
- Use Case: Local development, single-node apps, “Chat with my PDF” tools.
2. The Native Vector Database (Standalone)
A dedicated distributed system built from scratch for vectors.
- Examples: Weaviate, Qdrant, Pinecone, Milvus.
- Pros: Purpose-built for high-scale, advanced filtering, hybrid search features.
- Cons: Another distributed system to manage (or buy).
- Use Case: Production RAG at scale, real-time recommendation systems.
3. The Vector-Enabled General Purpose DB
Adding vector capabilities to existing SQL/NoSQL stores.
- Examples: pgvector (Postgres), AWS OpenSearch, MongoDB Atlas, Redis.
- Pros: “Boring technology,” leverage existing backups/security/compliance, no new infrastructure.
- Cons: Often slower than native vector DBs at massive scale (billion+ vectors); vector search is a second-class citizen.
- Use Case: Enterprise apps where data gravity is in Postgres, medium-scale datasets (<100M vectors).
30.1.2. Architecture: AWS OpenSearch Serverless (Vector Engine)
AWS OpenSearch (formerly Amazon Elasticsearch Service) offers a serverless “Vector Engine” mode that decouples compute from storage, providing a cloud-native experience.
Key Characteristics
- Decoupled Architecture: Storage lives in S3; compute runs as effectively stateless OpenSearch Compute Units (OCUs), split between indexing and search.
- Algorithm: Uses NMSLIB (Non-Metric Space Library) implementing HNSW (Hierarchical Navigable Small World) graphs.
- Scale: Supports billions of vectors.
- Serverless: Auto-scaling of OCUs based on traffic.
Infrastructure as Code (Terraform)
Deploying a production-ready Serverless Vector Collection requires handling encryption, network policies, and data access policies.
# -----------------------------------------------------------------------------
# AWS OpenSearch Serverless: Vector Engine
# -----------------------------------------------------------------------------
resource "aws_opensearchserverless_collection" "rag_memory" {
  name        = "rag-prod-memory"
  type        = "VECTORSEARCH" # The critical flag
  description = "Long-term memory for GenAI Platform"

  depends_on = [
    aws_opensearchserverless_security_policy.encryption
  ]
}
# 1. Encryption Policy (KMS)
resource "aws_opensearchserverless_security_policy" "encryption" {
  name        = "rag-encryption-policy"
  type        = "encryption"
  description = "Encryption at rest for RAG contents"

  policy = jsonencode({
    Rules = [
      {
        ResourceType = "collection"
        Resource     = ["collection/rag-prod-memory"]
      }
    ]
    AWSOwnedKey = true # Or specify your own KMS key ARN
  })
}
# 2. Network Policy (VPC vs Public)
resource "aws_opensearchserverless_security_policy" "network" {
  name        = "rag-network-policy"
  type        = "network"
  description = "Allow access from VPC and VPN"

  policy = jsonencode([
    {
      Rules = [
        {
          ResourceType = "collection"
          Resource     = ["collection/rag-prod-memory"]
        },
        {
          ResourceType = "dashboard"
          Resource     = ["collection/rag-prod-memory"]
        }
      ]
      AllowFromPublic = false
      SourceVPCEs = [
        aws_opensearchserverless_vpc_endpoint.main.id
      ]
    }
  ])
}
# 3. Data Access Policy (IAM)
resource "aws_opensearchserverless_access_policy" "data_access" {
  name        = "rag-data-access"
  type        = "data"
  description = "Allow RAG Lambda and SageMaker roles to read/write"

  policy = jsonencode([
    {
      Rules = [
        {
          ResourceType = "collection"
          Resource     = ["collection/rag-prod-memory"]
          Permission = [
            "aoss:CreateCollectionItems",
            "aoss:DeleteCollectionItems",
            "aoss:UpdateCollectionItems",
            "aoss:DescribeCollectionItems"
          ]
        },
        {
          ResourceType = "index"
          Resource     = ["index/rag-prod-memory/*"]
          Permission = [
            "aoss:CreateIndex",
            "aoss:DeleteIndex",
            "aoss:UpdateIndex",
            "aoss:DescribeIndex",
            "aoss:ReadDocument",
            "aoss:WriteDocument"
          ]
        }
      ]
      Principal = [
        aws_iam_role.rag_inference_lambda.arn,
        aws_iam_role.indexing_batch_job.arn,
        data.aws_caller_identity.current.arn # Admin access
      ]
    }
  ])
}
# VPC Endpoint for private access
resource "aws_opensearchserverless_vpc_endpoint" "main" {
  name       = "rag-vpce"
  vpc_id     = var.vpc_id
  subnet_ids = var.private_subnet_ids

  security_group_ids = [
    aws_security_group.opensearch_client_sg.id
  ]
}
Creating the Index (Python)
Once infrastructure is up, you define the index mapping.
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
import boto3

# Auth: sign requests with SigV4 against the "aoss" (OpenSearch Serverless) service
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, 'us-east-1', 'aoss')

# Client
client = OpenSearch(
    hosts=[{'host': 'Use-The-Collection-Endpoint.us-east-1.aoss.amazonaws.com', 'port': 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

# Define Index
index_name = "corp-knowledge-base-v1"
index_body = {
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 100  # Tradeoff: Recall vs Latency
        }
    },
    "mappings": {
        "properties": {
            "vector_embedding": {
                "type": "knn_vector",
                "dimension": 1536,  # E.g., for OpenAI text-embedding-3-small
                "method": {
                    "name": "hnsw",
                    "engine": "nmslib",
                    "space_type": "cosinesimil",  # Cosine similarity is standard for embeddings
                    "parameters": {
                        "ef_construction": 128,
                        "m": 24  # Max connections per node
                    }
                }
            },
            "text_content": {"type": "text"},  # For keyword search (hybrid)
            "metadata": {
                "properties": {
                    "source": {"type": "keyword"},
                    "created_at": {"type": "date"},
                    "access_level": {"type": "keyword"}
                }
            }
        }
    }
}

if not client.indices.exists(index=index_name):
    client.indices.create(index=index_name, body=index_body)
    print(f"Index {index_name} created.")
30.1.3. Architecture: GCP Vertex AI Vector Search
Google’s offering (formerly Matching Engine) is built on ScaNN (Scalable Nearest Neighbors), a Google Research algorithm (also available as an open-source library) that often outperforms HNSW and IVF-Flat in ANN benchmarks.
Key Characteristics
- High Throughput: Capable of extremely high QPS (Queries Per Second).
- Recall/Performance: ScaNN uses anisotropic vector quantization which respects the dot product geometry better than standard K-means quantization.
- Architecture: Separate control plane (Index) and data plane (IndexEndpoint).
Infrastructure as Code (Terraform)
# -----------------------------------------------------------------------------
# GCP Vertex AI Vector Search
# -----------------------------------------------------------------------------
resource "google_storage_bucket" "vector_bucket" {
  name     = "gcp-ml-vector-store-${var.project_id}"
  location = "US"
}

# 1. The Index (Logical Definition)
# Note: You generally create indexes via API/SDK in standard MLOps
# because they are immutable/versioned artifacts, but here is the TF resource.
resource "google_vertex_ai_index" "main_index" {
  display_name = "production-knowledge-base"
  description  = "Main RAG index using ScaNN"
  region       = "us-central1"

  metadata {
    contents_delta_uri = "gs://${google_storage_bucket.vector_bucket.name}/indexes/v1"

    config {
      dimensions                  = 768 # E.g., for Gecko embeddings
      approximate_neighbors_count = 150
      distance_measure_type       = "DOT_PRODUCT_DISTANCE"

      algorithm_config {
        tree_ah_config {
          leaf_node_embedding_count    = 500
          leaf_nodes_to_search_percent = 7
        }
      }
    }
  }

  index_update_method = "STREAM_UPDATE" # Enable real-time updates
}
# 2. The Index Endpoint (Serving Infrastructure)
resource "google_vertex_ai_index_endpoint" "main_endpoint" {
  display_name = "rag-endpoint-public"
  region       = "us-central1"
  network      = "projects/${var.project_number}/global/networks/${var.vpc_network}"
}
# 3. Deployment (Deploy Index to Endpoint)
resource "google_vertex_ai_index_endpoint_deployed_index" "deployment" {
  depends_on = [google_vertex_ai_index.main_index]

  index_endpoint    = google_vertex_ai_index_endpoint.main_endpoint.id
  index             = google_vertex_ai_index.main_index.id
  deployed_index_id = "deployed_v1"
  display_name      = "production-v1"

  dedicated_resources {
    min_replica_count = 2
    max_replica_count = 10

    machine_spec {
      machine_type = "e2-standard-16"
    }
  }
}
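Querying the deployed index from application code goes through the google-cloud-aiplatform SDK. A minimal sketch, assuming a 768-dimensional query_embedding produced elsewhere; the project and endpoint resource names are placeholders, and find_neighbors targets public/PSC endpoints (VPC-peered private endpoints use the lower-level match() call instead).

from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="us-central1")

# Attach to the already-deployed endpoint (resource name is a placeholder)
endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name="projects/123456789/locations/us-central1/indexEndpoints/987654321"
)

# query_embedding: list[float] of length 768, produced by your embedding model
response = endpoint.find_neighbors(
    deployed_index_id="deployed_v1",   # matches the Terraform deployed_index_id above
    queries=[query_embedding],
    num_neighbors=10,
)

for neighbor in response[0]:
    print(neighbor.id, neighbor.distance)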
ScaNN vs. HNSW
Why choose Vertex/ScaNN?
- HNSW: Graph-based. Great per-query latency. Memory intensive (graph structure). Random access patterns (bad for disk).
- ScaNN: Quantization-based + Tree search. Higher compression. Google hardware optimization.
30.1.4. RDS pgvector: The “Just Use Postgres” Option
For many teams, introducing a new database (OpenSearch or Weaviate) is operational overhead they don’t want. pgvector is an extension for PostgreSQL that enables vector similarity search.
Why pgvector?
- Transactional: ACID compliance for your vectors.
- Joins: Join standard SQL columns with vector search results in one query.
- Familiarity: It’s just Postgres.
Infrastructure (Terraform)
resource "aws_db_instance" "postgres" {
identifier = "rag-postgres-db"
engine = "postgres"
engine_version = "15.3"
instance_class = "db.r6g.xlarge" # Memory optimized for vectors
allocated_storage = 100
# Ensure you install the extension
# Note: You'll typically do this in a migration script, not Terraform
}
SQL Implementation
-- 1. Enable Extension
CREATE EXTENSION IF NOT EXISTS vector;

-- 2. Create Table
CREATE TABLE documents (
    id bigserial PRIMARY KEY,
    content text,
    metadata jsonb,
    embedding vector(1536) -- OpenAI dimension
);

-- 3. Create HNSW Index (Vital for performance!)
-- ivfflat is simpler but hnsw is generally preferred for recall/performance
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- 4. Query (KNN)
SELECT content, metadata, 1 - (embedding <=> '[...vector...]') AS similarity
FROM documents
ORDER BY embedding <=> '[...vector...]' -- <=> is the cosine distance operator
LIMIT 5;

-- 5. Hybrid Query (SQL + Vector)
SELECT content
FROM documents
WHERE metadata->>'category' = 'finance' -- SQL filter
ORDER BY embedding <=> '[...vector...]'
LIMIT 5;
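From application code this is plain SQL over any Postgres driver. A minimal sketch using psycopg2, assuming the documents table above and an embed() helper that returns a Python list of 1536 floats; the connection parameters are placeholders.

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="rag",
                        user="rag_app", password="change-me")

def search(query_text: str, category: str, k: int = 5):
    # pgvector accepts the '[1.0,2.0,...]' text representation cast to ::vector
    query_vec = "[" + ",".join(str(x) for x in embed(query_text)) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, 1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            WHERE metadata->>'category' = %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (query_vec, category, query_vec, k),
        )
        return cur.fetchall()

print(search("quarterly revenue growth", "finance"))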
30.1.5. Deep Dive: Indexing Algorithms and Tuning
The choice of index algorithm dictates the “Recall vs. Latency vs. Memory” triangle. Understanding the internals of these algorithms is mandatory for tuning production systems.
1. Inverted File Index (IVF-Flat)
IVF allows you to speed up search by clustering the vector space and only searching a subset.
- Mechanism:
  - Training: Run K-Means on a sample of the data to find $C$ centroids (where `nlist` = $C$).
  - Indexing: Assign every vector in the dataset to its nearest centroid.
  - Querying: Find the closest `nprobe` centroids to the query vector; search only the vectors in those buckets.
- Parameters (see the FAISS sketch below):
  - `nlist`: Number of clusters. Recommendation: $4 \times \sqrt{N}$ (where $N$ is the total number of vectors).
  - `nprobe`: Number of buckets to search.
    - `nprobe = 1`: Fast, low recall (only the single closest bucket is searched).
    - `nprobe = nlist`: Slow, perfect recall (brute force).
    - Sweet spot: typically 1-5% of `nlist`.
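A minimal FAISS sketch of IVF-Flat, included to make the nlist/nprobe trade-off concrete; the data is random and the sizes are illustrative, not a benchmark.

import faiss
import numpy as np

d, n = 768, 100_000
xb = np.random.random((n, d)).astype("float32")   # database vectors
xq = np.random.random((5, d)).astype("float32")   # query vectors

nlist = int(4 * np.sqrt(n))                       # rule of thumb from above (~1265 clusters)
quantizer = faiss.IndexFlatL2(d)                  # coarse quantizer used to assign centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(xb)                                   # K-Means to learn the nlist centroids
index.add(xb)                                     # assign each vector to its nearest centroid

index.nprobe = 16                                 # ~1% of nlist: the recall/latency knob
distances, ids = index.search(xq, 10)
print(ids[0])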
2. Product Quantization (PQ) with IVF (IVF-PQ)
IVF reduces the search scope, but PQ reduces the memory footprint.
- Mechanism:
- Split the high-dimensional vector (e.g., 1024 dims) into $M$ sub-vectors (e.g., 8 sub-vectors of 128 dims).
- Run K-means on each subspace to create a codebook.
- Replace the float32 values with the centroid ID (usually 1 byte).
- Result: Massive compression (e.g., 32x to 64x).
- Trade-off: PQ introduces loss. Distances are approximated. You might miss the true nearest neighbor because the vector was compressed.
- Refinement: Often used with a “Re-ranking” step where you load the full float32 vectors for just the top-k candidates to correct the order.
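The same FAISS setup with product quantization added. This sketch (again random data, illustrative sizes) compresses each 768-dim float32 vector to 96 one-byte codes (96 sub-vectors of 8 bits, a 32x reduction) and then applies the re-ranking refinement described above.

import faiss
import numpy as np

d, n = 768, 100_000
xb = np.random.random((n, d)).astype("float32")
xq = np.random.random((5, d)).astype("float32")

nlist, m, nbits = 1024, 96, 8                     # 96 sub-vectors x 8 bits = 96 bytes/vector (vs 3072)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)                                   # learns both the IVF centroids and the PQ codebooks
index.add(xb)
index.nprobe = 16

# PQ distances are approximate: over-fetch, then re-rank candidates with exact distances
distances, ids = index.search(xq, 100)            # top-100 by compressed (approximate) distance
exact = np.linalg.norm(xb[ids[0]] - xq[0], axis=1)
reranked = ids[0][np.argsort(exact)][:10]         # final top-10 after full-precision re-ranking
print(reranked)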
3. Hierarchical Navigable Small World (HNSW)
HNSW is the industry standard for in-memory vector search because it offers logarithmic complexity $O(\log N)$ with high recall.
- Graph Structure:
- It’s a multi-layered graph (a Skip List for graphs).
- Layer 0: Contains all data points (dense).
- Layer K: Contains a sparse subset of points serving as “expressways”.
- Search Process:
- Enter at the top layer.
- Greedily traverse to the nearest neighbor in that layer.
- “Descend” to the next layer down, using that node as the entry point.
- Repeat until Layer 0.
- Tuning `M` (max connections per node):
  - Controls memory usage and recall.
  - Range: 4 to 64.
  - Higher `M` = better recall and robustness against “islands” in the graph, but higher RAM usage per vector.
- Tuning `ef_construction`:
  - Size of the dynamic candidate list during index build.
  - Higher = better-quality graph (fewer disconnected components), but significantly slower indexing.
  - Rule of thumb: `ef_construction` $\approx 2 \times M$.
- Tuning `ef_search`:
  - Size of the candidate list during a query.
  - Higher = better recall, higher latency.
  - Dynamic tuning: you can change `ef_search` at runtime without rebuilding the index. This is your knob for “high precision mode” vs. “high speed mode” (see the sketch below).
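A minimal FAISS HNSW sketch showing the three knobs: efConstruction must be set before add(), while efSearch can be changed per query workload without touching the index. Random data, illustrative sizes.

import faiss
import numpy as np

d, n = 768, 100_000
xb = np.random.random((n, d)).astype("float32")
xq = np.random.random((5, d)).astype("float32")

M = 24                                      # max connections per node (memory vs. recall)
index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = 2 * M           # build-time candidate list (~2x M rule of thumb)
index.add(xb)                               # no train() step: HNSW builds the graph incrementally

index.hnsw.efSearch = 32                    # "high speed mode"
_, fast_ids = index.search(xq, 10)

index.hnsw.efSearch = 256                   # "high precision mode" -- same index, no rebuild
_, precise_ids = index.search(xq, 10)
print(fast_ids[0], precise_ids[0])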
4. DiskANN (Vamana Graph)
As vector datasets grow to 1 billion+ (e.g., embedding every paragraph of a corporate SharePoint history), RAM becomes the bottleneck. HNSW requires all nodes in memory.
DiskANN solves this by leveraging modern NVMe SSD speeds.
- Vamana Graph: A graph structure designed to minimize the number of hops (disk reads) to find a neighbor.
- Mechanism:
- Keep a compressed representation (PQ) in RAM for fast navigation.
- Keep full vectors on NVMe SSD.
- During search, use RAM to narrow down candidates.
- Fetch full vectors from disk only for final distance verification.
- Cost: Store 1B vectors on $200 of SSD instead of $5000 of RAM.
30.1.6. Capacity Planning and Sizing Guide
Sizing a vector cluster is more complex than a standard DB because vectors are computationally heavy (distance calculations) and memory heavy.
1. Storage Calculation
Vectors are dense float arrays. $$ Size_{GB} = \frac{N \times D \times 4}{1024^3} $$
- $N$: Number of vectors.
- $D$: Dimensions.
- $4$: Bytes per float32.
Overhead:
- HNSW: Adds overhead for storing graph edges. Add ~10-20% for links.
- Metadata: Don’t forget the JSON metadata stored with vectors! Often larger than the vector itself.
Example:
- 100M vectors.
- OpenAI `text-embedding-3-small` (1536 dims).
- 1 KB of metadata per doc.
- Vector size: $100,000,000 \times 1536 \times 4 \text{ bytes} \approx 614 \text{ GB}$ (about 572 GiB).
- Metadata size: $100,000,000 \times 1 \text{ KB} \approx 100 \text{ GB}$.
- Index overhead (HNSW): ~100 GB.
- Total: ~814 GB of RAM (if using HNSW) or disk (if using DiskANN).
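The arithmetic above is worth scripting so it can be rerun per corpus. A small helper implementing the same estimate, assuming a flat 15% index overhead in line with the 10-20% figure above (decimal GB for simplicity):

def estimate_memory_gb(num_vectors: int, dims: int, metadata_bytes: int = 1024,
                       index_overhead: float = 0.15, bytes_per_float: int = 4) -> dict:
    """Rough memory/storage estimate for a vector index, in decimal GB."""
    vector_gb = num_vectors * dims * bytes_per_float / 1e9
    metadata_gb = num_vectors * metadata_bytes / 1e9
    overhead_gb = vector_gb * index_overhead   # ~10-20% for HNSW graph links (see above)
    return {
        "vectors_gb": round(vector_gb, 1),
        "metadata_gb": round(metadata_gb, 1),
        "index_overhead_gb": round(overhead_gb, 1),
        "total_gb": round(vector_gb + metadata_gb + overhead_gb, 1),
    }

# The 100M x 1536-dim example from above
print(estimate_memory_gb(num_vectors=100_000_000, dims=1536))
# -> roughly {'vectors_gb': 614.4, 'metadata_gb': 102.4, 'index_overhead_gb': 92.2, 'total_gb': 809.0}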
2. Compute Calculation (QPS)
QPS depends on ef_search and CPU cores.
- Recall vs Latency Curve:
- For 95% Recall, you might get 1000 QPS.
- For 99% Recall, you might drop to 200 QPS.
- Sharding:
- Vector search is easily parallelizable.
- Throughput Sharding: Replicate the entire index to multiple nodes. Load balance queries.
- Data Sharding: Split the index into 4 parts. Query all 4 in parallel, merge results (Map-Reduce). Necessary when index > RAM.
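Data sharding is conceptually just scatter-gather: every shard answers the same top-k query over its slice, and a coordinator merges by score. A minimal, backend-agnostic sketch; the shards list and their search() method are hypothetical stand-ins for per-shard clients.

import heapq
from concurrent.futures import ThreadPoolExecutor

def search_all_shards(shards, query_vector, k: int = 10):
    """Fan the query out to every shard in parallel, then merge the partial top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        # each shard.search(...) is assumed to return [(doc_id, score), ...] sorted by score desc
        partial_results = pool.map(lambda s: s.search(query_vector, k), shards)

    # Merge: global top-k across all shards (higher score = more similar)
    merged = heapq.nlargest(k, (hit for hits in partial_results for hit in hits),
                            key=lambda hit: hit[1])
    return merged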
30.1.7. Production Challenges & Anti-Patterns
1. The “Delete” Problem
HNSW graphs are hard to modify. Deleting a node leaves a “hole” in the graph connectivity.
- Standard Implementation: “Soft Delete” (mark as deleted).
- Consequence: Over time, the graph quality degrades, and the “deleted” nodes still consume RAM and are processed during search (just filtered out at the end).
- Fix: Periodic “Force Merge” or “Re-index” operations are required to clean up garbage. Treat vector indexes as ephemeral artifacts that are rebuilt nightly/weekly.
2. The “Update” Problem
Updating a vector (re-embedding a document) is effectively a Delete + Insert.
- Impact: High write churn kills read latency in HNSW.
- Architecture: Separate read/write paths (Lambda architecture):
  - Batch layer: Rebuild the full index every night.
  - Speed layer: A small in-memory index for today’s data.
  - Query: Search both, merge results.
3. Dimensionality Curse
Higher dimensions = Better semantic capture? Not always.
- Going from 768 (BERT) to 1536 (OpenAI) doubles memory and halves speed.
- MRL (Matryoshka Representation Learning): See Chapter 30.2. Use dynamically shortened embeddings to save cost.
30.1.8. Security: Infrastructure as Code for Multi-Tenant Vector Stores
If you are building a RAG platform for multiple internal teams (HR, Engineering, Legal), you must segregate data.
Strategy 1: Index-per-Tenant
- Pros: Hard isolation. Easy to delete tenant data.
- Cons: Resource waste (overhead per index).
Strategy 2: Filter-based Segregation
All vectors in one big index, with a tenant_id field.
- Pros: Efficient resource usage.
- Cons: One bug in filter logic leaks Legal data to Engineering.
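Filter-based segregation stands or falls with the filter being applied on every single query, never left to the caller. A minimal sketch of the idea using the Qdrant client (collection and payload field names are illustrative); the same pattern applies to any store that supports pre-filtered ANN.

from qdrant_client import QdrantClient, models

client = QdrantClient("localhost")

def tenant_search(tenant_id: str, query_vector, limit: int = 10):
    """Every query is forced through a server-side tenant_id filter -- never optional."""
    return client.search(
        collection_name="rag_documents",
        query_vector=query_vector,
        query_filter=models.Filter(
            must=[models.FieldCondition(key="tenant_id",
                                        match=models.MatchValue(value=tenant_id))]
        ),
        limit=limit,
    )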
Terraform for Secure OpenSearch
Implementing granular IAM for index-level access.
# IAM Policy for restricting access to specific indices
resource "aws_iam_policy" "hr_only_policy" {
  name        = "rag-hr-data-access"
  description = "Access only HR indices"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["aoss:APIAccessAll"]
        Resource = ["arn:aws:aoss:us-east-1:123456789012:collection/rag-prod"]
        Condition = {
          "StringLike" = { # StringLike, not StringEquals: the value contains a wildcard
            "aoss:index" = "hr-*"
          }
        }
      }
    ]
  })
}
30.1.9. Benchmarking Framework
Never trust vendor benchmarks (“1 Million QPS!”). Run your own with your specific data distribution and vector dimension.
VectorDBBench
A popular open-source tool for comparing vector DBs.
pip install vectordb-bench
# Run a standard benchmark
vectordb-bench run \
  --db opensearch \
  --dataset gist-960-euclidean \
  --test_cases performance \
  --output_dir ./results
Key Metrics to Measure
- QPS at 99% Recall: The only metric that matters. High QPS at 50% recall is useless.
- P99 Latency: RAG is a chain; high tail latency breaks the UX.
- Indexing Speed: How long to ingest 10M docs? (Critical for disaster recovery).
- TCO per Million Vectors: Hardware cost + license cost.
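Recall is measured against exact (brute-force) search on a held-out sample of queries. A small helper, assuming ann_ids are the ID matrices returned by the system under test and exact_ids come from a flat/brute-force index over the same data:

import numpy as np

def recall_at_k(ann_ids: np.ndarray, exact_ids: np.ndarray, k: int = 10) -> float:
    """Fraction of the true top-k neighbors the ANN index also returned, averaged over queries."""
    hits = 0
    for approx, exact in zip(ann_ids[:, :k], exact_ids[:, :k]):
        hits += len(set(approx) & set(exact))
    return hits / (len(ann_ids) * k)

# Example usage against a brute-force baseline built over the same vectors:
# exact_ids = flat_index.search(queries, k)[1]
# ann_ids   = hnsw_index.search(queries, k)[1]
# print(f"recall@10 = {recall_at_k(ann_ids, exact_ids, k=10):.3f}")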
30.1.10. Detailed Comparison Matrix
| Feature | AWS OpenSearch Serverless | Vertex AI Vector Search | pgvector (RDS) | Pinecone (Serverless) |
|---|---|---|---|---|
| Core Algo | HNSW (NMSLIB) | ScaNN | HNSW / IVFFlat | Proprietary Graph |
| Engine | Lucene-based | Google Research | Postgres Extension | Proprietary |
| Storage Tier | S3 (decoupled) | GCS | EBS (coupled) | S3 (decoupled) |
| Upsert Speed | Moderate (~seconds) | Fast (streaming) | Fast (transactional) | Fast |
| Cold Start | Yes (OCU spinup) | No (Always on) | No | Yes |
| Hybrid Search | Native (Keyword+Vector) | Limited (mostly vector) | Native (SQL+Vector) | Native (Sparse-Dense) |
| Metadata Filter | Efficient | Efficient | Very Efficient | Efficient |
| Cost Model | Per OCU-hour | Per Node-hour | Instance Size | Usage-based |
Decision Guide
- Choose AWS OpenSearch if: You are already deep in AWS, need FIPS compliance, and want “Serverless” scaling.
- Choose Vertex AI if: You have massive scale (>100M), strict latency budgets (<10ms), and Google-level recall needs.
- Choose pgvector if: You have <10M vectors, need ACID transactions, want to keep stack simple (one DB).
- Choose Pinecone if: You want zero infrastructure management and best-in-class developer experience.
30.1.12. Integration with Feature Stores
In a mature MLOps stack, the Vector Database does not live in isolation. It typically acts as a “candidate generator” that feeds a more complex ranking system powered by a Feature Store.
The “Retrieve -> Enrich -> Rank” Pattern
- Retrieve (Vector DB): Get top 100 items suitable for the user (based on embedding similarity).
- Enrich (Feature Store): Fetch real-time features for those 100 items (e.g., “click_count_last_hour”, “stock_status”, “price”).
- Rank (XGBoost/LLM): Re-score the items based on the fresh feature data.
Why not store everything in the Vector DB?
Vector DBs are eventually consistent and optimized for immutable data. They are terrible at high-velocity updates (like “view count”).
- Vector DB: Stores Description embedding (Static).
- Feature Store (Redis/Feast): Stores Price, Inventory, Popularity (Dynamic).
Code: Feast + Qdrant Integration
from feast import FeatureStore
from qdrant_client import QdrantClient

# 1. Retrieve Candidates (Vector DB)
q_client = QdrantClient("localhost")
hits = q_client.search(
    collection_name="products",
    query_vector=user_embedding,   # embedding of the current user/query, computed upstream
    limit=100
)
product_ids = [hit.payload['product_id'] for hit in hits]

# 2. Enrich (Feast)
store = FeatureStore(repo_path=".")
feature_vector = store.get_online_features(
    features=[
        "product_stats:view_count_1h",
        "product_stats:conversion_rate_24h",
        "product_stock:is_available"
    ],
    entity_rows=[{"product_id": pid} for pid in product_ids]
).to_dict()

# 3. Rank (Custom Logic)
ranked_products = []
for pid, views, conv, avail in zip(
    product_ids,
    feature_vector["view_count_1h"],
    feature_vector["conversion_rate_24h"],
    feature_vector["is_available"],
):
    if not avail:
        continue  # Filter out-of-stock items
    score = (views * 0.1) + (conv * 50)  # Simple heuristic; replace with an XGBoost/LLM re-ranker
    ranked_products.append((pid, score))

ranked_products.sort(key=lambda x: x[1], reverse=True)
30.1.13. Multimodal RAG: Beyond Text
RAG is no longer just for text. Multimodal RAG allows searching across images, audio, and video using models like CLIP (Contrastive Language-Image Pre-Training).
Architecture
- Embedding Model: CLIP (OpenAI) or SigLIP (Google). Maps Image and Text to the same vector space.
- Storage:
  - Vector DB: Stores the embedding.
  - Object Store (S3): Stores the actual JPEG/PNG.
  - Metadata: Stores the S3 URI (s3://bucket/photo.jpg).
CLIP Search Implementation
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# 1. Indexing an Image
image = Image.open("dog.jpg")
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
# Normalize so that dot product == cosine similarity in the shared space
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
vector_db.add(id="dog_1", vector=image_features.detach().numpy())  # vector_db: your client of choice

# 2. Querying with Text ("Find me images of dogs")
text_inputs = processor(text=["a photo of a dog"], return_tensors="pt")
text_features = model.get_text_features(**text_inputs)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
results = vector_db.search(text_features.detach().numpy())

# 3. Querying with Image ("Find images like this one")
# Just call get_image_features() on the query image and search.
Challenges
- Storage Cost: Images are large. Don’t store Base64 blobs in the Vector DB payload. It kills performance.
- Latency: CLIP inference is heavier than BERT.
30.1.14. Compliance: GDPR and “Right to be Forgotten”
Vector Databases are databases. They contain PII. You must be able to delete data from them.
The “Deletion” Nightmare
As discussed in the Indexing section, HNSW graphs hate deletions. However, GDPR Article 17 grants a “Right to Erasure” that must be honored without undue delay (in practice, within about a month).
Strategies
- Partitioning by User: If you have a B2C app, create a separate index (or partition) per user. Deleting a user = dropping the partition.
  - Feasible for: “Chat with my Data” apps.
  - Impossible for: application-wide search.
- Crypto-Shredding:
  - Encrypt the metadata payload with a per-user key.
  - Store the key in a separate KMS.
  - To “delete” the user, destroy the key. The data is now garbage.
  - Note: This doesn’t remove the vector from the graph, so the document may still appear in search results (as a generic blob), but the content is unreadable.
- The “Blacklist” Filter (sketched below):
  - Maintain a Redis set of `deleted_doc_ids`.
  - Apply this as a mandatory exclusion filter on every query.
  - Rebuild the index monthly to permanently purge the data.
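The blacklist filter is easy to get wrong if it is applied in application code after the fact (post-filtering shrinks your result set). A minimal sketch of the pattern, assuming a Redis set named deleted_doc_ids and an OpenSearch-style query body; the key point is that the exclusion is part of the query sent to the engine, not a post-hoc filter.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def build_query(query_embedding, k: int = 5):
    """k-NN query with a mandatory exclusion of soft-deleted documents."""
    deleted_ids = list(r.smembers("deleted_doc_ids"))   # the GDPR blacklist
    return {
        "size": k,
        "query": {
            "bool": {
                "must": [
                    {"knn": {"vector_embedding": {"vector": query_embedding, "k": k}}}
                ],
                "must_not": [
                    {"ids": {"values": deleted_ids}}     # excluded inside the engine, on every query
                ],
            }
        },
    }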
30.1.15. Cost Analysis: Build vs. Buy
The most common question from leadership: “Why does this cost $5,000/month?”
Scenario: 50 Million Vectors (Enterprise Scale)
- Dimensions: 1536 (OpenAI).
- Traffic: 100 QPS.
Option A: Managed (Pinecone / Weaviate Cloud)
- Pricing: Usage-based (Storage + Read Units + Write Units).
- Storage: ~$1000/month (for pod-based systems).
- Compute: Usage based.
- Total: ~$1,500 - $3,000 / month.
- Ops Effort: Near Zero.
Option B: Self-Hosted (AWS OpenSearch Managed)
- Data Nodes: 3x r6g.2xlarge (64 GB RAM each).
  - Cost: $0.26 * 24 * 30 * 3 ≈ $562.
- Master Nodes: 3x m6g.large.
  - Cost: $0.08 * 24 * 30 * 3 ≈ $173.
- Storage (EBS): 1 TB gp3.
  - Cost: ~$100.
- Total: ~$835 / month.
- Ops Effort: Medium (upgrades, resizing, dashboards).
Option C: DIY (EC2 + Qdrant/Milvus)
- Nodes: 3x Spot Instances (r6g.2xlarge).
  - Cost: ~$200 / month (Spot pricing).
- Total: ~$200-$300 / month.
- Ops Effort: High (Kubernetes, HA, Spot interruption handling).
Verdict: Unless you are Pinterest or Uber, Managed or Cloud Native (OpenSearch) is usually the right answer. The engineering time spent fixing a corrupted HNSW graph on a Saturday is worth more than the $1000 savings.
30.1.17. Case Study: Migrating to Billion-Scale RAG at “FinTechCorp”
Scaling from a POC (1 million docs) to Enterprise Search (1 billion docs) breaks almost every assumption you made in the beginning.
The Challenge
FinTechCorp had 20 years of PDF financial reports.
- Volume: 500 Million pages.
- Current Stack: Elasticsearch (Keyword).
- Goal: “Chat with your Documents” for 5,000 analysts.
Phase 1: The POC (Chroma)
- Setup: Single Python server running ChromaDB.
- Result: Great success on 100k docs. Analysts loved the semantic search.
- Failure: When they loaded 10M docs, the server crashed with OOM (Out of Memory). HNSW requires RAM.
Phase 2: The Scale-Out (OpenSearch + DiskANN)
- Decision: They couldn’t afford 5TB of RAM to hold the HNSW graph.
- Move: Switched to OpenSearch Service with `nmslib`.
- Optimization:
  - Quantization: Used byte-quantized (int8) vectors (4x memory reduction vs. float32) at the cost of slight precision loss.
  - Sharding: Split the index into 20 shards across 6 data nodes.
Phase 3: The Ingestion Bottleneck
- Problem: Re-indexing took 3 weeks.
- Fix: Built a Spark job on EMR to generate embeddings in parallel (1000 node cluster) and bulk-load into OpenSearch.
Outcome
- Latency: 120ms (P99).
- Recall: 96% compared to brute force.
- Cost: $4,500/month (Managed instances + Storage).
30.1.18. War Story: The “NaN” Embedding Disaster
“Production is down. Search is returning random results. It thinks ‘Apple’ is similar to ‘Microscope’.”
The Incident
On a Tuesday afternoon, the accuracy of the RAG system plummeted to zero. Users searching for “Quarterly Results” got documents about “Fire Safety Procedures.”
The Investigation
- Logs: No errors. 200 OK everywhere.
- Debug: We inspected the vectors and found that 0.1% of them contained NaN (Not a Number) values.
- Root Cause: The embedding model (BERT) had a bug where certain Unicode characters (emoji + Zalgo text) caused a division by zero in the LayerNorm layer.
- Propagation: Because HNSW relies on distance calculations, one NaN in the graph “poisoned” the distance metrics of its neighbors during the index build, effectively corrupting the entire graph structure.
The Fix
- Validation: Added a schema check in the ingestion pipeline: `assert not np.isnan(vector).any()`.
- Sanitization: Stripped non-printable characters before embedding.
- Rebuild: Had to rebuild the entire 50M-vector index from scratch (took 24 hours).
Lesson: Never trust the output of a neural network. Always validate mathematical properties (Norm length, NaN checks) before indexing.
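A small validation gate of the kind described above, to run between the embedding model and the indexer (the dimension and thresholds are illustrative):

import numpy as np

def validate_embedding(vector: np.ndarray, expected_dim: int = 1536) -> np.ndarray:
    """Reject mathematically broken embeddings before they ever reach the index."""
    vector = np.asarray(vector, dtype=np.float32)
    assert vector.shape == (expected_dim,), f"wrong dimension: {vector.shape}"
    assert not np.isnan(vector).any(), "NaN detected in embedding"
    assert not np.isinf(vector).any(), "Inf detected in embedding"
    norm = np.linalg.norm(vector)
    assert norm > 1e-6, "zero-norm embedding (model likely failed silently)"
    return vector / norm   # normalize so cosine/dot-product behave as expected downstream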
30.1.19. Interview Questions
If you are interviewing for an MLOps role focusing on Search/RAG, expect these questions.
Q1: What is the difference between HNSW and IVF?
- Answer: HNSW is a graph-based algorithm. It allows logarithmic traversal but consumes high memory because it stores edges. It generally has better recall. IVF is a clustering-based algorithm. It partitions the space into Voronoi cells. It is faster to train and uses less memory (especially with PQ), but recall can suffer at partition boundaries.
Q2: How do you handle metadata filtering in Vector Search?
- Answer: Explain the difference between Post-filtering (bad recall) and Pre-filtering (slow). Mention “Filtered ANN” where the index traversal skips nodes that don’t match the bitmask.
Q3: What is the “Curse of Dimensionality” in vector search?
- Answer: As dimensions increase, the distance between the nearest and farthest points becomes negligible, making “similarity” meaningless. Also, computational cost scales linearly with $D$. Dimensionality reduction (PCA or Matryoshka) helps.
Q4: How would you scale a vector DB to 100 Billion vectors?
- Answer: RAM is the bottleneck. I would use:
  - Disk-based indexing (DiskANN/Vamana) to store vectors on NVMe.
  - Product Quantization (PQ) to compress vectors by up to 64x.
  - Sharding: horizontal scaling across hundreds of nodes.
  - Tiered storage: hot data in RAM/HNSW, cold data in S3/Faiss-Flat.
30.1.20. Summary
The vector database is the hippocampus of the AI application.
- Don’t over-engineer: Start with `pgvector` or `Chroma` for prototypes.
- Plan for scale: Move to OpenSearch or Vertex when you hit 10M vectors.
- Tune your HNSW: Default settings are rarely optimal. Use the formulas above.
- Capacity Plan: Vectors are RAM-hungry. Calculate costs early.
- Monitor Recall: Latency is easy to measure; recall degradation is silent. Periodically test against a brute-force ground truth.
- Respect Compliance: Have a “Delete” button that actually works.
- Validate Inputs: Beware of `NaN` vectors!