39.3. Real-Time Retrieval (Candidate Generation)
Status: Draft Version: 1.0.0 Tags: #RecSys, #ANN, #Rust, #Vectors, #Milvus Author: MLOps Team
Table of Contents
- The Retrieval Funnel
- The Two-Tower Architecture
- Training: Negative Sampling Strategies
- Approximate Nearest Neighbors (ANN)
- Deep Dive: HNSW Graph Traversal
- Rust Implementation: Vector Search Service
- Infrastructure: Deploying Milvus
- Consistency: The “Index Drift” Problem
- Quantization: Speed vs Precision
- Troubleshooting: Deployment Issues
- MLOps Interview Questions
- Glossary
- Summary Checklist
Prerequisites
Before diving into this chapter, ensure you have the following installed:
- Rust: 1.70+ (ndarray, rayon crates)
- Docker: 20.10+ (for Milvus)
- PyTorch: For defining Two-Tower models.
The Retrieval Funnel
You have 100 million items and the user requests recommendations. You cannot run your heavy Ranker (XGBoost/Transformer) on all of them: at 10 ms per item, 10 ms × 100M ≈ 10^6 seconds ≈ 11.6 days.
We use a Funnel Architecture:
[ All Items (100,000,000) ]
|
| (Stage 1: Retrieval / Candidate Generation)
| Methods: ANN, Matrix Factorization, Graph Traversal
| Latency: 10ms
v
[ Candidates (1,000) ]
|
| (Stage 2: Scoring / Ranking)
| Methods: GBM, Deep Crossing
| Latency: 50ms
v
[ Top-K (10) ]
Retrieval must be:
- Fast: the whole stage must return in roughly 10 ms, i.e., microseconds per candidate actually scored.
- High Recall: Don’t miss the user’s favorite item. Precision doesn’t matter much (the Ranker fixes it).
- Low Cost: Must fit in RAM or use Disk-ANN.
The Two-Tower Architecture
The standard model for Retrieval is the Two-Tower (Dual Encoder) model.
- User Tower: $f(u) \rightarrow \mathbb{R}^d$
- Item Tower: $g(i) \rightarrow \mathbb{R}^d$
- Score: $s(u,i) = \langle f(u), g(i) \rangle$ (Dot Product)
Because the score is a Dot Product, we can precompute all Item vectors $g(i)$ and store them in an ANN index (FAISS/HNSW). At runtime, we only compute $f(u)$ once, then query the ANN.
Math: The Objective Function
We maximize the similarity of positive pairs $(u, i^+)$ while minimizing negatives $(u, i^-)$. Using InfoNCE (Contrastive Loss):
$$ L = -\frac{1}{B} \sum_{k=1}^B \log \frac{\exp(\langle u_k, i_k^+ \rangle / \tau)}{\sum_{j=1}^B \exp(\langle u_k, i_j \rangle / \tau)} $$
Where $\tau$ is the temperature parameter (controlling the sharpness of the distribution).
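A minimal sketch of this loss in Rust (using ndarray, as in the implementation later in this chapter), assuming the $B \times B$ matrix of in-batch dot products has already been computed, with row $k$ holding user $k$ against every item in the batch and the diagonal as the positive pair; info_nce is an illustrative name, not a library function:
use ndarray::Array2;
/// InfoNCE over in-batch logits: softmax cross-entropy with the diagonal as labels.
fn info_nce(logits: &Array2<f32>, tau: f32) -> f32 {
    let b = logits.nrows();
    let mut loss = 0.0;
    for k in 0..b {
        let row = logits.row(k).mapv(|s| s / tau);
        // Log-sum-exp with max subtraction for numerical stability.
        let max = row.fold(f32::NEG_INFINITY, |a, &x| a.max(x));
        let lse = max + row.mapv(|x| (x - max).exp()).sum().ln();
        loss += lse - row[k]; // -log softmax probability of the positive pair
    }
    loss / b as f32
}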
Training: Negative Sampling Strategies
How do we train the towers? We need Positive Pairs and Negative Pairs, but the dataset only contains positives (clicks). We must sample negatives.
1. Random Negatives
Pick a random item $j$ from catalog.
- Pros: Easy, Efficient (In-Batch).
- Cons: Too easy. The model learns “Popular Item vs Random Trash”, not “Popular vs High-Quality Niche”.
2. In-Batch Negatives
For a batch of $B$ pairs $(u_k, i_k)$, treat $u_k$ with $i_j$ (where $k \neq j$) as negatives.
- Efficiency: We reuse the embeddings computed in the batch.
- Correction: In-batch sampling is biased towards popular items (they appear more often in batches). Correct the logits (a sketch follows this list): $$ s(u, i) \leftarrow s(u, i) - \log P(i) $$
3. Hard Negatives (Mining)
Items that the user almost clicked (e.g., impressed but not clicked), or items with high dot product that are actually irrelevant.
- Strategy: Periodically run the model, find the “False Positives”, and add them to the next training set.
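A minimal sketch of in-batch scoring with the logQ correction from strategy 2 above, again using ndarray; the item sampling probabilities $P(i)$ are assumed to come from a separate frequency-estimation job, and in_batch_logits is an illustrative name:
use ndarray::{Array1, Array2};
/// In-batch logits with the logQ popularity correction: s(u, i) <- s(u, i) - log P(i).
/// users, items: (B, D) tower outputs for a batch of positive pairs.
/// item_probs:   (B,) estimated sampling probability P(i) of each in-batch item.
/// Row k holds user k's scores against every item in the batch; the diagonal is
/// the positive pair, the off-diagonal entries act as negatives.
fn in_batch_logits(
    users: &Array2<f32>,
    items: &Array2<f32>,
    item_probs: &Array1<f32>,
) -> Array2<f32> {
    // (B, D) x (D, B) -> (B, B) matrix of dot products.
    let mut logits = users.dot(&items.t());
    // Down-weight popular items: subtract log P(i) from every column j.
    for (j, &p) in item_probs.iter().enumerate() {
        let log_p = p.max(1e-12).ln();
        logits.column_mut(j).mapv_inplace(|s| s - log_p);
    }
    logits // feed into softmax cross-entropy (InfoNCE) with the diagonal as labels
}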
Approximate Nearest Neighbors (ANN)
Exact Search ($O(N)$) is too slow. We use ANN ($O(\log N)$).
Algorithms
- HNSW (Hierarchical Navigable Small World): Graph-based. Best performance/recall trade-off. Memory hungry.
- IVF-PQ (Inverted File with Product Quantization): Clustering + Compression. Low memory.
- LSH (Locality Sensitive Hashing): Random projections. Poor recall compared to HNSW.
Deep Dive: HNSW Graph Traversal
HNSW works like a Skip List for Graphs. It builds a hierarchy of layers.
Layer 3: [Node A] ---------------------------> [Node Z]
| |
Layer 2: [Node A] ------> [Node M] ----------> [Node Z]
| | |
Layer 1: [Node A] -> [B] -> [M] -> [P] -> [X] -> [Z]
Search Process:
- Start at top layer (sparse). Move greedily towards Query $Q$.
- When local minimum reached, drop to lower layer.
- Continue greedy search with finer granularity.
In practice this yields roughly $O(\log N)$ search complexity (an empirical property, not a strict guarantee).
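A minimal, self-contained sketch of that layered descent, assuming a toy graph where each layer is a plain adjacency map and similarity is a dot product; Layer and hnsw_descend are illustrative names, not the API of a real HNSW library:
use std::collections::HashMap;
/// One layer of the hierarchy: node id -> neighbor ids.
type Layer = HashMap<usize, Vec<usize>>;
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
/// Greedy descent: start at `entry` on the top (sparsest) layer, move to the best
/// neighbor until no improvement, then drop one layer and repeat.
fn hnsw_descend(layers: &[Layer], vectors: &[Vec<f32>], query: &[f32], entry: usize) -> usize {
    let mut current = entry;
    for layer in layers {
        // Greedy walk within this layer.
        loop {
            let mut best = current;
            let mut best_score = dot(&vectors[current], query);
            for &n in layer.get(&current).into_iter().flatten() {
                let score = dot(&vectors[n], query);
                if score > best_score {
                    best_score = score;
                    best = n;
                }
            }
            if best == current {
                break; // local optimum on this layer -> descend
            }
            current = best;
        }
    }
    current // entry point for the final, finer-grained search on layer 0
}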
Rust Implementation: Vector Search Service
We implement a simple in-memory vector searcher. For production, wrap faiss-rs or use Qdrant.
Here, we optimize the Dot Product using SIMD logic (via ndarray).
Project Structure
vector-search/
├── Cargo.toml
└── src/
└── lib.rs
Cargo.toml:
[package]
name = "vector-search"
version = "0.1.0"
edition = "2021"
[dependencies]
ndarray = "0.15"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
rayon = "1.7" # Parallelism
src/lib.rs:
//! In-Memory Vector Search Engine.
//! Demonstrates exact search with SIMD optimizations and basic parallelization.
use ndarray::{Array1, Array2, Axis};
use rayon::prelude::*;
use std::cmp::Ordering;
#[derive(Debug, Clone)]
pub struct SearchResult {
pub id: usize,
pub score: f32,
}
pub struct VectorIndex {
// Dense Matrix of shape (N_ITEMS, DIM)
// Contiguous memory layout allows standard BLAS optimization.
vectors: Array2<f32>,
ids: Vec<usize>,
}
impl VectorIndex {
pub fn new(dim: usize, capacity: usize) -> Self {
Self {
vectors: Array2::zeros((0, dim)),
ids: Vec::with_capacity(capacity),
}
}
/// Add a vector to the index.
/// WARN: This triggers re-allocation if capacity is exceeded.
pub fn add(&mut self, id: usize, vector: Array1<f32>) {
// In real impl, handle resizing or use a better structure
// This is simplified append
if self.vectors.shape()[0] == 0 {
self.vectors = vector.insert_axis(Axis(0));
} else {
self.vectors.push(Axis(0), vector.view()).unwrap();
}
self.ids.push(id);
}
/// Brute Force Search (Exact)
/// Optimized with BLAS/SIMD by ndarray's dot product.
/// Complexity: O(N * D)
pub fn search(&self, query: &Array1<f32>, k: usize) -> Vec<SearchResult> {
// Dot Product: (N, D) x (D, 1) -> (N, 1)
// This single line is heavily optimized by OpenBLAS/MKL if linked.
let scores = self.vectors.dot(query);
// Top-K selection: std has no partial_sort, so we collect and fully sort.
// (For large N, prefer select_nth_unstable_by or a bounded BinaryHeap.)
let mut scored_items: Vec<SearchResult> = scores
.iter()
.enumerate()
.map(|(idx, &score)| SearchResult {
id: self.ids[idx],
score,
})
.collect();
// Sort Descending by Score
// Use partial_cmp because floats do not implement Ord (NaN handling)
scored_items.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap_or(Ordering::Equal));
scored_items.into_iter().take(k).collect()
}
/// Simulated HNSW Greedy Step (Conceptual)
/// Moves from current node to neighbor with highest score.
/// This is the core primitive of HNSW traversal.
pub fn greedy_step(&self, query: &Array1<f32>, current_idx: usize, neighbors: &[usize]) -> usize {
let mut best_idx = current_idx;
let mut best_score = self.vectors.row(current_idx).dot(query);
for &neighbor_idx in neighbors {
let score = self.vectors.row(neighbor_idx).dot(query);
if score > best_score {
best_score = score;
best_idx = neighbor_idx;
}
}
best_idx
}
}
#[cfg(test)]
mod tests {
use super::*;
use ndarray::arr1;
#[test]
fn test_search() {
let mut index = VectorIndex::new(3, 10);
index.add(1, arr1(&[1.0, 0.0, 0.0])); // X axis
index.add(2, arr1(&[0.0, 1.0, 0.0])); // Y axis
index.add(3, arr1(&[0.0, 0.0, 1.0])); // Z axis
let query = arr1(&[0.9, 0.1, 0.0]); // Close to X
let results = index.search(&query, 2);
assert_eq!(results[0].id, 1);
assert!(results[0].score > 0.8);
}
}
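The rayon dependency declared above is not exercised by search itself. A minimal sketch of where it could be used, batching many user queries in parallel against the same index; this is an illustrative extension appended to the same lib.rs (it reuses the rayon::prelude and ndarray imports already at the top), not a benchmarked design:
impl VectorIndex {
    /// Answer many user queries in parallel; each query still runs the exact search.
    /// The index is only read here, so sharing it across rayon workers is safe.
    pub fn search_batch(&self, queries: &[Array1<f32>], k: usize) -> Vec<Vec<SearchResult>> {
        queries
            .par_iter()               // one rayon task per query
            .map(|q| self.search(q, k))
            .collect()
    }
}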
Infrastructure: Deploying Milvus
For production, we use Milvus or Qdrant. Here is a minimal docker-compose manifest for a standalone Milvus deployment.
docker-compose.yml:
version: '3.5'
services:
etcd:
container_name: milvus-etcd
image: quay.io/coreos/etcd:v3.5.0
command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls=http://0.0.0.0:2379
minio:
container_name: milvus-minio
image: minio/minio:RELEASE.2020-12-03T00-03-10Z
command: server /data
environment:
MINIO_ACCESS_KEY: minioadmin
MINIO_SECRET_KEY: minioadmin
milvus:
container_name: milvus-standalone
image: milvusdb/milvus:v2.0.0
command: ["milvus", "run", "standalone"]
environment:
ETCD_ENDPOINTS: etcd:2379
MINIO_ADDRESS: minio:9000
volumes:
- /var/lib/milvus:/var/lib/milvus
ports:
- "19530:19530"
Consistency: The “Index Drift” Problem
Your User Tower and Item Tower interact via Dot Product. If you update the User Tower (new deployment) but do not re-index the Item Tower, the dot products are meaningless. The spaces are Misaligned.
The Golden Rule of Embeddings
“You MUST version the encoders and the index together.”
- Bad:
- Deploy Model V2.
- User Service uses V2.
- Vector DB contains V1 vectors.
- Result: Random recommendations.
- Good (Blue/Green Indexing, sketched below):
- Train Model V2.
- Batch Inference: Compute V2 vectors for all 100M items.
- Build Index V2 (Takes 4 hours).
- Deploy Model V2 Service configured to query Index V2.
- Atomic Swap.
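A minimal sketch of the atomic swap step, assuming the serving layer resolves a logical alias (e.g., "items-prod") to a physical, versioned index and the cut-over is a single registry update; the IndexRegistry type and names are illustrative (Milvus and Qdrant expose a similar idea as collection aliases):
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
/// Maps a logical alias to the physical (versioned) index it should serve.
#[derive(Default)]
pub struct IndexRegistry {
    aliases: RwLock<HashMap<String, String>>,
}
impl IndexRegistry {
    /// Resolve the alias at query time; every request sees a consistent version.
    pub fn resolve(&self, alias: &str) -> Option<String> {
        self.aliases.read().unwrap().get(alias).cloned()
    }
    /// Atomic swap: point the alias at the freshly built index.
    /// Model V2 traffic is only routed here AFTER Index V2 is fully built.
    pub fn swap(&self, alias: &str, new_index: &str) {
        self.aliases
            .write()
            .unwrap()
            .insert(alias.to_string(), new_index.to_string());
    }
}
fn main() {
    let registry = Arc::new(IndexRegistry::default());
    registry.swap("items-prod", "items_v1"); // initial state
    // ... batch inference + build of items_v2 completes ...
    registry.swap("items-prod", "items_v2"); // atomic cut-over
    assert_eq!(registry.resolve("items-prod").as_deref(), Some("items_v2"));
}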
Quantization: Speed vs Precision
We can compress vectors from float32 (4 bytes) to int8 (1 byte).
This allows 4x more vectors in RAM.
Product Quantization (PQ): Split vector into sub-vectors and cluster them.
- Recall Loss: Typically < 5% loss.
- Speed Gain: 10x faster distance calculation (Lookup tables).
- Recommendation: Always use PQ for datasets > 10M items.
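A minimal sketch of the simplest variant, scalar int8 quantization, to make the memory/precision trade-off concrete; PQ goes further by splitting each vector into sub-vectors and replacing them with codebook ids. The symmetric per-vector scale below is an assumption for illustration, not any specific library's scheme:
/// Quantize a float32 vector to int8 with a per-vector scale.
/// Memory: 4 bytes/dim -> 1 byte/dim (+ one f32 scale per vector).
fn quantize(v: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = v.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let codes = v.iter().map(|x| (x / scale).round() as i8).collect();
    (codes, scale)
}
/// Approximate dot product between two quantized vectors.
fn dot_i8(a: &[i8], scale_a: f32, b: &[i8], scale_b: f32) -> f32 {
    let acc: i32 = a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum();
    acc as f32 * scale_a * scale_b
}
fn main() {
    let (qa, sa) = quantize(&[0.9, 0.1, 0.0]);
    let (qb, sb) = quantize(&[1.0, 0.0, 0.0]);
    // Close to the exact dot product 0.9, with a small quantization error.
    println!("approx dot = {}", dot_i8(&qa, sa, &qb, sb));
}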
Troubleshooting: Deployment Issues
Scenario 1: Sudden Recall Drop
- Symptom: Recall@100 drops from 95% to 50% after deployment.
- Cause: “Index Drift” (see above). You deployed a new User Model but are querying against an Old Item Index.
- Fix: Rollback User Model immediately. Wait for Index rebuild.
Scenario 2: High Latency (p99 > 1s)
- Symptom: Search takes too long.
- Cause: You are running brute force search on 1M items without an index, or ef_search in HNSW is set too high (checking too many nodes).
- Fix: Tune the HNSW parameters (M, ef_construction, ef_search). Use Quantization.
Scenario 3: Memory OOM
- Symptom: Vector DB crashes.
- Cause: Vectors are loaded in RAM. 100M * 768 * 4 bytes = 300GB of RAM.
- Fix: Switch to DiskANN (Store vectors on NVMe, only graph in RAM) or use IVFPQ quantization.
MLOps Interview Questions
- Q: What is the “Dot Product Bottleneck”? A: The Two-Tower model restricts the interaction between User and Item to a simple dot product (or sum/concat). It cannot capture complex interactions like “User likes Sci-Fi ONLY if it is also Comedy”. This is why we need a Ranker (Cross-Encoder) afterwards.
- Q: How do you handle real-time updates to the Vector DB? A: HNSW supports dynamic insertions, but the graph degrades. You typically need a “Periodic Re-indexing” job (e.g., daily) to compact and optimize the graph, while handling new items in a smaller, unoptimized buffer.
- Q: Why not just use Cosine Similarity? A: Cosine Similarity is the Dot Product on normalized vectors. We often normalize embeddings to unit length during training so that Dot Product == Cosine Similarity. Unnormalized vectors can cause popularity bias (Vector Norm = Popularity).
- Q: Explain “Hard Negative Mining”. A: Finding negatives that are difficult for the current model to distinguish from positives. We score random items, pick the ones with the highest scores (False Positives), and add them to the next training batch as negatives.
- Q: What is “Quantization” in ANN? A: Reducing float32 (4 bytes) to int8 (1 byte), or using Product Quantization (PQ) to compress vectors. It reduces memory usage by 4x-64x at the cost of slight precision loss.
Glossary
- ANN (Approximate Nearest Neighbors): Algorithms to find similar vectors sub-linearly.
- Two-Tower Model: Architecture separating User and Item processing until the final dot product.
- HNSW: Graph-based ANN algorithm.
- Recall@K: Percentage of relevant items found in the top K results.
- Negative Sampling: The process of selecting “non-interacted” items for supervision.
- InfoNCE: A contrastive loss, effectively categorical cross-entropy over in-batch candidates.
Summary Checklist
- Recall Metric: Monitor Recall@100 for the Retrieval / Candidate Generation stage.
- Latency Budget: Ensure Retrieval takes < 20% of total request budget.
- Index Versioning: Automate the re-indexing pipeline. Never serve Model V2 against Index V1.
- Fallback: If ANN fails, have a “Popular Items” fallback list.
- Filtering: Apply business logic filters (Out of Stock, Region) after Retrieval or using “Filtered ANN” (if supported by DB).
- Normalization: L2-normalize vectors to prevent magnitude issues.
- Negative Sampling: Implement In-Batch negatives with frequency correction.
- Memory Planning: Calculate RAM usage. (100M items * 128 dim * 4 bytes = 51 GB). Use Quantization if needed.
- Sharding: If Index > RAM, shard by UserHash or Region.
- Update Latency: How long does it take for a new item to appear in the Index? (Target: < 1 min).