Chapter 30.2: Hybrid Search Patterns
“Vectors are great for concepts, but terrible for part numbers. If a user searches for ‘Error 504 on host app-09’, a vector search might return ‘Network Timeout on server web-01’, which is semantically similar but factually useless. Keyword search is undefeated for exact matches.” — Search Engineering Principles
30.2.1. The Limits of Dense Retrieval
Vector search (Dense Retrieval) works by mapping queries and documents to a high-dimensional semantic space.
- Query: “How do I reset my password?”
- Match: “Account credential recovery process” (Semantic match, no shared words).
This is magic. But it fails in specific, critical RAG scenarios:
- Exact Match: Product SKUs, Error Codes, Acronyms (“API-902”).
- Out-of-Vocabulary Terms: Proper nouns or internal jargon the embedding model never saw during training.
- Negation: “Show me laptops that are NOT Apple.” Vectors struggle heavily with negation.
The Solution: Hybrid Search
Hybrid search combines the best of both worlds:
- Dense Retrieval (KNN): Understanding intent and meaning.
- Sparse Retrieval (BM25/TF-IDF): Matching precise keywords.
30.2.2. Architecture: RRF (Reciprocal Rank Fusion)
How do you combine a list of results from a vector search and a keyword search? They have different scoring scales.
- Vector Score (Cosine Similarity): 0.0 to 1.0.
- BM25 Score: 0 to $\infty$ (unbounded).
You cannot simply add Vector_Score + BM25_Score.
Reciprocal Rank Fusion (RRF) is the industry standard algorithm for merging ranked lists without needing to normalize the scores. It relies only on the rank position.
The RRF Formula
$$ \text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}(d, r)} $$
Where:
- $d$: Document
- $R$: Set of rank lists (e.g., Vector List, Keyword List)
- $k$: Smoothing constant (usually 60) that dampens the influence of any single list's very top ranks.
Python Implementation of RRF
from collections import defaultdict

def reciprocal_rank_fusion(
    vector_results: list[str],
    keyword_results: list[str],
    k: int = 60
) -> list[tuple[str, float]]:
    """
    Fuses two ranked lists using RRF.
    Results are lists of document IDs, ordered by score (highest first).
    """
    fused_scores = defaultdict(float)
    # Process Vector Results
    for rank, doc_id in enumerate(vector_results):
        fused_scores[doc_id] += 1 / (k + rank + 1)
    # Process Keyword Results
    for rank, doc_id in enumerate(keyword_results):
        fused_scores[doc_id] += 1 / (k + rank + 1)
    # Sort by fused score descending
    sorted_results = sorted(
        fused_scores.items(),
        key=lambda x: x[1],
        reverse=True
    )
    return sorted_results
# Example Usage
docs_vector = ["doc_A", "doc_B", "doc_C", "doc_D"]
docs_keyword = ["doc_C", "doc_A", "doc_E", "doc_B"]
final_ranking = reciprocal_rank_fusion(docs_vector, docs_keyword)
print(final_ranking)
# Output might prioritize doc_A and doc_C as they appear in both.
30.2.3. The Two-Stage Retrieval Pattern
In production RAG systems, we rarely just “Search and Feed to LLM.” We use a Retrieve-Then-Rerank architecture.
Stage 1: Retrieval (High Recall)
Goal: Get all potentially relevant documents quickly.
- Method: Hybrid Search (Vector + Keyword).
- Count: Retrieve top 50-100 documents.
- Speed: < 50ms.
Stage 2: Reranking (High Precision)
Goal: Sort the top 100 to find the absolute best 5 for the LLM context window.
- Method: Cross-Encoder Model.
- Count: Return top 5-10.
- Speed: ~200-500ms (slower, computationally expensive).
Bi-Encoders vs. Cross-Encoders
| Feature | Bi-Encoder (Embeddings) | Cross-Encoder (Reranker) |
|---|---|---|
| Architecture | Siamese Network. Encodes query and doc separately. | Single Transformer. Encodes query and doc together. |
| Input | bert(Query) vs bert(Doc) | bert([CLS] Query [SEP] Doc) |
| Mechanism | Cosine Similarity of vectors. | Full Self-Attention between Query and Doc tokens. |
| Accuracy | Good. | Excellent (captures nuance/interaction). |
| Speed | Fast (0.1ms search). | Slow (requires inference per pair). |
| Role | Stage 1 (Retrieval) | Stage 2 (Reranking) |
Implementation: SentenceTransformers Cross-Encoder
from sentence_transformers import CrossEncoder
# Load a reranker model (e.g., MS MARCO trained)
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
query = "How do I fix a connection timeout?"
documents = [
    "Check your internet cable.",
    "The 504 Gateway Timeout error means...",
    "To bake a cake, you need flour.",  # Irrelevant
    "Connection timeouts can be caused by firewall rules."
]
# Create pairs: (Query, Doc1), (Query, Doc2)...
pairs = [[query, doc] for doc in documents]
# Score pairs
scores = model.predict(pairs)
# Combine and Sort
ranked_docs = sorted(
    zip(documents, scores),
    key=lambda x: x[1],
    reverse=True
)
for doc, score in ranked_docs:
    print(f"{score:.4f}: {doc}")
30.2.4. Cloud Native Reranking (Cohere & Vertex)
Managing Cross-Encoder latency/GPU infrastructure is painful. Cloud APIs offer “Rerank as a Service.”
Cohere Rerank API
Cohere’s Rerank 3 model is an industry benchmark.
import cohere
co = cohere.Client('YOUR_API_KEY')
results = co.rerank(
    query="What is the capital of Canada?",
    documents=[
        "Ottawa is the capital of Canada.",
        "Toronto is the largest city in Canada.",
        "Vancouver is in British Columbia."
    ],
    top_n=1,
    model="rerank-english-v3.0"
)
print(results)
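Vertex AI Ranking API
Google's equivalent is the Vertex AI Ranking API, served through the Discovery Engine client. The sketch below is based on the google-cloud-discoveryengine Python client; the project ID, ranking-config name, and model version string are assumptions you would replace with your own values.
from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.RankServiceClient()
ranking_config = client.ranking_config_path(
    project="your-gcp-project",          # assumption: your project ID
    location="global",
    ranking_config="default_ranking_config",
)
request = discoveryengine.RankRequest(
    ranking_config=ranking_config,
    model="semantic-ranker-512@latest",  # assumption: current ranker version
    top_n=1,
    query="What is the capital of Canada?",
    records=[
        discoveryengine.RankingRecord(id="1", content="Ottawa is the capital of Canada."),
        discoveryengine.RankingRecord(id="2", content="Toronto is the largest city in Canada."),
        discoveryengine.RankingRecord(id="3", content="Vancouver is in British Columbia."),
    ],
)
response = client.rank(request=request)
for record in response.records:
    print(record.id, record.score)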
Advantages of Cloud Reranking
- Zero Ops: No GPU cluster to manage for the reranker.
- Performance: These models are massive (billions of parameters) compared to what you’d run locally (MiniLM ~30M params).
- Context: 4k+ context windows for the document chunks.
30.2.5. Advanced Embedding: Matryoshka Representation Learning (MRL)
A recent technique (introduced in the 2022 MRL paper and productized by OpenAI in 2024) that makes the dense side of hybrid search cheaper and faster.
The Dimensionality Problem
Standard OpenAI embeddings are 1536 dimensions. That’s a lot of storage and compute for the database. What if you could slice the vector?
- Use the first 256 dimensions for fast, coarse search.
- Use the full 1536 dimensions for fine-grained re-scoring.
Usually, slicing a vector destroys its meaning. Matryoshka Representation Learning (MRL) trains embedding models such that the most important information is front-loaded in the earlier dimensions.
Implications for MLOps
- Storage Savings: Store only 512 dims but get 95% of the performance of 1536 dims.
- Adaptive Retrieval:
- Shortlist: Search 1M docs using first 64 dimensions (extremely fast).
- Refine: Rescore the top 1,000 candidates using the full-length vectors (a NumPy sketch of this two-pass pattern follows the OpenAI example below).
New OpenAI models (text-embedding-3-small/large) support this natively.
# OpenAI MRL Example
from openai import OpenAI
import numpy as np
client = OpenAI()
def get_embedding(text, dimensions=1536):
    # The API natively supports the 'dimensions' parameter for newer models
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=dimensions
    )
    return response.data[0].embedding
# Get a reduced dimension embedding directly
short_vec = get_embedding("Hello world", dimensions=256)
full_vec = get_embedding("Hello world", dimensions=3072)
# Theoretically, short_vec ≈ full_vec[:256] (normalized) for MRL models
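Building on these embeddings, the adaptive-retrieval idea can be sketched with plain NumPy. The random matrices below stand in for real MRL embeddings; in practice the vectors would come from get_embedding above or be sliced directly from stored vectors.
import numpy as np

def normalize_rows(m: np.ndarray) -> np.ndarray:
    # L2-normalize so the dot product equals cosine similarity
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

# Hypothetical corpus: full 3072-dim MRL embeddings stored as a matrix
doc_vectors = normalize_rows(np.random.rand(10_000, 3072).astype(np.float32))
query_vector = normalize_rows(np.random.rand(1, 3072).astype(np.float32))

# Pass 1 (coarse): truncate to the first 256 dims and re-normalize
doc_short = normalize_rows(doc_vectors[:, :256])
query_short = normalize_rows(query_vector[:, :256])
coarse_scores = doc_short @ query_short.T                # (10_000, 1)
candidates = np.argsort(-coarse_scores[:, 0])[:1000]     # shortlist

# Pass 2 (refine): rescore only the shortlist with the full vectors
fine_scores = doc_vectors[candidates] @ query_vector.T
top_k = candidates[np.argsort(-fine_scores[:, 0])[:10]]
print(top_k)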
30.2.6. Fine-Tuning Embeddings for Domain Specificity
Sometimes hybrid search fails because the base model doesn’t know your domain.
- General Model: Thinks “Apple” is a fruit or a laptop.
- Your Domain (Finance): “Apple” is primarily a stock ticker AAPL.
Triplet Loss Training
To fine-tune, you need “Triplets”:
- Anchor: The query (“Apple price”)
- Positive: The correct doc (“AAPL closed at $150…”)
- Negative: An incorrect doc (“Granny Smith apples are $2/lb…”)
The goal is to move the Anchor closer to the Positive and further from the Negative in vector space.
Implementation code (SentenceTransformers)
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# 1. Load Pre-trained Model
model = SentenceTransformer('BAAI/bge-base-en-v1.5')
# 2. Prepare Data (Anchor, Positive, Negative)
train_examples = [
    InputExample(texts=['Apple price', 'AAPL stock closed at 150', 'Oranges are 2.99']),
    InputExample(texts=['Python error', 'ImportError: no module', 'The python snake is large'])
]
# 3. Create Dataloader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# 4. Define Loss (Multiple Negatives Ranking Loss is powerful)
train_loss = losses.MultipleNegativesRankingLoss(model)
# 5. Train
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path='./fine-tuned-bge'
)
# Now 'Apple' will be much closer to 'AAPL' in the vector space
30.2.7. Deep Dive: Tuning Sparse Retrieval (BM25)
Sparse retrieval is not “set and forget.” The Okapi BM25 algorithm has two critical hyperparameters that control how it ranks documents.
The BM25 Formula
$$ Score(D, Q) = \sum_{q \in Q} IDF(q) \cdot \frac{f(q, D) \cdot (k_1 + 1)}{f(q, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{avgdl})} $$
Tuning $k_1$ (Term Saturation)
- Definition: Controls how quickly the term frequency saturates.
- Range: Typically 1.2 to 2.0.
- Low $k_1$ (e.g., 1.2): “Mentioning the keyword once is enough.” Additional occurrences don’t add much score.
- High $k_1$ (e.g., 2.0): “More is better.” Documents repeating the keyword 10 times score much higher than 1 time.
- RAG Advice: Use lower $k_1$ (1.2 - 1.5). RAG contexts just need the fact present; spamming the keyword doesn’t make the fact truer.
Tuning $b$ (Length Normalization)
- Definition: Controls how much we penalize long documents.
- Range: 0.0 to 1.0.
- $b = 1.0$: Full length normalization. A keyword match in a 100-word doc is weighted far more heavily than the same match in a 1000-word doc.
- $b = 0.0$: No normalization. Length doesn’t matter.
- RAG Advice: Since we usually chunk documents into fixed sizes (e.g., 512 tokens), $b$ matters less. Set $b=0.75$ (standard).
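To make the $k_1$ discussion concrete, here is a small numeric sketch of the per-term BM25 contribution from the formula above, with a flat IDF of 1.0 and the document length pinned to the average so that $b$ drops out; the helper name is purely illustrative.
def bm25_term_score(tf, k1, b, doc_len, avgdl, idf=1.0):
    # Single-term BM25 contribution, matching the formula above
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avgdl))

# How quickly does the score saturate as the term repeats?
for tf in [1, 2, 5, 10]:
    low = bm25_term_score(tf, k1=1.2, b=0.75, doc_len=500, avgdl=500)
    high = bm25_term_score(tf, k1=2.0, b=0.75, doc_len=500, avgdl=500)
    print(f"tf={tf:2d}  k1=1.2 -> {low:.2f}   k1=2.0 -> {high:.2f}")
With $k_1 = 1.2$ the score barely doubles between one and ten occurrences, while $k_1 = 2.0$ keeps rewarding repetition for longer.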
30.2.8. Implementation: Custom Sparse Index with Redis
Sometimes cloud providers (like Pinecone) don’t give you enough control over the keyword index. Building a high-performance sparse index on Redis is a common MLOps pattern for low-latency RAG.
Redis Architecture
- Keys: Term (Token).
- Value: Sorted Set (ZSET) of document IDs, scored by raw term frequency (the BM25 score is computed at query time).
Python Code: Redis Inverted Index
import redis
import math
from collections import Counter
r = redis.Redis(host='localhost', port=6379, db=0)
def tokenize(text):
    return text.lower().split()  # Simplified

def index_document(doc_id, text):
    terms = tokenize(text)
    term_counts = Counter(terms)
    doc_len = len(terms)
    # Update global stats (for IDF)
    r.incr("stats:doc_count")
    r.incrby("stats:total_tokens", doc_len)
    # Pipeline for speed
    pipe = r.pipeline()
    for term, count in term_counts.items():
        # Store raw TF for now; BM25 is computed at query time.
        pipe.zadd(f"idx:{term}", {doc_id: count})
    pipe.execute()

def query_bm25(query, k1=1.2, b=0.75):
    terms = tokenize(query)
    doc_scores = Counter()
    # Fetch global stats
    N = int(r.get("stats:doc_count") or 1)
    avgdl = int(r.get("stats:total_tokens") or 1) / N
    for term in terms:
        # Get posting list (Doc IDs and TF)
        postings = r.zrange(f"idx:{term}", 0, -1, withscores=True)
        # Calculate IDF
        df = len(postings)
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        for doc_id_bytes, tf in postings:
            doc_id = doc_id_bytes.decode('utf-8')
            # Assuming we store doc_len separately or approximate it
            doc_len = avgdl  # Approximation for simplicity
            # BM25 Component
            numerator = tf * (k1 + 1)
            denominator = tf + k1 * (1 - b + b * (doc_len / avgdl))
            score = idf * (numerator / denominator)
            doc_scores[doc_id] += score
    return doc_scores.most_common(10)
# This implementation runs in < 5ms for typical RAG vocabularies
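A quick usage sketch for the index above, assuming a local Redis instance is running and empty; the document IDs and text are illustrative.
index_document("doc_1", "Error 504 on host app-09 during deploy")
index_document("doc_2", "Network timeout on server web-01")
print(query_bm25("error 504 app-09"))
# doc_1 should rank first: it matches the exact tokens "504" and "app-09"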
30.2.9. Latency Optimization: Distilling Cross-Encoders
Cross-Encoders are slow. A standard BERT-base reranker takes ~200-500ms on CPU to score 50 documents. This is often the bottleneck of the whole RAG pipeline.
Strategy 1: Interaction-based vs Representation-based
- Cross-Encoder: Full attention interaction. Slow. $O(N^2)$ attention.
- ColBERT (Late Interaction):
- Computes token embeddings independently (like Bi-Encoder).
- Computes “MaxSim” interaction at the end (cheap matrix math).
- Result: 10x faster than Cross-Encoder with 95% of the quality.
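The MaxSim operator itself is easy to illustrate. The sketch below uses random stand-in token embeddings rather than a real ColBERT checkpoint; it only demonstrates the late-interaction scoring math.
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """
    ColBERT-style late interaction.
    query_tokens: (Q, d) L2-normalized token embeddings for the query
    doc_tokens:   (D, d) L2-normalized token embeddings for the document
    """
    sim = query_tokens @ doc_tokens.T      # (Q, D) cosine similarities
    return float(sim.max(axis=1).sum())    # best doc token per query token, then sum

# Toy example with random stand-in embeddings
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128));   q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(200, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))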
Strategy 2: Quantization (ONNX + INT8)
Deploy the reranker as an INT8 ONNX model.
- Export: optimum-cli export onnx --model cross-encoder/ms-marco-MiniLM-L-6-v2 ./onnx_model
- Quantize: Use onnxruntime to dynamically quantize the weights.
- Speedup: 2-3x on CPU.
Strategy 3: Caching
Reranking results are deterministic for (Query, Document_Set).
- Hash: sha256(query + sorted(doc_ids))
- Cache: Store the reranked list in Redis.
- Hit Rate: Often low for unique user queries, but high for “Trending Questions” or “Suggested Prompts”.
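A minimal caching sketch, assuming a local Redis and any rerank_fn callable that returns a JSON-serializable ranked list; the key prefix and TTL are illustrative choices.
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)  # assumed local Redis
CACHE_TTL_SECONDS = 3600

def rerank_cache_key(query: str, doc_ids: list[str]) -> str:
    payload = query + "|" + ",".join(sorted(doc_ids))
    return "rerank:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_rerank(query: str, doc_ids: list[str], rerank_fn):
    key = rerank_cache_key(query, doc_ids)
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)             # cache hit: skip the Cross-Encoder
    ranked = rerank_fn(query, doc_ids)        # expensive Cross-Encoder call
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(ranked))
    return ranked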
30.2.10. Evaluating Hybrid Search Quality
How do you know if your expensive Hybrid setup is better than simple Vector search?
1. NDCG@10 (Normalized Discounted Cumulative Gain)
- Meaning: “Did I get the relevant docs at the very top?”
- Calculation: Penalizes relevant documents appearing at rank 5 instead of rank 1.
- Use Case: General ranking quality.
2. Recall@K
- Meaning: “Is the answer somewhere in the top K?”
- Use Case: Evaluating the Retrieval stage (Stage 1). If the Retrieval stage misses the document, the Reranker (Stage 2) can never fix it.
3. MRR (Mean Reciprocal Rank)
- Meaning: “On average, at what rank does the first correct answer appear?”
- Formula: $MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$. If the answer is at rank 1, the query scores 1; at rank 2, it scores 0.5.
Evaluation Code (Python)
import numpy as np
def calculate_ndcg(retrieved_ids, relevant_ids, k=10):
    dcg = 0
    idcg = 0
    # 1. Calculate DCG
    for i, doc_id in enumerate(retrieved_ids[:k]):
        if doc_id in relevant_ids:
            rel = 1  # Binary relevance
            dcg += rel / np.log2(i + 2)
    # 2. Calculate Ideal DCG (if all relevant were at top)
    num_relevant = min(len(relevant_ids), k)
    for i in range(num_relevant):
        idcg += 1 / np.log2(i + 2)
    if idcg == 0:
        return 0.0
    return dcg / idcg
# Example
ground_truth = {"q1": ["doc_A", "doc_B"]} # Gold standard
system_output = ["doc_C", "doc_A", "doc_D"] # System guess
score = calculate_ndcg(system_output, ground_truth["q1"])
print(f"NDCG Score: {score}")
30.2.12. Advanced Query Expansion: HyDE and Multi-Query
A user’s query is often a poor representation of what they are looking for. Strategies to “fix” the query before searching are highly effective.
1. Hypothetical Document Embeddings (HyDE)
- Intuition: Embeddings align “Questions” with “Questions” and “Answers” with “Answers.” Searching for a Question in an Answer space is suboptimal.
- Technique:
- Ask an LLM to “hallucinate” a fake answer to the user’s question.
- Embed the fake answer.
- Search the vector DB using the fake answer vector.
- Result: The fake answer’s vector is semantically closer to the real answer than the question was.
from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import OpenAI, OpenAIEmbeddings

# 1. Wrap the base embeddings: the LLM writes a hypothetical answer, which is then embedded
llm = OpenAI()
base_embeddings = OpenAIEmbeddings()
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm, base_embeddings, prompt_key="web_search"
)
# 2. Get the vector of the hallucinated answer
fake_doc_vector = hyde_embeddings.embed_query("What is the impact of rainfall on crop yield?")
# 3. Search the vector store with that vector (vector_db is assumed to exist)
search_results = vector_db.similarity_search_by_vector(fake_doc_vector)
2. Multi-Query Expansion
Users are lazy. They type “login error.”
- Technique: Ask LLM to generate 5 variations: “How to fix login failure,” “Authentication timeout troubleshooting,” “Sign-in error 403 context.”
- Execute: Run 5 parallel searches.
- Fuse: De-duplicate results using RRF.
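A hedged sketch of Multi-Query expansion: an LLM generates paraphrases, each variant is searched, and the lists are fused with the same RRF formula from 30.2.2. The model name and prompt are illustrative, run_search is an assumed callable returning ranked document IDs, and the searches run sequentially here for clarity (a production system would parallelize them).
from collections import defaultdict
from openai import OpenAI

client = OpenAI()

def expand_query(query: str, n: int = 5) -> list[str]:
    # Ask an LLM for paraphrases, one per line (prompt and model are illustrative)
    prompt = (
        f"Generate {n} alternative search queries for: {query}\n"
        "Return one query per line with no numbering."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    variations = [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]
    return [query] + variations[:n]

def multi_query_search(query: str, run_search) -> list[tuple[str, float]]:
    # run_search(q) is an assumed callable returning a ranked list of doc IDs
    fused = defaultdict(float)
    for variant in expand_query(query):
        for rank, doc_id in enumerate(run_search(variant)):
            fused[doc_id] += 1 / (60 + rank + 1)   # same RRF formula as 30.2.2
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)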
30.2.13. Learned Sparse Retrieval: SPLADE
BM25 is “unsupervised” (just counts words). SPLADE (Sparse Lexical and Expansion Model) is a neural network that generates sparse vectors.
How SPLADE Works
It maps a sentence to a sparse vector of size 30,000 (vocabulary size), but only activates ~100 dimensions—crucially, it activates synonyms that weren’t in the text.
- Input: “The car is fast.”
- SPLADE Output: {"car": 2.1, "vehicle": 1.5, "fast": 1.8, "speed": 1.2}
- Note: It learned "vehicle" and "speed" even though they weren't in the input!
Using SPLADE in Production
SPLADE vectors can be stored in Elasticsearch or Redis just like BM25 vectors.
- Pros: Solves the “mismatch vocabulary” problem without dense vectors.
- Cons: Inference cost (BERT forward pass) during indexing and querying.
30.2.14. Ensemble Retrievers: The Kitchen Sink Approach
The most robust RAG systems use a “Voting” mechanism across different algorithms.
Architecture: The “Ensemble”
- Retriever A: Dense (OpenAI Embeddings). Captures semantic meaning.
- Retriever B: Sparse (BM25). Captures exact keywords.
- Retriever C: Domain-Specific (e.g., SQL Retriever for structured data).
Weighted Fusion Code
from collections import defaultdict

def normalize(scores):
    # Min-max scale a list of scores into [0, 1]
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 1.0 for s in scores]

def weighted_ensemble_search(query, weights={'dense': 0.7, 'sparse': 0.3}):
    # 1. Run Parallel Searches (vector_store / bm25_retriever are assumed retriever objects)
    dense_results = vector_store.similarity_search_with_score(query, k=50)
    sparse_results = bm25_retriever.get_relevant_documents(query)
    # 2. Normalize Scores (Min-Max Scaling)
    # Critical because Cosine is 0-1 but BM25 is 0-25
    dense_norm = normalize([score for doc, score in dense_results])
    # Assumes the sparse retriever attaches its BM25 score to doc.metadata['score']
    sparse_norm = normalize([doc.metadata['score'] for doc in sparse_results])
    # 3. Combine with the configured weights
    final_scores = defaultdict(float)
    for i, (doc, _) in enumerate(dense_results):
        final_scores[doc.page_content] += dense_norm[i] * weights['dense']
    for i, doc in enumerate(sparse_results):
        final_scores[doc.page_content] += sparse_norm[i] * weights['sparse']
    return sorted(final_scores.items(), key=lambda x: x[1], reverse=True)
30.2.15. Latency Benchmarking: Reranker Impact
Adding a Reranker is the biggest latency penalty in RAG. You must measure it.
Benchmark Results (Typical CPU)
| Model | Type | Latency (50 docs) | Quality (NDCG) |
|---|---|---|---|
| Cross-Encoder (Big) | BERT-Large | 800ms | 0.85 (SOTA) |
| Cross-Encoder (Small) | MiniLM-L6 | 150ms | 0.82 |
| ColBERT (Late Interaction) | PyTorch | 25ms | 0.84 |
| Bi-Encoder only | Cosine | 5ms | 0.70 |
Optimization Strategy
- The “Waterfall”:
- Query -> Bi-Encoder (Top 100).
- Fast Reranker (MiniLM) -> Top 20.
- Slow Reranker (GPT-4 or Big BERT) -> Top 5.
- Use ColBERT: Ideally, replace the Cross-Encoder with ColBERT for the best speed/quality ratio.
30.2.17. Case Study: Hybrid Search in E-Commerce (Home Furnishing)
E-commerce is the proving ground for Hybrid Search. Users search for high-intent keywords but expect semantic understanding.
The Problem
Users search for “Mid-century modern velvet sofa blue”.
- Vector Search: Returns “Blue mid-century chair” (Semantic match, wrong object).
- Keyword Search: Returns “Blue Velvet Shirt” (Keyword match, wrong category).
The Architecture: 3-Way Ensemble
They implemented a weighted ensemble using Elasticsearch:
- Dense: HNSW on title_embedding (Weight: 0.4).
  - Captures style (“Mid-century”).
- Sparse: BM25 on title and description (Weight: 0.3).
  - Captures specific materials (“Velvet”, “Blue”).
- Structured: Filter on category_id (Weight: 0.3).
  - Ensures the result is actually a “Sofa”.
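For concreteness, here is a hedged sketch of what the combined request could look like as an Elasticsearch 8.x search body. The index and field names (title_embedding, title, description, category_id) come from the case study; the weights are applied as boosts, and the es client plus the embedding helper are assumed to exist. Treat it as a shape illustration rather than a verified production query.
query_text = "Mid-century modern velvet sofa blue"
query_vec = get_embedding(query_text, dimensions=256)    # reuse any embedding helper (see 30.2.5)

hybrid_body = {
    "knn": {                                             # Dense leg (~0.4 weight via boost)
        "field": "title_embedding",
        "query_vector": query_vec,
        "k": 50,
        "num_candidates": 500,
        "boost": 0.4,
    },
    "query": {                                           # Sparse + structured legs
        "bool": {
            "should": [
                {"multi_match": {"query": query_text, "fields": ["title", "description"], "boost": 0.3}}
            ],
            "filter": [{"term": {"category_id": "sofa"}}],
        }
    },
}
# results = es.search(index="products", body=hybrid_body, size=20)  # es: elasticsearch.Elasticsearch client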
The “Reranking” Latency Trap
They tried a Cross-Encoder on the top 50 items. P99 latency spiked to 600ms.
- Fix: Distilled the Cross-Encoder into a gradient boosted tree (XGBoost) using simple features + one “similarity score” feature.
- Result: 50ms P99.
30.2.18. War Story: The “Zero Result” Crisis
“We deployed the new Hybrid RAG system. The CEO searched for ‘The IT Department Strategy’, and got ZERO results. Panic ensued.”
The Incident
- Search: “The IT Department Strategy”
- Vector Results: Returned 10 relevant docs.
- Keyword Results: Returned 0 docs.
- Hybrid Logic: AND operator between the Vector and Keyword result sets (Intersection).
Root Cause
The standard BM25 analyzer was configured with a Stop Word Filter that removed “The”, “IT”, “Department”.
- “IT” was considered a stop word (common usage).
- “Department” was considered generic.
- “Strategy” was the only term left.
- But the index didn’t match documents because the tokenizer stripped “IT” during ingestion but NOT during query time (configuration drift).
The Fix
- Change Logic: Switch to OR logic (Union) with RRF boosting for intersection. Never use a hard AND between modalities.
- Fix Tokenizer: Treating “IT” as a stop word (and stripping it) is a classic mistake in tech companies; keep domain-critical acronyms in the analyzer.
- Validation: Added a “Zero Result” monitor. If > 5% of queries have 0 results, alert the team.
Lesson: Intersection (AND) is dangerous in Hybrid Search. Always use Union (OR) + Ranking.
30.2.19. Interview Questions
Q1: Explain Reciprocal Rank Fusion (RRF). Why is it better than summing scores?
- Answer: BM25 scores are unbounded (0 to infinity), while Cosine Similarity is 0 to 1. Summing them allows BM25 to dominate. RRF ignores the absolute score and relies only on the rank position ($\frac{1}{k + rank}$). It is robust to different scale distributions.
Q2: When would you prefer a Bi-Encoder over a Cross-Encoder?
- Answer: For Stage 1 Retrieval. Bi-Encoders allow pre-computing embeddings for 10M documents, enabling fast approximate nearest-neighbor search that is sub-linear in corpus size. Cross-Encoders require a forward pass per $(Query, Doc)$ pair at runtime ($O(N)$), which is too slow for retrieval but perfect for Stage 2 Reranking.
Q3: How does SPLADE differ from BM25?
- Answer: BM25 relies on exact term matching. If the user types “car” and the doc says “auto”, BM25 fails (unless you add synonyms). SPLADE uses a BERT model to “expand” the document vector to include relevant synonyms (“auto”) during indexing, solving the vocabulary mismatch problem while keeping the vector sparse.
30.2.20. Summary: The Production Retrieval Stack
A production-grade RAG retrieval pipeline looks like this:
- Query Analysis: Rewriting/Expanding the query (HyDE).
- Hybrid Retrieval (Parallel):
- Vector Search: HNSW index, top-k=100.
- Keyword Search: BM25 index, top-k=100.
- Result Fusion: RRF to combine the two lists into a unified top-100.
- Reranking: Cross-Encoder (or ColBERT) to score the top-100.
- Selection: top-5 passed to Context Window.
This pipeline mitigates the hallucinations caused by “retrieving the wrong thing” and is the single biggest quality upgrade you can make to a RAG system.
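To tie the stack together, here is a minimal orchestration sketch. It assumes vector_search and keyword_search callables that each return a ranked list of document IDs, a get_text lookup for chunk text, and reuses the reciprocal_rank_fusion function from 30.2.2 plus the CrossEncoder from 30.2.3; the query-analysis step (HyDE / Multi-Query) is omitted for brevity.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_context(query, vector_search, keyword_search, get_text, final_k=5):
    # 1. Hybrid retrieval: both legs return ranked lists of doc IDs (top-100 each)
    dense_ids = vector_search(query, k=100)
    sparse_ids = keyword_search(query, k=100)
    # 2. Fuse the two lists with RRF (function from 30.2.2)
    fused = reciprocal_rank_fusion(dense_ids, sparse_ids)
    candidate_ids = [doc_id for doc_id, _ in fused[:100]]
    # 3. Rerank the fused candidates with the Cross-Encoder
    texts = [get_text(doc_id) for doc_id in candidate_ids]
    scores = reranker.predict([[query, text] for text in texts])
    ranked = sorted(zip(texts, scores), key=lambda x: x[1], reverse=True)
    # 4. Hand only the top-k passages to the LLM context window
    return [text for text, _ in ranked[:final_k]]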