Chapter 30.2: Hybrid Search Patterns
“Vectors are great for concepts, but terrible for part numbers. If a user searches for ‘Error 504 on host app-09’, a vector search might return ‘Network Timeout on server web-01’, which is semantically similar but factually useless. Keyword search is undefeated for exact matches.” — Search Engineering Principles
30.2.1. The Limits of Dense Retrieval
Vector search (Dense Retrieval) works by mapping queries and documents to a high-dimensional semantic space.
- Query: “How do I reset my password?”
- Match: “Account credential recovery process” (Semantic match, no shared words).
This is magic. But it fails in specific, critical RAG scenarios:
- Exact Match: Product SKUs, Error Codes, Acronyms (“API-902”).
- Out-of-Vocabulary Terms: Proper nouns or internal jargon the embedding model never saw during training.
- Negation: “Show me laptops that are NOT Apple.” Vectors struggle heavily with negation.
The Solution: Hybrid Search
Hybrid search combines the best of both worlds:
- Dense Retrieval (KNN): Understanding intent and meaning.
- Sparse Retrieval (BM25/TF-IDF): Matching precise keywords.
30.2.2. Architecture: RRF (Reciprocal Rank Fusion)
How do you combine a list of results from a vector search and a keyword search? They have different scoring scales.
- Vector Score (Cosine Similarity): 0.0 to 1.0.
- BM25 Score: 0 to $\infty$ (unbounded).
You cannot simply add Vector_Score + BM25_Score.
Reciprocal Rank Fusion (RRF) is the industry standard algorithm for merging ranked lists without needing to normalize the scores. It relies only on the rank position.
The RRF Formula
$$ \text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}(d, r)} $$
Where:
- $d$: Document
- $R$: Set of rank lists (e.g., Vector List, Keyword List)
- $k$: Smoothing constant (usually 60) that dampens the influence of any single list's very top ranks.
Python Implementation of RRF
from collections import defaultdict

def reciprocal_rank_fusion(
    vector_results: list[str],
    keyword_results: list[str],
    k: int = 60
) -> list[tuple[str, float]]:
    """
    Fuses two ranked lists using RRF.
    Results are lists of document IDs, ordered by score (highest first).
    """
    fused_scores = defaultdict(float)
    # Process Vector Results
    for rank, doc_id in enumerate(vector_results):
        fused_scores[doc_id] += 1 / (k + rank + 1)
    # Process Keyword Results
    for rank, doc_id in enumerate(keyword_results):
        fused_scores[doc_id] += 1 / (k + rank + 1)
    # Sort by fused score descending
    sorted_results = sorted(
        fused_scores.items(),
        key=lambda x: x[1],
        reverse=True
    )
    return sorted_results
# Example Usage
docs_vector = ["doc_A", "doc_B", "doc_C", "doc_D"]
docs_keyword = ["doc_C", "doc_A", "doc_E", "doc_B"]
final_ranking = reciprocal_rank_fusion(docs_vector, docs_keyword)
print(final_ranking)
# Output might prioritize doc_A and doc_C as they appear in both.
30.2.3. The Two-Stage Retrieval Pattern
In production RAG systems, we rarely just “Search and Feed to LLM.” We use a Retrieve-Then-Rerank architecture.
Stage 1: Retrieval (High Recall)
Goal: Get all potentially relevant documents quickly.
- Method: Hybrid Search (Vector + Keyword).
- Count: Retrieve top 50-100 documents.
- Speed: < 50ms.
Stage 2: Reranking (High Precision)
Goal: Sort the top 100 to find the absolute best 5 for the LLM context window.
- Method: Cross-Encoder Model.
- Count: Return top 5-10.
- Speed: ~200-500ms (slower, computationally expensive).
Bi-Encoders vs. Cross-Encoders
| Feature | Bi-Encoder (Embeddings) | Cross-Encoder (Reranker) |
|---|---|---|
| Architecture | Siamese Network. Encodes query and doc separately. | Single Transformer. Encodes query and doc together. |
| Input | bert(Query) vs bert(Doc) | bert([CLS] Query [SEP] Doc) |
| Mechanism | Cosine Similarity of vectors. | Full Self-Attention between Query and Doc tokens. |
| Accuracy | Good. | Excellent (captures nuance/interaction). |
| Speed | Fast (0.1ms search). | Slow (requires inference per pair). |
| Role | Stage 1 (Retrieval) | Stage 2 (Reranking) |
Implementation: SentenceTransformers Cross-Encoder
from sentence_transformers import CrossEncoder
# Load a reranker model (e.g., MS MARCO trained)
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
query = "How do I fix a connection timeout?"
documents = [
    "Check your internet cable.",
    "The 504 Gateway Timeout error means...",
    "To bake a cake, you need flour.",  # Irrelevant
    "Connection timeouts can be caused by firewall rules."
]
# Create pairs: (Query, Doc1), (Query, Doc2)...
pairs = [[query, doc] for doc in documents]
# Score pairs
scores = model.predict(pairs)
# Combine and Sort
ranked_docs = sorted(
    zip(documents, scores),
    key=lambda x: x[1],
    reverse=True
)
for doc, score in ranked_docs:
    print(f"{score:.4f}: {doc}")
30.2.4. Cloud Native Reranking (Cohere & Vertex)
Managing Cross-Encoder latency/GPU infrastructure is painful. Cloud APIs offer “Rerank as a Service.”
Cohere Rerank API
Cohere’s Rerank 3 model is an industry benchmark.
import cohere
co = cohere.Client('YOUR_API_KEY')
results = co.rerank(
    query="What is the capital of Canada?",
    documents=[
        "Ottawa is the capital of Canada.",
        "Toronto is the largest city in Canada.",
        "Vancouver is in British Columbia."
    ],
    top_n=1,
    model="rerank-english-v3.0"
)
print(results)
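Vertex AI Ranking API
Google's equivalent is the Vertex AI Ranking API, served through the Discovery Engine client. The sketch below is based on the google-cloud-discoveryengine Python client; the project ID, ranking-config name, and model version string are assumptions you would replace with your own values.
from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.RankServiceClient()
ranking_config = client.ranking_config_path(
    project="your-gcp-project",          # assumption: your project ID
    location="global",
    ranking_config="default_ranking_config",
)
request = discoveryengine.RankRequest(
    ranking_config=ranking_config,
    model="semantic-ranker-512@latest",  # assumption: current ranker version
    top_n=1,
    query="What is the capital of Canada?",
    records=[
        discoveryengine.RankingRecord(id="1", content="Ottawa is the capital of Canada."),
        discoveryengine.RankingRecord(id="2", content="Toronto is the largest city in Canada."),
        discoveryengine.RankingRecord(id="3", content="Vancouver is in British Columbia."),
    ],
)
response = client.rank(request=request)
for record in response.records:
    print(record.id, record.score)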
Advantages of Cloud Reranking
- Zero Ops: No GPU cluster to manage for the reranker.
- Performance: These models are massive (billions of parameters) compared to what you’d run locally (MiniLM ~30M params).
- Context: 4k+ context windows for the document chunks.
30.2.5. Advanced Embedding: Matryoshka Representation Learning (MRL)
A recent technique (introduced in the 2022 MRL paper and productized by OpenAI in 2024) that makes the dense side of hybrid search cheaper and faster.
The Dimensionality Problem
Standard OpenAI embeddings are 1536 dimensions. That’s a lot of storage and compute for the database. What if you could slice the vector?
- Use the first 256 dimensions for fast, coarse search.
- Use the full 1536 dimensions for fine-grained re-scoring.
Usually, slicing a vector destroys its meaning. Matryoshka Representation Learning (MRL) trains embedding models such that the most important information is front-loaded in the earlier dimensions.
Implications for MLOps
- Storage Savings: Store only 512 dims but get 95% of the performance of 1536 dims.
- Adaptive Retrieval:
- Shortlist: Search 1M docs using first 64 dimensions (extremely fast).
- Refine: Rescore the top 1,000 candidates using the full-length vectors (a NumPy sketch of this two-pass pattern follows the OpenAI example below).
New OpenAI models (text-embedding-3-small/large) support this natively.
# OpenAI MRL Example
from openai import OpenAI
import numpy as np
client = OpenAI()
def get_embedding(text, dimensions=1536):
    # The API natively supports the 'dimensions' parameter for newer models
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=dimensions
    )
    return response.data[0].embedding
# Get a reduced dimension embedding directly
short_vec = get_embedding("Hello world", dimensions=256)
full_vec = get_embedding("Hello world", dimensions=3072)
# Theoretically, short_vec ≈ full_vec[:256] (normalized) for MRL models
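Building on these embeddings, the adaptive-retrieval idea can be sketched with plain NumPy. The random matrices below stand in for real MRL embeddings; in practice the vectors would come from get_embedding above or be sliced directly from stored vectors.
import numpy as np

def normalize_rows(m: np.ndarray) -> np.ndarray:
    # L2-normalize so the dot product equals cosine similarity
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

# Hypothetical corpus: full 3072-dim MRL embeddings stored as a matrix
doc_vectors = normalize_rows(np.random.rand(10_000, 3072).astype(np.float32))
query_vector = normalize_rows(np.random.rand(1, 3072).astype(np.float32))

# Pass 1 (coarse): truncate to the first 256 dims and re-normalize
doc_short = normalize_rows(doc_vectors[:, :256])
query_short = normalize_rows(query_vector[:, :256])
coarse_scores = doc_short @ query_short.T                # (10_000, 1)
candidates = np.argsort(-coarse_scores[:, 0])[:1000]     # shortlist

# Pass 2 (refine): rescore only the shortlist with the full vectors
fine_scores = doc_vectors[candidates] @ query_vector.T
top_k = candidates[np.argsort(-fine_scores[:, 0])[:10]]
print(top_k)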
30.2.6. Fine-Tuning Embeddings for Domain Specificity
Sometimes hybrid search fails because the base model doesn’t know your domain.
- General Model: Thinks “Apple” is a fruit or a laptop.
- Your Domain (Finance): “Apple” is primarily a stock ticker AAPL.
Triplet Loss Training
To fine-tune, you need “Triplets”:
- Anchor: The query (“Apple price”)
- Positive: The correct doc (“AAPL closed at $150…”)
- Negative: An incorrect doc (“Granny Smith apples are $2/lb…”)
The goal is to move the Anchor closer to the Positive and further from the Negative in vector space.
Implementation code (SentenceTransformers)
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# 1. Load Pre-trained Model
model = SentenceTransformer('BAAI/bge-base-en-v1.5')
# 2. Prepare Data (Anchor, Positive, Negative)
train_examples = [
    InputExample(texts=['Apple price', 'AAPL stock closed at 150', 'Oranges are 2.99']),
    InputExample(texts=['Python error', 'ImportError: no module', 'The python snake is large'])
]
# 3. Create Dataloader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# 4. Define Loss (Multiple Negatives Ranking Loss is powerful)
train_loss = losses.MultipleNegativesRankingLoss(model)
# 5. Train
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path='./fine-tuned-bge'
)
# Now 'Apple' will be much closer to 'AAPL' in the vector space
30.2.7. Deep Dive: Tuning Sparse Retrieval (BM25)
Sparse retrieval is not “set and forget.” The Okapi BM25 algorithm has two critical hyperparameters that control how it ranks documents.
The BM25 Formula
$$ Score(D, Q) = \sum_{q \in Q} IDF(q) \cdot \frac{f(q, D) \cdot (k_1 + 1)}{f(q, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{avgdl})} $$
Tuning $k_1$ (Term Saturation)
- Definition: Controls how quickly the term frequency saturates.
- Range: Typically 1.2 to 2.0.
- Low $k_1$ (e.g., 1.2): “Mentioning the keyword once is enough.” Additional occurrences don’t add much score.
- High $k_1$ (e.g., 2.0): “More is better.” Documents repeating the keyword 10 times score much higher than 1 time.
- RAG Advice: Use lower $k_1$ (1.2 - 1.5). RAG contexts just need the fact present; spamming the keyword doesn’t make the fact truer.
Tuning $b$ (Length Normalization)
- Definition: Controls how much we penalize long documents.
- Range: 0.0 to 1.0.
- $b = 1.0$: Full length normalization. A keyword match in a 100-word doc is weighted far more heavily than the same match in a 1000-word doc.
- $b = 0.0$: No normalization. Length doesn’t matter.
- RAG Advice: Since we usually chunk documents into fixed sizes (e.g., 512 tokens), $b$ matters less. Set $b=0.75$ (standard).
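To make the $k_1$ discussion concrete, here is a small numeric sketch of the per-term BM25 contribution from the formula above, with a flat IDF of 1.0 and the document length pinned to the average so that $b$ drops out; the helper name is purely illustrative.
def bm25_term_score(tf, k1, b, doc_len, avgdl, idf=1.0):
    # Single-term BM25 contribution, matching the formula above
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avgdl))

# How quickly does the score saturate as the term repeats?
for tf in [1, 2, 5, 10]:
    low = bm25_term_score(tf, k1=1.2, b=0.75, doc_len=500, avgdl=500)
    high = bm25_term_score(tf, k1=2.0, b=0.75, doc_len=500, avgdl=500)
    print(f"tf={tf:2d}  k1=1.2 -> {low:.2f}   k1=2.0 -> {high:.2f}")
With $k_1 = 1.2$ the score barely doubles between one and ten occurrences, while $k_1 = 2.0$ keeps rewarding repetition for longer.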
30.2.8. Implementation: Custom Sparse Index with Redis
Sometimes cloud providers (like Pinecone) don’t give you enough control over the keyword index. Building a high-performance sparse index on Redis is a common MLOps pattern for low-latency RAG.
Redis Architecture
- Keys: Term (Token).
- Value: Sorted Set (ZSET) of document IDs, scored by raw term frequency (the BM25 score is computed at query time).
Python Code: Redis Inverted Index
import redis
import math
from collections import Counter
r = redis.Redis(host='localhost', port=6379, db=0)
def tokenize(text):
    return text.lower().split()  # Simplified

def index_document(doc_id, text):
    terms = tokenize(text)
    term_counts = Counter(terms)
    doc_len = len(terms)
    # Update global stats (for IDF)
    r.incr("stats:doc_count")
    r.incrby("stats:total_tokens", doc_len)
    # Pipeline for speed
    pipe = r.pipeline()
    for term, count in term_counts.items():
        # Store raw TF for now; BM25 is computed at query time.
        pipe.zadd(f"idx:{term}", {doc_id: count})
    pipe.execute()

def query_bm25(query, k1=1.2, b=0.75):
    terms = tokenize(query)
    doc_scores = Counter()
    # Fetch global stats
    N = int(r.get("stats:doc_count") or 1)
    avgdl = int(r.get("stats:total_tokens") or 1) / N
    for term in terms:
        # Get posting list (Doc IDs and TF)
        postings = r.zrange(f"idx:{term}", 0, -1, withscores=True)
        # Calculate IDF
        df = len(postings)
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        for doc_id_bytes, tf in postings:
            doc_id = doc_id_bytes.decode('utf-8')
            # Assuming we store doc_len separately or approximate it
            doc_len = avgdl  # Approximation for simplicity
            # BM25 Component
            numerator = tf * (k1 + 1)
            denominator = tf + k1 * (1 - b + b * (doc_len / avgdl))
            score = idf * (numerator / denominator)
            doc_scores[doc_id] += score
    return doc_scores.most_common(10)
# This implementation runs in < 5ms for typical RAG vocabularies
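A quick usage sketch for the index above, assuming a local Redis instance is running and empty; the document IDs and text are illustrative.
index_document("doc_1", "Error 504 on host app-09 during deploy")
index_document("doc_2", "Network timeout on server web-01")
print(query_bm25("error 504 app-09"))
# doc_1 should rank first: it matches the exact tokens "504" and "app-09"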
30.2.9. Latency Optimization: Distilling Cross-Encoders
Cross-Encoders are slow. A standard BERT-base reranker takes ~200-500ms on CPU to score 50 documents. This is often the bottleneck of the whole RAG pipeline.
Strategy 1: Interaction-based vs Representation-based
- Cross-Encoder: Full attention interaction. Slow. $O(N^2)$ attention.
- ColBERT (Late Interaction):
- Computes token embeddings independently (like Bi-Encoder).
- Computes “MaxSim” interaction at the end (cheap matrix math).
- Result: 10x faster than Cross-Encoder with 95% of the quality.
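The MaxSim operator itself is easy to illustrate. The sketch below uses random stand-in token embeddings rather than a real ColBERT checkpoint; it only demonstrates the late-interaction scoring math.
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """
    ColBERT-style late interaction.
    query_tokens: (Q, d) L2-normalized token embeddings for the query
    doc_tokens:   (D, d) L2-normalized token embeddings for the document
    """
    sim = query_tokens @ doc_tokens.T      # (Q, D) cosine similarities
    return float(sim.max(axis=1).sum())    # best doc token per query token, then sum

# Toy example with random stand-in embeddings
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128));   q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(200, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))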
Strategy 2: Quantization (ONNX + INT8)
Deploy the reranker as an INT8 ONNX model.
- Export: optimum-cli export onnx --model cross-encoder/ms-marco-MiniLM-L-6-v2 ./onnx_model
- Quantize: Use onnxruntime to dynamically quantize the weights.
- Speedup: 2-3x on CPU.
Strategy 3: Caching
Reranking results are deterministic for (Query, Document_Set).
- Hash: sha256(query + sorted(doc_ids))
- Cache: Store the reranked list in Redis.
- Hit Rate: Often low for unique user queries, but high for “Trending Questions” or “Suggested Prompts”.
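A minimal caching sketch, assuming a local Redis and any rerank_fn callable that returns a JSON-serializable ranked list; the key prefix and TTL are illustrative choices.
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)  # assumed local Redis
CACHE_TTL_SECONDS = 3600

def rerank_cache_key(query: str, doc_ids: list[str]) -> str:
    payload = query + "|" + ",".join(sorted(doc_ids))
    return "rerank:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_rerank(query: str, doc_ids: list[str], rerank_fn):
    key = rerank_cache_key(query, doc_ids)
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)             # cache hit: skip the Cross-Encoder
    ranked = rerank_fn(query, doc_ids)        # expensive Cross-Encoder call
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(ranked))
    return ranked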
30.2.10. Evaluating Hybrid Search Quality
How do you know if your expensive Hybrid setup is better than simple Vector search?
1. NDCG@10 (Normalized Discounted Cumulative Gain)
- Meaning: “Did I get the relevant docs at the very top?”
- Calculation: Penalizes relevant documents appearing at rank 5 instead of rank 1.
- Use Case: General ranking quality.
2. Recall@K
- Meaning: “Is the answer somewhere in the top K?”
- Use Case: Evaluating the Retrieval stage (Stage 1). If the Retrieval stage misses the document, the Reranker (Stage 2) can never fix it.
3. MRR (Mean Reciprocal Rank)
- Meaning: “On average, at what rank does the first correct answer appear?”
- Formula: $MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$. If the answer is at rank 1, the query scores 1; at rank 2, it scores 0.5.
Evaluation Code (Python)
import numpy as np
def calculate_ndcg(retrieved_ids, relevant_ids, k=10):
    dcg = 0
    idcg = 0
    # 1. Calculate DCG
    for i, doc_id in enumerate(retrieved_ids[:k]):
        if doc_id in relevant_ids:
            rel = 1  # Binary relevance
            dcg += rel / np.log2(i + 2)
    # 2. Calculate Ideal DCG (if all relevant were at top)
    num_relevant = min(len(relevant_ids), k)
    for i in range(num_relevant):
        idcg += 1 / np.log2(i + 2)
    if idcg == 0:
        return 0.0
    return dcg / idcg
# Example
ground_truth = {"q1": ["doc_A", "doc_B"]} # Gold standard
system_output = ["doc_C", "doc_A", "doc_D"] # System guess
score = calculate_ndcg(system_output, ground_truth["q1"])
print(f"NDCG Score: {score}")
30.2.12. Advanced Query Expansion: HyDE and Multi-Query
A user’s query is often a poor representation of what they are looking for. Strategies to “fix” the query before searching are highly effective.
1. Hypothetical Document Embeddings (HyDE)
- Intuition: Embeddings align “Questions” with “Questions” and “Answers” with “Answers.” Searching for a Question in an Answer space is suboptimal.
- Technique:
- Ask an LLM to “hallucinate” a fake answer to the user’s question.
- Embed the fake answer.
- Search the vector DB using the fake answer vector.
- Result: The fake answer’s vector is semantically closer to the real answer than the question was.
from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import OpenAI, OpenAIEmbeddings

# 1. Wrap the base embeddings: the LLM writes a hypothetical answer, which is then embedded
llm = OpenAI()
base_embeddings = OpenAIEmbeddings()
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm, base_embeddings, prompt_key="web_search"
)
# 2. Get the vector of the hallucinated answer
fake_doc_vector = hyde_embeddings.embed_query("What is the impact of rainfall on crop yield?")
# 3. Search the vector store with that vector (vector_db is assumed to exist)
search_results = vector_db.similarity_search_by_vector(fake_doc_vector)
2. Multi-Query Expansion
Users are lazy. They type “login error.”
- Technique: Ask LLM to generate 5 variations: “How to fix login failure,” “Authentication timeout troubleshooting,” “Sign-in error 403 context.”
- Execute: Run 5 parallel searches.
- Fuse: De-duplicate results using RRF.
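A hedged sketch of Multi-Query expansion: an LLM generates paraphrases, each variant is searched, and the lists are fused with the same RRF formula from 30.2.2. The model name and prompt are illustrative, run_search is an assumed callable returning ranked document IDs, and the searches run sequentially here for clarity (a production system would parallelize them).
from collections import defaultdict
from openai import OpenAI

client = OpenAI()

def expand_query(query: str, n: int = 5) -> list[str]:
    # Ask an LLM for paraphrases, one per line (prompt and model are illustrative)
    prompt = (
        f"Generate {n} alternative search queries for: {query}\n"
        "Return one query per line with no numbering."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    variations = [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]
    return [query] + variations[:n]

def multi_query_search(query: str, run_search) -> list[tuple[str, float]]:
    # run_search(q) is an assumed callable returning a ranked list of doc IDs
    fused = defaultdict(float)
    for variant in expand_query(query):
        for rank, doc_id in enumerate(run_search(variant)):
            fused[doc_id] += 1 / (60 + rank + 1)   # same RRF formula as 30.2.2
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)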
30.2.13. Learned Sparse Retrieval: SPLADE
BM25 is “unsupervised” (just counts words). SPLADE (Sparse Lexical and Expansion Model) is a neural network that generates sparse vectors.
How SPLADE Works
It maps a sentence to a sparse vector of size 30,000 (vocabulary size), but only activates ~100 dimensions—crucially, it activates synonyms that weren’t in the text.
- Input: “The car is fast.”
- SPLADE Output: {"car": 2.1, "vehicle": 1.5, "fast": 1.8, "speed": 1.2}
- Note: It learned "vehicle" and "speed" even though they weren't in the input!
Using SPLADE in Production
SPLADE vectors can be stored in Elasticsearch or Redis just like BM25 vectors.
- Pros: Solves the “mismatch vocabulary” problem without dense vectors.
- Cons: Inference cost (BERT forward pass) during indexing and querying.
30.2.14. Ensemble Retrievers: The Kitchen Sink Approach
The most robust RAG systems use a “Voting” mechanism across different algorithms.
Architecture: The “Ensemble”
- Retriever A: Dense (OpenAI Embeddings). Captures semantic meaning.
- Retriever B: Sparse (BM25). Captures exact keywords.
- Retriever C: Domain-Specific (e.g., SQL Retriever for structured data).
Weighted Fusion Code
from collections import defaultdict

def normalize(scores):
    # Min-max scale a list of scores into [0, 1]
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 1.0 for s in scores]

def weighted_ensemble_search(query, weights={'dense': 0.7, 'sparse': 0.3}):
    # 1. Run Parallel Searches (vector_store / bm25_retriever are assumed retriever objects)
    dense_results = vector_store.similarity_search_with_score(query, k=50)
    sparse_results = bm25_retriever.get_relevant_documents(query)
    # 2. Normalize Scores (Min-Max Scaling)
    # Critical because Cosine is 0-1 but BM25 is 0-25
    dense_norm = normalize([score for doc, score in dense_results])
    # Assumes the sparse retriever attaches its BM25 score to doc.metadata['score']
    sparse_norm = normalize([doc.metadata['score'] for doc in sparse_results])
    # 3. Combine with the configured weights
    final_scores = defaultdict(float)
    for i, (doc, _) in enumerate(dense_results):
        final_scores[doc.page_content] += dense_norm[i] * weights['dense']
    for i, doc in enumerate(sparse_results):
        final_scores[doc.page_content] += sparse_norm[i] * weights['sparse']
    return sorted(final_scores.items(), key=lambda x: x[1], reverse=True)
30.2.15. Latency Benchmarking: Reranker Impact
Adding a Reranker is the biggest latency penalty in RAG. You must measure it.
Benchmark Results (Typical CPU)
| Model | Type | Latency (50 docs) | Quality (NDCG) |
|---|---|---|---|
| Cross-Encoder (Big) | BERT-Large | 800ms | 0.85 (SOTA) |
| Cross-Encoder (Small) | MiniLM-L6 | 150ms | 0.82 |
| ColBERT (Late Interaction) | PyTorch | 25ms | 0.84 |
| Bi-Encoder only | Cosine | 5ms | 0.70 |
Optimization Strategy
- The “Waterfall”:
- Query -> Bi-Encoder (Top 100).
- Fast Reranker (MiniLM) -> Top 20.
- Slow Reranker (GPT-4 or Big BERT) -> Top 5.
- Use ColBERT: Ideally, replace the Cross-Encoder with ColBERT for the best speed/quality ratio.
30.2.17. Case Study: Hybrid Search in E-Commerce (Home Furnishing)
E-commerce is the proving ground for Hybrid Search. Users search for high-intent keywords but expect semantic understanding.
The Problem
Users search for “Mid-century modern velvet sofa blue”.
- Vector Search: Returns “Blue mid-century chair” (Semantic match, wrong object).
- Keyword Search: Returns “Blue Velvet Shirt” (Keyword match, wrong category).
The Architecture: 3-Way Ensemble
They implemented a weighted ensemble using Elasticsearch:
- Dense: HNSW on title_embedding (Weight: 0.4).
  - Captures style (“Mid-century”).
- Sparse: BM25 on title and description (Weight: 0.3).
  - Captures specific materials (“Velvet”, “Blue”).
- Structured: Filter on category_id (Weight: 0.3).
  - Ensures the result is actually a “Sofa”.
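For concreteness, here is a hedged sketch of what the combined request could look like as an Elasticsearch 8.x search body. The index and field names (title_embedding, title, description, category_id) come from the case study; the weights are applied as boosts, and the es client plus the embedding helper are assumed to exist. Treat it as a shape illustration rather than a verified production query.
query_text = "Mid-century modern velvet sofa blue"
query_vec = get_embedding(query_text, dimensions=256)    # reuse any embedding helper (see 30.2.5)

hybrid_body = {
    "knn": {                                             # Dense leg (~0.4 weight via boost)
        "field": "title_embedding",
        "query_vector": query_vec,
        "k": 50,
        "num_candidates": 500,
        "boost": 0.4,
    },
    "query": {                                           # Sparse + structured legs
        "bool": {
            "should": [
                {"multi_match": {"query": query_text, "fields": ["title", "description"], "boost": 0.3}}
            ],
            "filter": [{"term": {"category_id": "sofa"}}],
        }
    },
}
# results = es.search(index="products", body=hybrid_body, size=20)  # es: elasticsearch.Elasticsearch client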
The “Reranking” Latency Trap
They tried a Cross-Encoder on the top 50 items. P99 latency spiked to 600ms.
- Fix: Distilled the Cross-Encoder into a gradient boosted tree (XGBoost) using simple features + one “similarity score” feature.
- Result: 50ms P99.
30.2.18. War Story: The “Zero Result” Crisis
“We deployed the new Hybrid RAG system. The CEO searched for ‘The IT Department Strategy’, and got ZERO results. Panic ensued.”
The Incident
- Search: “The IT Department Strategy”
- Vector Results: Returned 10 relevant docs.
- Keyword Results: Returned 0 docs.
- Hybrid Logic: AND operator between the Vector and Keyword result sets (Intersection).
Root Cause
The standard BM25 analyzer was configured with a Stop Word Filter that removed “The”, “IT”, “Department”.
- “IT” was considered a stop word (common usage).
- “Department” was considered generic.
- “Strategy” was the only term left.
- But the index didn’t match documents because the tokenizer stripped “IT” during ingestion but NOT during query time (configuration drift).
The Fix
- Change Logic: Switch to OR logic (Union) with RRF boosting for intersection. Never use a hard AND between modalities.
- Fix Tokenizer: Treating “IT” as a stop word (and stripping it) is a classic mistake in tech companies; keep domain-critical acronyms in the analyzer.
- Validation: Added a “Zero Result” monitor. If > 5% of queries have 0 results, alert the team.
Lesson: Intersection (AND) is dangerous in Hybrid Search. Always use Union (OR) + Ranking.
30.2.19. Interview Questions
Q1: Explain Reciprocal Rank Fusion (RRF). Why is it better than summing scores?
- Answer: BM25 scores are unbounded (0 to infinity), while Cosine Similarity is 0 to 1. Summing them allows BM25 to dominate. RRF ignores the absolute score and relies only on the rank position ($\frac{1}{k + rank}$). It is robust to different scale distributions.
Q2: When would you prefer a Bi-Encoder over a Cross-Encoder?
- Answer: For Stage 1 Retrieval. Bi-Encoders allow pre-computing embeddings for 10M documents, enabling fast approximate nearest-neighbor search that is sub-linear in corpus size. Cross-Encoders require a forward pass per $(Query, Doc)$ pair at runtime ($O(N)$), which is too slow for retrieval but perfect for Stage 2 Reranking.
Q3: How does SPLADE differ from BM25?
- Answer: BM25 relies on exact term matching. If the user types “car” and the doc says “auto”, BM25 fails (unless you add synonyms). SPLADE uses a BERT model to “expand” the document vector to include relevant synonyms (“auto”) during indexing, solving the vocabulary mismatch problem while keeping the vector sparse.
30.2.20. Summary: The Production Retrieval Stack
A production-grade RAG retrieval pipeline looks like this:
- Query Analysis: Rewriting/Expanding the query (HyDE).
- Hybrid Retrieval (Parallel):
- Vector Search: HNSW index, top-k=100.
- Keyword Search: BM25 index, top-k=100.
- Result Fusion: RRF to combine the two lists into a unified top-100.
- Reranking: Cross-Encoder (or ColBERT) to score the top-100.
- Selection: top-5 passed to Context Window.
This pipeline mitigates the hallucinations caused by “retrieving the wrong thing” and is the single biggest quality upgrade you can make to a RAG system.
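To tie the stack together, here is a minimal orchestration sketch. It assumes vector_search and keyword_search callables that each return a ranked list of document IDs, a get_text lookup for chunk text, and reuses the reciprocal_rank_fusion function from 30.2.2 plus the CrossEncoder from 30.2.3; the query-analysis step (HyDE / Multi-Query) is omitted for brevity.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_context(query, vector_search, keyword_search, get_text, final_k=5):
    # 1. Hybrid retrieval: both legs return ranked lists of doc IDs (top-100 each)
    dense_ids = vector_search(query, k=100)
    sparse_ids = keyword_search(query, k=100)
    # 2. Fuse the two lists with RRF (function from 30.2.2)
    fused = reciprocal_rank_fusion(dense_ids, sparse_ids)
    candidate_ids = [doc_id for doc_id, _ in fused[:100]]
    # 3. Rerank the fused candidates with the Cross-Encoder
    texts = [get_text(doc_id) for doc_id in candidate_ids]
    scores = reranker.predict([[query, text] for text in texts])
    ranked = sorted(zip(texts, scores), key=lambda x: x[1], reverse=True)
    # 4. Hand only the top-k passages to the LLM context window
    return [text for text, _ in ranked[:final_k]]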