39.4. Ranking & Multi-Objective Optimization
Status: Draft Version: 1.0.0 Tags: #RecSys, #Ranking, #Rust, #MultiTask, #XGBoost Author: MLOps Team
Table of Contents
- The Ranker: Where Precision Matters
- Ranking Architecture: Cross-Encoders
- Learning to Rank (LTR) Objectives
- Multi-Objective Optimization (MOO)
- Architecture: MMOE (Multi-Gate Mixture-of-Experts)
- Rust Implementation: The Scoring Engine
- Infrastructure: Real-Time Feature Store
- Calibration: Trusting the Probabilities
- Case Study: Ads Ranking
- Troubleshooting: Ranking Issues
- MLOps Interview Questions
- Glossary
- Summary Checklist
Prerequisites
Before diving into this chapter, ensure you have the following installed:
- Rust: 1.70+ (crates: `redis`, `rayon`)
- XGBoost: Understanding of Gradient Boosting Trees.
- Redis: For feature lookup.
The Ranker: Where Precision Matters
The Retrieval stage gave us 1,000 candidates. Now we have the budget to be smart. We can use Cross-Encoders (Deep Networks that take User and Item features together) and massive Feature Engineering.
| Stage | Count | Latency/Item | Model | Input |
|---|---|---|---|---|
| Retrieval | 100,000,000 | 10ns | Dot Product (ANN) | ID, Embeddings |
| Ranking | 1,000 | 10us | XGBoost / MLP | User History, Context, Item Stats |
| Re-Ranking | 50 | 1ms | Transformers | Business Logic |
Ranking Architecture: Cross-Encoders
Retrieval used Bi-Encoders (Two-Tower). Ranking uses Cross-Encoders.
Retrieval (Bi-Encoder): $$ Score = BERT(User) \cdot BERT(Item) $$ Fast, but it misses feature interactions: the User and Item representations never see each other until the final dot product.
Ranking (Cross-Encoder): $$ Score = BERT(Concat(User, Item)) $$ The network sees both sets of features in the first layer. It can capture non-linear interactions:
- “Harry Potter” implies “Fantasy”.
- User loves “Fantasy”.
- Therefore -> Match.
For 1000 items, Cross-Encoder inference takes ~50ms on GPU. This is acceptable for Ranking, but not Retrieval.
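To make the distinction concrete, here is a minimal sketch contrasting the two scoring paths (the types and the mocked `joint_model` closure are illustrative assumptions, not this chapter's production API):

```rust
/// Bi-encoder: user and item are encoded separately offline;
/// the only interaction is this final dot product.
fn score_bi_encoder(user_emb: &[f32], item_emb: &[f32]) -> f32 {
    user_emb.iter().zip(item_emb).map(|(u, i)| u * i).sum()
}

/// Cross-encoder: user and item features are concatenated and fed
/// through one joint network, so they interact from the first layer.
fn score_cross_encoder(
    user_feats: &[f32],
    item_feats: &[f32],
    joint_model: impl Fn(&[f32]) -> f32, // stand-in for a real network (ONNX, Torch, ...)
) -> f32 {
    let mut input = Vec::with_capacity(user_feats.len() + item_feats.len());
    input.extend_from_slice(user_feats);
    input.extend_from_slice(item_feats);
    joint_model(&input)
}
```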
Learning to Rank (LTR) Objectives
How do we train the Ranker? We don’t just want “Click / No Click”. We want “Item A is better than Item B”.
1. Pointwise
Treat it as Binary Classification. $L = LogLoss(y, \hat{y})$.
- Pros: Simple. Calibrated probabilities.
- Cons: Ignores the relative ordering. The model wastes capacity distinguishing 0.4 from 0.5 for items deep in the list that will never be shown.
2. Pairwise (RankNet, BPR)
Input: $(u, i, j)$ where $i$ is clicked, $j$ is not. Output: $\hat{s}_i > \hat{s}_j$.
- Loss: $L = -\log \sigma(\hat{s}_i - \hat{s}_j)$.
- Pros: Optimizes the ordering directly.
3. Listwise (LambdaRank, LambdaMART)
Optimize the NDCG (Normalized Discounted Cumulative Gain) directly. Each pairwise gradient is weighted by how much NDCG would change if items $i$ and $j$ were swapped:
- Lambda Gradient: $\lambda_{ij} = \frac{-\,|\Delta NDCG_{ij}|}{1 + e^{s_i - s_j}}$
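As a rough sketch (standalone helper functions, not a full training loop), the pointwise and pairwise losses above translate directly into code; the LambdaRank weight simply rescales the pairwise gradient by the NDCG delta of swapping the pair:

```rust
/// Pointwise: binary cross-entropy on a single (label, predicted probability) pair.
fn logloss(label: f32, prob: f32) -> f32 {
    let p = prob.clamp(1e-7, 1.0 - 1e-7);
    -(label * p.ln() + (1.0 - label) * (1.0 - p).ln())
}

/// Pairwise (BPR / RankNet with sigma = 1): the clicked item i should outscore
/// the unclicked item j. Loss = -log(sigmoid(s_i - s_j)) = log(1 + exp(-(s_i - s_j))).
fn pairwise_loss(score_i: f32, score_j: f32) -> f32 {
    (1.0 + (-(score_i - score_j)).exp()).ln()
}

/// LambdaRank-style weight: the pairwise gradient scaled by |ΔNDCG| of swapping i and j.
fn lambda_ij(score_i: f32, score_j: f32, delta_ndcg: f32) -> f32 {
    -delta_ndcg.abs() / (1.0 + (score_i - score_j).exp())
}
```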
Multi-Objective Optimization (MOO)
Engagement is not enough. “Clickbait” gets high clicks but low dwell time. We have multiple targets:
- Click (CTR): $P(Click)$
- Conversion (CVR): $P(Buy | Click)$
- Dwell Time: $E[Seconds]$
The Fusion Formula
We train a multi-head model (one head per task). Then we combine them at inference.
$$ Score = w_1 \cdot P(Click) + w_2 \cdot P(Buy) \cdot V(Price) + w_3 \cdot \log(Dwell) $$
The weights $w_i$ are business decisions (“We value a purchase as 100x a click”).
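For illustration, assume $w_1 = 1$, $w_2 = 5$, $w_3 = 0.2$ and a candidate with $P(Click) = 0.10$, $P(Buy) = 0.02$, $V(Price) = 40$, $Dwell = 30$ seconds:
$$ Score = 1 \cdot 0.10 + 5 \cdot 0.02 \cdot 40 + 0.2 \cdot \log(30) \approx 0.10 + 4.00 + 0.68 = 4.78 $$
The purchase term dominates, which is exactly the business preference the weights are meant to encode.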
Architecture: MMOE (Multi-Gate Mixture-of-Experts)
For conflicting tasks (e.g., “Click” vs “Dwell Time”), a Shared Bottom fails because the tasks' gradients conflict and partially cancel each other in the shared layers. MMOE uses “Experts” (sub-networks) and “Gates” (attention mechanisms) to route each task to the experts most relevant to it.
Task A (CTR) Output Task B (CVR) Output
^ ^
| |
+------+-------+ +-------+------+
| Gate Network | | Gate Network |
+------+-------+ +-------+------+
^ ^
| |
+------+--------------------------+------+
| Softmax Weights over Experts |
+------+-----------+-----------+---------+
| | |
[Expert 1] [Expert 2] [Expert 3]
^ ^ ^
+-----------+-----------+
|
Input Features
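The diagram translates into a very small forward pass. The sketch below uses plain `Vec<f32>` math with toy dimensions instead of a tensor library, and assumes the expert outputs and per-task gate logits have already been produced by their (learned) sub-networks:

```rust
/// Numerically stable softmax over gate logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// MMOE mixing step: each task combines the shared experts' outputs with its
/// own softmax gate; the result feeds that task's tower (CTR head, CVR head, ...).
fn mmoe_mix(expert_outputs: &[Vec<f32>], gate_logits_per_task: &[Vec<f32>]) -> Vec<Vec<f32>> {
    gate_logits_per_task
        .iter()
        .map(|logits| {
            let gate = softmax(logits); // task-specific weights over experts
            let dim = expert_outputs[0].len();
            let mut mixed = vec![0.0; dim];
            for (w, expert) in gate.iter().zip(expert_outputs) {
                for (m, e) in mixed.iter_mut().zip(expert) {
                    *m += w * e; // weighted sum of expert outputs
                }
            }
            mixed
        })
        .collect()
}
```

Because each task learns its own gate, “Click” can lean on Experts 1 and 2 while “Dwell Time” leans on Expert 3, instead of fighting over one shared representation.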
Rust Implementation: The Scoring Engine
In production, the Ranker is CPU-bound. We need highly optimized feature crossing and dot products.
Project Structure
ranker-engine/
├── Cargo.toml
└── src/
└── lib.rs
Cargo.toml:
[package]
name = "ranker-engine"
version = "0.1.0"
edition = "2021"
[dependencies]
rayon = "1.7"
serde = { version = "1.0", features = ["derive"] }
redis = "0.23"
src/lib.rs:
//! High-Performance Scoring Engine (Ranker).
//! Handles Feature Crossing, Model Inference, and Multi-Objective Fusion.
use rayon::prelude::*;
use std::collections::HashMap;
/// A dense feature vector for a single candidate item.
/// In production, this would be a flat float array or a sparse vector.
#[derive(Debug, Clone)]
pub struct RankerContext {
pub user_age: f32,
pub user_hist_ctr: f32,
pub item_price: f32,
pub item_ctr: f32,
// ... 100s of features
}
#[derive(Debug, Clone)]
pub struct RankedItem {
pub item_id: String,
pub final_score: f32,
pub scores: HashMap<String, f32>, // sub-scores for debugging
}
/// Fusion weights defined by the Product Manager, typically loaded dynamically from ZooKeeper/Consul
pub struct ObjectiveWeights {
pub click_weight: f32,
pub conversion_weight: f32,
pub revenue_weight: f32,
}
pub struct ScoringEngine {
weights: ObjectiveWeights,
}
impl ScoringEngine {
pub fn new(weights: ObjectiveWeights) -> Self {
Self { weights }
}
/// Mock prediction function (Replace with ONNX call or Tree Traversal)
/// This simulates the "Shared Bottom" output being routed to a head.
fn predict_ctr(&self, ctx: &RankerContext) -> f32 {
// Logistic Sigmoid
let logit = 0.5 * ctx.user_hist_ctr + 0.5 * ctx.item_ctr;
1.0 / (1.0 + (-logit).exp())
}
fn predict_cvr(&self, ctx: &RankerContext) -> f32 {
let logit = -0.1 * ctx.item_price + 0.1 * ctx.user_age;
1.0 / (1.0 + (-logit).exp())
}
/// Parallel Ranking of Candidates.
/// Uses Rayon to split the workload across CPU cores.
pub fn rank(&self, candidates: Vec<(String, RankerContext)>) -> Vec<RankedItem> {
// Parallel Scoring via Rayon
let mut results: Vec<RankedItem> = candidates
.into_par_iter()
.map(|(id, ctx)| {
// 1. Model Inference
let p_click = self.predict_ctr(&ctx);
let p_conv = self.predict_cvr(&ctx);
let expected_revenue = p_conv * ctx.item_price;
// 2. Fusion Logic (Weighted Sum)
// Score = w1*CTR + w2*CVR + w3*Rev
let final_score =
self.weights.click_weight * p_click +
self.weights.conversion_weight * p_conv +
self.weights.revenue_weight * expected_revenue;
let mut scores = HashMap::new();
scores.insert("p_click".to_string(), p_click);
scores.insert("p_conv".to_string(), p_conv);
RankedItem {
item_id: id,
final_score,
scores,
}
})
.collect();
// 3. Sort descending by final score (total_cmp avoids panics on NaN scores)
results.sort_by(|a, b| b.final_score.total_cmp(&a.final_score));
results
}
}
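A quick usage sketch of the engine (the weights and feature values are placeholders):

```rust
use ranker_engine::{ObjectiveWeights, RankerContext, ScoringEngine};

fn main() {
    let engine = ScoringEngine::new(ObjectiveWeights {
        click_weight: 1.0,
        conversion_weight: 5.0,
        revenue_weight: 0.1,
    });

    // In production the candidates come from the Retrieval stage,
    // with features hydrated from the feature store (next section).
    let candidates = vec![
        ("item_42".to_string(), RankerContext { user_age: 31.0, user_hist_ctr: 0.12, item_price: 19.9, item_ctr: 0.08 }),
        ("item_07".to_string(), RankerContext { user_age: 31.0, user_hist_ctr: 0.12, item_price: 99.0, item_ctr: 0.03 }),
    ];

    for item in engine.rank(candidates) {
        println!("{} -> {:.4} {:?}", item.item_id, item.final_score, item.scores);
    }
}
```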
Infrastructure: Real-Time Feature Store
The Ranker needs inputs. It cannot compute “User Avg Spend in last 30d” on the fly. It fetches it from Redis.
redis_feature_store.rs:
use redis::RedisResult;

/// Fetch per-item CTR features for a batch of candidate items.
/// User-level features would be fetched in the same pipeline; omitted here for brevity.
pub fn fetch_features(
    con: &mut redis::Connection,
    _user_id: &str,
    item_ids: &[String],
) -> RedisResult<Vec<f32>> {
    let mut pipe = redis::pipe();
    // Pipelined HGETs to minimize RTT (Round Trip Time)
    for item in item_ids {
        let key = format!("feature:item:{}", item);
        pipe.hget(key, "ctr");
    }
    // Execute all commands in one round trip.
    // A missing hash field surfaces as a parse error; production code would fill defaults.
    pipe.query(con)
}
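Call-site sketch (assumes a local Redis instance whose `feature:item:{id}` hashes are populated by an offline or streaming job):

```rust
fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;

    let item_ids = vec!["item_42".to_string(), "item_07".to_string()];
    let ctrs = fetch_features(&mut con, "user_123", &item_ids)?;
    println!("item CTRs: {:?}", ctrs);
    Ok(())
}
```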
Calibration: Trusting the Probabilities
The fusion formula assumes $P(Click)$ is a real probability (0.0 to 1.0). Ranking models (especially Tree models) output uncalibrated scores. If Model A says 0.8 and Model B says 0.2, but Model A is uncalibrated and really means 0.3, your fusion is broken.
Isotonic Regression
We fit a monotonic function $f(\text{score}) \rightarrow \text{true probability}$ on a holdout set.
- Binning: Group items by score (e.g., 0.1-0.2).
- Counting: Calculate actual CTR in that bin.
- Mapping: Create a lookup table.
Rust Implementation (Isotonic Inference):
/// Map a raw model score to a calibrated probability using a sorted
/// (raw_score, empirical_ctr) lookup table.
fn calibrate(raw_score: f32, calibration_map: &[(f32, f32)]) -> f32 {
    // Binary search for the surrounding bins, then linearly interpolate.
    let i = calibration_map.partition_point(|&(raw, _)| raw < raw_score);
    if i == 0 {
        return calibration_map[0].1;
    }
    if i == calibration_map.len() {
        return calibration_map[i - 1].1;
    }
    let (x0, y0) = calibration_map[i - 1];
    let (x1, y1) = calibration_map[i];
    y0 + (y1 - y0) * (raw_score - x0) / (x1 - x0)
}
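For completeness, here is a sketch of building the calibration map itself, following the Binning / Counting / Mapping steps above (the bin width is an assumption; strict isotonic regression would additionally enforce monotonicity, e.g. via pool-adjacent-violators, which plain binning does not):

```rust
/// Build a sorted (bin_center_raw_score, empirical_ctr) lookup table
/// from holdout pairs of (raw_score, clicked).
fn build_calibration_map(holdout: &[(f32, bool)], n_bins: usize) -> Vec<(f32, f32)> {
    let mut clicks = vec![0u64; n_bins];
    let mut counts = vec![0u64; n_bins];
    for &(score, clicked) in holdout {
        // Map a score in [0, 1) to its bin index.
        let bin = ((score.clamp(0.0, 0.9999) * n_bins as f32) as usize).min(n_bins - 1);
        counts[bin] += 1;
        clicks[bin] += clicked as u64;
    }
    (0..n_bins)
        .filter(|&b| counts[b] > 0) // skip empty bins
        .map(|b| {
            let center = (b as f32 + 0.5) / n_bins as f32;
            (center, clicks[b] as f32 / counts[b] as f32)
        })
        .collect()
}
```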
Case Study: Ads Ranking
In Ads, Ranking is not just relevance. It is an Auction. $$ Score = Bid \times P(Click) $$ This score is proportional to eCPM (expected Cost Per Mille, i.e., expected revenue per 1,000 impressions).
- GSP (Generalized Second Price): The winner pays the minimum needed to keep their slot: the runner-up's rank score ($Bid \times Quality$) divided by the winner's own Quality Score.
- Quality Score: $P(Click)$.
- Result: High quality ads pay less for the same position.
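A compact sketch of that auction logic (the struct and field names are illustrative):

```rust
struct AdCandidate {
    ad_id: String,
    bid: f32,     // advertiser's CPC bid
    p_click: f32, // calibrated P(Click), acting as the quality score
}

/// Rank by eCPM-style score (Bid x pCTR) and charge the winner the GSP price:
/// the runner-up's rank score divided by the winner's own quality score.
fn run_auction(mut ads: Vec<AdCandidate>) -> Option<(String, f32)> {
    ads.sort_by(|a, b| (b.bid * b.p_click).total_cmp(&(a.bid * a.p_click)));
    let winner = ads.first()?;
    let price = match ads.get(1) {
        Some(runner_up) => (runner_up.bid * runner_up.p_click) / winner.p_click,
        None => 0.0, // no competition; a reserve price would apply in practice
    };
    Some((winner.ad_id.clone(), price))
}
```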
Troubleshooting: Ranking Issues
Scenario 1: Feature Leakage (Gifts from future)
- Symptom: Offline AUC is 0.99, Online AUC is 0.60.
- Cause: You included a feature like `total_clicks` which included clicks from after the prediction time in your training set.
- Fix: Use “Point-in-Time” joins in your Feature Store.
Scenario 2: Rank Reversals
- Symptom: Item A > B. Add Item C. Now B > A.
- Cause: Listwise Loss function instability (Softmax normalization over the whole list).
- Fix: Switch to Pairwise (BPR) or ensure consistent batch sizes.
Scenario 3: Calibrated Probabilities drift
- Symptom: Fusion weights stop working.
- Cause: Data distribution changed (Christmas shopping). Calibration map is stale.
- Fix: Re-calibrate daily on the last 24h of data.
MLOps Interview Questions
- Q: What is the difference between Pointwise, Pairwise, and Listwise ranking? A: Pointwise predicts score $s_i$ independently (Regression). Pairwise predicts $s_i > s_j$ (Classification). Listwise optimizes the entire permutation (NDCG). Listwise is the most powerful but the hardest to train.
- Q: How do you handle “Position Bias” in ranking training? A: Add `position` as a feature during training. During inference, set `position = 0` (top rank) for all items. This teaches the model to predict the click probability as if the item were at the top.
- Q: Why use Multi-Task Learning instead of separate models? A: 1. Saves inference compute (Shared Encoder). 2. Regularization (learning CVR helps learn CTR because features are shared). 3. Solves data sparsity for lower-funnel tasks (e.g., Purchase) by leveraging high-volume tasks (Click).
- Q: What is “Calibration”? A: Ensuring that if the model predicts 0.7, the event happens 70% of the time. Crucial for MOO (combining scores) and Bidding.
- Q: How do you debug “Rank Reversals”? A: If item A > B, but adding item C makes B > A, suspect Softmax-based listwise losses. Check the consistency of the scoring function.
Glossary
- LTR (Learning to Rank): ML technique to optimize the order of a list.
- MOO (Multi-Objective Optimization): Balancing conflicting goals (Clicks vs Revenue).
- MMOE (Multi-Gate Mixture-of-Experts): Neural architecture for MTL resolving task conflicts.
- NDCG: Metric rewarding relevant items appearing higher in the list.
- Cross-Encoder: Model processing User and Item features jointly (Slow, Accurate).
- eCPM: Expected Cost Per Mille (1000 impressions). Ad ranking metric.
Summary Checklist
- Calibration: Always calibrate model outputs before tuning fusion weights.
- Metrics: Don't use Accuracy. Use ranking metrics like NDCG@10 or MRR.
- Feature Consistency: Ensure specific features (e.g., User Age) are available at inference time with <5ms latency.
- Shared Bottom: Start with a Shared-Bottom MTL model for CTR/CVR. Move to MMOE if tasks conflict heavily.
- Business Rules: Keep the final “Re-Ranking” logic (filtering illegal items, boosting sponsored) separate from the ML Ranker score.
- Logging: Log the `final_score` and all `sub_scores` for offline analysis of the fusion weights.
- Latency: Ranking must happen in < 50ms. Use CPU-optimized trees (XGBoost/LightGBM) or distilled Student networks.
- Features: Use “Interaction Features” (e.g., “User Category” x “Item Category” match).
- Warm-up: When deploying a new Ranker, run in Shadow Mode to verify calibration before enabling actions.
- Explainability: Use SHAP values on the Ranker to understand why item X was #1.