39.4. Ranking & Multi-Objective Optimization
Status: Draft Version: 1.0.0 Tags: #RecSys, #Ranking, #Rust, #MultiTask, #XGBoost Author: MLOps Team
Table of Contents
- The Ranker: Where Precision Matters
- Ranking Architecture: Cross-Encoders
- Learning to Rank (LTR) Objectives
- Multi-Objective Optimization (MOO)
- Architecture: MMOE (Multi-Gate Mixture-of-Experts)
- Rust Implementation: The Scoring Engine
- Infrastructure: Real-Time Feature Store
- Calibration: Trusting the Probabilities
- Case Study: Ads Ranking
- Troubleshooting: Ranking Issues
- MLOps Interview Questions
- Glossary
- Summary Checklist
Prerequisites
Before diving into this chapter, ensure you have the following installed:
- Rust: 1.70+ (crates: `redis`, `rayon`)
- XGBoost: Understanding of Gradient Boosting Trees.
- Redis: For feature lookup.
The Ranker: Where Precision Matters
The Retrieval stage gave us 1,000 candidates. Now we have the budget to be smart. We can use Cross-Encoders (Deep Networks that take User and Item features together) and massive Feature Engineering.
| Stage | Count | Latency/Item | Model | Input |
|---|---|---|---|---|
| Retrieval | 100,000,000 | 10ns | Dot Product (ANN) | ID, Embeddings |
| Ranking | 1,000 | 10us | XGBoost / MLP | User History, Context, Item Stats |
| Re-Ranking | 50 | 1ms | Transformers | Business Logic |
Ranking Architecture: Cross-Encoders
Retrieval used Bi-Encoders (Two-Tower). Ranking uses Cross-Encoders.
Retrieval (Bi-Encoder): $$ Score = BERT(User) \cdot BERT(Item) $$ Fast, but it misses feature interactions: the User and Item representations never see each other until the final dot product.
Ranking (Cross-Encoder): $$ Score = BERT(Concat(User, Item)) $$ The network sees both sets of features in the first layer. It can capture non-linear interactions:
- “Harry Potter” implies “Fantasy”.
- User loves “Fantasy”.
- Therefore -> Match.
For 1000 items, Cross-Encoder inference takes ~50ms on GPU. This is acceptable for Ranking, but not Retrieval.
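To make the distinction concrete, here is a minimal sketch contrasting the two scoring paths (the types and the mocked `joint_model` closure are illustrative assumptions, not this chapter's production API):

```rust
/// Bi-encoder: user and item are encoded separately offline;
/// the only interaction is this final dot product.
fn score_bi_encoder(user_emb: &[f32], item_emb: &[f32]) -> f32 {
    user_emb.iter().zip(item_emb).map(|(u, i)| u * i).sum()
}

/// Cross-encoder: user and item features are concatenated and fed
/// through one joint network, so they interact from the first layer.
fn score_cross_encoder(
    user_feats: &[f32],
    item_feats: &[f32],
    joint_model: impl Fn(&[f32]) -> f32, // stand-in for a real network (ONNX, Torch, ...)
) -> f32 {
    let mut input = Vec::with_capacity(user_feats.len() + item_feats.len());
    input.extend_from_slice(user_feats);
    input.extend_from_slice(item_feats);
    joint_model(&input)
}
```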
Learning to Rank (LTR) Objectives
How do we train the Ranker? We don’t just want “Click / No Click”. We want “Item A is better than Item B”.
1. Pointwise
Treat it as Binary Classification. $L = LogLoss(y, \hat{y})$.
- Pros: Simple. Calibrated probabilities.
- Cons: Ignores the relative ordering. The model wastes capacity distinguishing 0.4 from 0.5 for items deep in the list that will never be shown.
2. Pairwise (RankNet, BPR)
Input: $(u, i, j)$ where $i$ is clicked, $j$ is not. Output: $\hat{s}_i > \hat{s}_j$.
- Loss: $L = -\log \sigma(\hat{s}_i - \hat{s}_j)$.
- Pros: Optimizes the ordering directly.
3. Listwise (LambdaRank, LambdaMART)
Optimize the NDCG (Normalized Discounted Cumulative Gain) directly. Each pairwise gradient is weighted by how much NDCG would change if items $i$ and $j$ were swapped:
- Lambda Gradient: $\lambda_{ij} = \frac{-\,|\Delta NDCG_{ij}|}{1 + e^{s_i - s_j}}$
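As a rough sketch (standalone helper functions, not a full training loop), the pointwise and pairwise losses above translate directly into code; the LambdaRank weight simply rescales the pairwise gradient by the NDCG delta of swapping the pair:

```rust
/// Pointwise: binary cross-entropy on a single (label, predicted probability) pair.
fn logloss(label: f32, prob: f32) -> f32 {
    let p = prob.clamp(1e-7, 1.0 - 1e-7);
    -(label * p.ln() + (1.0 - label) * (1.0 - p).ln())
}

/// Pairwise (BPR / RankNet with sigma = 1): the clicked item i should outscore
/// the unclicked item j. Loss = -log(sigmoid(s_i - s_j)) = log(1 + exp(-(s_i - s_j))).
fn pairwise_loss(score_i: f32, score_j: f32) -> f32 {
    (1.0 + (-(score_i - score_j)).exp()).ln()
}

/// LambdaRank-style weight: the pairwise gradient scaled by |ΔNDCG| of swapping i and j.
fn lambda_ij(score_i: f32, score_j: f32, delta_ndcg: f32) -> f32 {
    -delta_ndcg.abs() / (1.0 + (score_i - score_j).exp())
}
```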
Multi-Objective Optimization (MOO)
Engagement is not enough. “Clickbait” gets high clicks but low dwell time. We have multiple targets:
- Click (CTR): $P(Click)$
- Conversion (CVR): $P(Buy | Click)$
- Dwell Time: $E[Seconds]$
The Fusion Formula
We train a multi-head model (one head per task). Then we combine them at inference.
$$ Score = w_1 \cdot P(Click) + w_2 \cdot P(Buy) \cdot V(Price) + w_3 \cdot \log(Dwell) $$
The weights $w_i$ are business decisions (“We value a purchase as 100x a click”).
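For illustration, assume $w_1 = 1$, $w_2 = 5$, $w_3 = 0.2$ and a candidate with $P(Click) = 0.10$, $P(Buy) = 0.02$, $V(Price) = 40$, $Dwell = 30$ seconds:
$$ Score = 1 \cdot 0.10 + 5 \cdot 0.02 \cdot 40 + 0.2 \cdot \log(30) \approx 0.10 + 4.00 + 0.68 = 4.78 $$
The purchase term dominates, which is exactly the business preference the weights are meant to encode.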
Architecture: MMOE (Multi-Gate Mixture-of-Experts)
For conflicting tasks (e.g., “Click” vs “Dwell Time”), a Shared Bottom fails because the tasks' gradients conflict and partially cancel each other in the shared layers. MMOE uses “Experts” (sub-networks) and “Gates” (attention mechanisms) to route each task to the experts most relevant to it.
Task A (CTR) Output Task B (CVR) Output
^ ^
| |
+------+-------+ +-------+------+
| Gate Network | | Gate Network |
+------+-------+ +-------+------+
^ ^
| |
+------+--------------------------+------+
| Softmax Weights over Experts |
+------+-----------+-----------+---------+
| | |
[Expert 1] [Expert 2] [Expert 3]
^ ^ ^
+-----------+-----------+
|
Input Features
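The diagram translates into a very small forward pass. The sketch below uses plain `Vec<f32>` math with toy dimensions instead of a tensor library, and assumes the expert outputs and per-task gate logits have already been produced by their (learned) sub-networks:

```rust
/// Numerically stable softmax over gate logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// MMOE mixing step: each task combines the shared experts' outputs with its
/// own softmax gate; the result feeds that task's tower (CTR head, CVR head, ...).
fn mmoe_mix(expert_outputs: &[Vec<f32>], gate_logits_per_task: &[Vec<f32>]) -> Vec<Vec<f32>> {
    gate_logits_per_task
        .iter()
        .map(|logits| {
            let gate = softmax(logits); // task-specific weights over experts
            let dim = expert_outputs[0].len();
            let mut mixed = vec![0.0; dim];
            for (w, expert) in gate.iter().zip(expert_outputs) {
                for (m, e) in mixed.iter_mut().zip(expert) {
                    *m += w * e; // weighted sum of expert outputs
                }
            }
            mixed
        })
        .collect()
}
```

Because each task learns its own gate, “Click” can lean on Experts 1 and 2 while “Dwell Time” leans on Expert 3, instead of fighting over one shared representation.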
Rust Implementation: The Scoring Engine
In production, the Ranker is CPU-bound. We need highly optimized feature crossing and dot products.
Project Structure
ranker-engine/
├── Cargo.toml
└── src/
└── lib.rs
Cargo.toml:
[package]
name = "ranker-engine"
version = "0.1.0"
edition = "2021"
[dependencies]
rayon = "1.7"
serde = { version = "1.0", features = ["derive"] }
redis = "0.23"
src/lib.rs:
//! High-Performance Scoring Engine (Ranker).
//! Handles Feature Crossing, Model Inference, and Multi-Objective Fusion.
use rayon::prelude::*;
use std::collections::HashMap;
/// A dense feature vector for a single candidate item.
/// In production, this would be a flat float array or a sparse vector.
#[derive(Debug, Clone)]
pub struct RankerContext {
pub user_age: f32,
pub user_hist_ctr: f32,
pub item_price: f32,
pub item_ctr: f32,
// ... 100s of features
}
#[derive(Debug, Clone)]
pub struct RankedItem {
pub item_id: String,
pub final_score: f32,
pub scores: HashMap<String, f32>, // sub-scores for debugging
}
/// Fusion weights defined by the Product Manager, typically loaded dynamically from ZooKeeper/Consul
pub struct ObjectiveWeights {
pub click_weight: f32,
pub conversion_weight: f32,
pub revenue_weight: f32,
}
pub struct ScoringEngine {
weights: ObjectiveWeights,
}
impl ScoringEngine {
pub fn new(weights: ObjectiveWeights) -> Self {
Self { weights }
}
/// Mock prediction function (Replace with ONNX call or Tree Traversal)
/// This simulates the "Shared Bottom" output being routed to a head.
fn predict_ctr(&self, ctx: &RankerContext) -> f32 {
// Logistic Sigmoid
let logit = 0.5 * ctx.user_hist_ctr + 0.5 * ctx.item_ctr;
1.0 / (1.0 + (-logit).exp())
}
fn predict_cvr(&self, ctx: &RankerContext) -> f32 {
let logit = -0.1 * ctx.item_price + 0.1 * ctx.user_age;
1.0 / (1.0 + (-logit).exp())
}
/// Parallel Ranking of Candidates.
/// Uses Rayon to split the workload across CPU cores.
pub fn rank(&self, candidates: Vec<(String, RankerContext)>) -> Vec<RankedItem> {
// Parallel Scoring via Rayon
let mut results: Vec<RankedItem> = candidates
.into_par_iter()
.map(|(id, ctx)| {
// 1. Model Inference
let p_click = self.predict_ctr(&ctx);
let p_conv = self.predict_cvr(&ctx);
let expected_revenue = p_conv * ctx.item_price;
// 2. Fusion Logic (Weighted Sum)
// Score = w1*CTR + w2*CVR + w3*Rev
let final_score =
self.weights.click_weight * p_click +
self.weights.conversion_weight * p_conv +
self.weights.revenue_weight * expected_revenue;
let mut scores = HashMap::new();
scores.insert("p_click".to_string(), p_click);
scores.insert("p_conv".to_string(), p_conv);
RankedItem {
item_id: id,
final_score,
scores,
}
})
.collect();
// 3. Sort descending by final score (total_cmp avoids panics on NaN scores)
results.sort_by(|a, b| b.final_score.total_cmp(&a.final_score));
results
}
}
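A quick usage sketch of the engine (the weights and feature values are placeholders):

```rust
use ranker_engine::{ObjectiveWeights, RankerContext, ScoringEngine};

fn main() {
    let engine = ScoringEngine::new(ObjectiveWeights {
        click_weight: 1.0,
        conversion_weight: 5.0,
        revenue_weight: 0.1,
    });

    // In production the candidates come from the Retrieval stage,
    // with features hydrated from the feature store (next section).
    let candidates = vec![
        ("item_42".to_string(), RankerContext { user_age: 31.0, user_hist_ctr: 0.12, item_price: 19.9, item_ctr: 0.08 }),
        ("item_07".to_string(), RankerContext { user_age: 31.0, user_hist_ctr: 0.12, item_price: 99.0, item_ctr: 0.03 }),
    ];

    for item in engine.rank(candidates) {
        println!("{} -> {:.4} {:?}", item.item_id, item.final_score, item.scores);
    }
}
```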
Infrastructure: Real-Time Feature Store
The Ranker needs inputs. It cannot compute “User Avg Spend in last 30d” on the fly. It fetches it from Redis.
redis_feature_store.rs:
use redis::RedisResult;

/// Fetch per-item CTR features for a batch of candidate items.
/// User-level features would be fetched in the same pipeline; omitted here for brevity.
pub fn fetch_features(
    con: &mut redis::Connection,
    _user_id: &str,
    item_ids: &[String],
) -> RedisResult<Vec<f32>> {
    let mut pipe = redis::pipe();
    // Pipelined HGETs to minimize RTT (Round Trip Time)
    for item in item_ids {
        let key = format!("feature:item:{}", item);
        pipe.hget(key, "ctr");
    }
    // Execute all commands in one round trip.
    // A missing hash field surfaces as a parse error; production code would fill defaults.
    pipe.query(con)
}
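Call-site sketch (assumes a local Redis instance whose `feature:item:{id}` hashes are populated by an offline or streaming job):

```rust
fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;

    let item_ids = vec!["item_42".to_string(), "item_07".to_string()];
    let ctrs = fetch_features(&mut con, "user_123", &item_ids)?;
    println!("item CTRs: {:?}", ctrs);
    Ok(())
}
```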
Calibration: Trusting the Probabilities
The fusion formula assumes $P(Click)$ is a real probability (0.0 to 1.0). Ranking models (especially Tree models) output uncalibrated scores. If Model A says 0.8 and Model B says 0.2, but Model A is uncalibrated and really means 0.3, your fusion is broken.
Isotonic Regression
We fit a monotonic function $f(\text{score}) \rightarrow \text{true probability}$ on a holdout set.
- Binning: Group items by score (e.g., 0.1-0.2).
- Counting: Calculate actual CTR in that bin.
- Mapping: Create a lookup table.
Rust Implementation (Isotonic Inference):
/// Map a raw model score to a calibrated probability using a sorted
/// (raw_score, empirical_ctr) lookup table.
fn calibrate(raw_score: f32, calibration_map: &[(f32, f32)]) -> f32 {
    // Binary search for the surrounding bins, then linearly interpolate.
    let i = calibration_map.partition_point(|&(raw, _)| raw < raw_score);
    if i == 0 {
        return calibration_map[0].1;
    }
    if i == calibration_map.len() {
        return calibration_map[i - 1].1;
    }
    let (x0, y0) = calibration_map[i - 1];
    let (x1, y1) = calibration_map[i];
    y0 + (y1 - y0) * (raw_score - x0) / (x1 - x0)
}
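For completeness, here is a sketch of building the calibration map itself, following the Binning / Counting / Mapping steps above (the bin width is an assumption; strict isotonic regression would additionally enforce monotonicity, e.g. via pool-adjacent-violators, which plain binning does not):

```rust
/// Build a sorted (bin_center_raw_score, empirical_ctr) lookup table
/// from holdout pairs of (raw_score, clicked).
fn build_calibration_map(holdout: &[(f32, bool)], n_bins: usize) -> Vec<(f32, f32)> {
    let mut clicks = vec![0u64; n_bins];
    let mut counts = vec![0u64; n_bins];
    for &(score, clicked) in holdout {
        // Map a score in [0, 1) to its bin index.
        let bin = ((score.clamp(0.0, 0.9999) * n_bins as f32) as usize).min(n_bins - 1);
        counts[bin] += 1;
        clicks[bin] += clicked as u64;
    }
    (0..n_bins)
        .filter(|&b| counts[b] > 0) // skip empty bins
        .map(|b| {
            let center = (b as f32 + 0.5) / n_bins as f32;
            (center, clicks[b] as f32 / counts[b] as f32)
        })
        .collect()
}
```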
Case Study: Ads Ranking
In Ads, Ranking is not just relevance. It is an Auction. $$ Score = Bid \times P(Click) $$ This score is proportional to eCPM (expected Cost Per Mille, i.e., expected revenue per 1,000 impressions).
- GSP (Generalized Second Price): The winner pays the minimum needed to keep their slot: the runner-up's rank score ($Bid \times Quality$) divided by the winner's own Quality Score.
- Quality Score: $P(Click)$.
- Result: High quality ads pay less for the same position.
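A compact sketch of that auction logic (the struct and field names are illustrative):

```rust
struct AdCandidate {
    ad_id: String,
    bid: f32,     // advertiser's CPC bid
    p_click: f32, // calibrated P(Click), acting as the quality score
}

/// Rank by eCPM-style score (Bid x pCTR) and charge the winner the GSP price:
/// the runner-up's rank score divided by the winner's own quality score.
fn run_auction(mut ads: Vec<AdCandidate>) -> Option<(String, f32)> {
    ads.sort_by(|a, b| (b.bid * b.p_click).total_cmp(&(a.bid * a.p_click)));
    let winner = ads.first()?;
    let price = match ads.get(1) {
        Some(runner_up) => (runner_up.bid * runner_up.p_click) / winner.p_click,
        None => 0.0, // no competition; a reserve price would apply in practice
    };
    Some((winner.ad_id.clone(), price))
}
```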
Troubleshooting: Ranking Issues
Scenario 1: Feature Leakage (Gifts from future)
- Symptom: Offline AUC is 0.99, Online AUC is 0.60.
- Cause: You included a feature like `total_clicks` which included clicks from after the prediction time in your training set.
- Fix: Use “Point-in-Time” joins in your Feature Store.
Scenario 2: Rank Reversals
- Symptom: Item A > B. Add Item C. Now B > A.
- Cause: Listwise Loss function instability (Softmax normalization over the whole list).
- Fix: Switch to Pairwise (BPR) or ensure consistent batch sizes.
Scenario 3: Calibrated Probabilities drift
- Symptom: Fusion weights stop working.
- Cause: Data distribution changed (Christmas shopping). Calibration map is stale.
- Fix: Re-calibrate daily on the last 24h of data.
MLOps Interview Questions
- Q: What is the difference between Pointwise, Pairwise, and Listwise ranking? A: Pointwise predicts score $s_i$ independently (Regression). Pairwise predicts $s_i > s_j$ (Classification). Listwise optimizes the entire permutation (NDCG). Listwise is the most powerful but the hardest to train.
- Q: How do you handle “Position Bias” in ranking training? A: Add `position` as a feature during training. During inference, set `position = 0` (top rank) for all items. This teaches the model to predict the click probability as if the item were at the top.
- Q: Why use Multi-Task Learning instead of separate models? A: 1. Saves inference compute (Shared Encoder). 2. Regularization (learning CVR helps learn CTR because features are shared). 3. Solves data sparsity for lower-funnel tasks (e.g., Purchase) by leveraging high-volume tasks (Click).
- Q: What is “Calibration”? A: Ensuring that if the model predicts 0.7, the event happens 70% of the time. Crucial for MOO (combining scores) and Bidding.
- Q: How do you debug “Rank Reversals”? A: If item A > B, but adding item C makes B > A, suspect Softmax-based listwise losses. Check the consistency of the scoring function.
Glossary
- LTR (Learning to Rank): ML technique to optimize the order of a list.
- MOO (Multi-Objective Optimization): Balancing conflicting goals (Clicks vs Revenue).
- MMOE (Multi-Gate Mixture-of-Experts): Neural architecture for MTL resolving task conflicts.
- NDCG: Metric rewarding relevant items appearing higher in the list.
- Cross-Encoder: Model processing User and Item features jointly (Slow, Accurate).
- eCPM: Expected Cost Per Mille (1000 impressions). Ad ranking metric.
Summary Checklist
- Calibration: Always calibrate model outputs before tuning fusion weights.
- Metrics: Don't use Accuracy. Use ranking metrics like NDCG@10 or MRR.
- Feature Consistency: Ensure specific features (e.g., User Age) are available at inference time with <5ms latency.
- Shared Bottom: Start with a Shared-Bottom MTL model for CTR/CVR. Move to MMOE if tasks conflict heavily.
- Business Rules: Keep the final “Re-Ranking” logic (filtering illegal items, boosting sponsored) separate from the ML Ranker score.
- Logging: Log the `final_score` and all `sub_scores` for offline analysis of the fusion weights.
- Latency: Ranking must happen in < 50ms. Use CPU-optimized trees (XGBoost/LightGBM) or distilled Student networks.
- Features: Use “Interaction Features” (e.g., “User Category” x “Item Category” match).
- Warm-up: When deploying a new Ranker, run in Shadow Mode to verify calibration before enabling actions.
- Explainability: Use SHAP values on the Ranker to understand why item X was #1.