38.3. Offline RL & Counterfactual Evaluation

Status: Draft Version: 1.0.0 Tags: #RLOps, #OfflineRL, #OPE, #Rust Author: MLOps Team


Table of Contents

  1. The Core Problem: Distribution Shift
  2. Off-Policy Evaluation (OPE)
  3. Importance Sampling (IS)
  4. Doubly Robust (DR) Estimation
  5. Conservative Q-Learning (CQL)
  6. Dataset Curation Pipeline
  7. The OPE Dashboard
  8. Visualizing Propensity Overlap
  9. Future Directions: Decision Transformers
  10. Glossary
  11. Summary Checklist

Prerequisites

Before diving into this chapter, ensure you have the following installed:

  • Rust: 1.70+
  • Parquet Tools: For inspecting logs (parquet-tools)
  • Python: 3.10+ (NumPy, Matplotlib)
  • WandB/MLflow: For tracking experiments

The Core Problem: Distribution Shift

Online RL is dangerous. If you deploy a random agent to a datacenter cooling system to “learn,” it will overheat the servers. Offline RL (Batch RL) allows us to learn policies from historical logs (generated by a human operator or an existing heuristic controller) without ever interacting with the live environment.

The Problem Visualized

      +-------------------+
      |  Behavior Policy  |  (Safe, Boring)
      |     (Pi_Beta)     |
      +-------------------+
             /     \
            /       \   <--- Overlap Area (Safe to Learn)
           /         \
+-------------------------+
|      Target Policy      |  (Aggressive, Unknown)
|       (Pi_Theta)        |
+-------------------------+
           \         /
            \       /   <--- OOD Area (Danger Zone!)
             \     /
              \   /
  • Behavior Policy ($\pi_{\beta}$): The policy that generated the historical data (e.g., the existing rule-based system).
  • Target Policy ($\pi_{\theta}$): The new neural network we want to evaluate.

If $\pi_{\theta}$ suggests an action $a$ that was never taken by $\pi_{\beta}$, we have no way to know the reward. We are flying blind. This is known as the Distribution Shift problem.

Log Everything!!

For Offline RL to work, your production logger MUST record the following (a minimal log-record sketch follows this list):

  1. State ($s_t$): The features seen.
  2. Action ($a_t$): The action taken.
  3. Reward ($r_t$): The outcome.
  4. Propensity ($\pi_{\beta}(a_t|s_t)$): The probability that the old policy assigned to the chosen action.
    • Without the propensity, importance-sampling-based OPE is impossible.
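
Putting those four fields together, here is a minimal sketch of a log record in Rust. It assumes serde and serde_json are available; the LoggedStep struct and its field names are illustrative, not a prescribed schema.

use serde::Serialize;

#[derive(Serialize)]
struct LoggedStep {
    episode_id: String,
    timestamp: i64,
    state: Vec<f64>,   // s_t: the features seen
    action_id: usize,  // a_t: the action taken
    reward: f64,       // r_t: the outcome
    propensity: f64,   // pi_beta(a_t|s_t): required for importance sampling later
    is_terminal: bool,
}

fn log_step(step: &LoggedStep) {
    // In production this would go to your event bus or Parquet sink.
    println!("{}", serde_json::to_string(step).unwrap());
}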

Off-Policy Evaluation (OPE)

How do we estimate $V(\pi_{\theta})$ using only data from $\pi_{\beta}$?

1. Importance Sampling (IS)

We re-weight the historical rewards based on how likely the new policy would have taken the same actions.

$$ V_{IS}(\pi_\theta) = \frac{1}{N} \sum_{i=1}^N \left( \prod_{t=0}^T \frac{\pi_\theta(a_t|s_t)}{\pi_\beta(a_t|s_t)} \right) \sum_{t=0}^T \gamma^t r_t $$

  • The Problem: The product of ratios (the importance weight) has very high variance. If $\pi_\theta$ differs substantially from $\pi_\beta$, the weights explode toward infinity or collapse to zero.
  • Effective Sample Size (ESS): When the weights degenerate, the ESS drops toward 1 and you are effectively estimating from a single trajectory (see the sketch below).
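
A small ESS diagnostic, as a sketch: it uses the standard $(\sum_i w_i)^2 / \sum_i w_i^2$ form, which matches the $N/(1+\mathrm{Var}(w))$ form in the Glossary when the weights are normalized to mean 1; the function name is ours.

/// Effective sample size of a set of trajectory-level importance weights.
/// ESS = (sum w)^2 / (sum w^2); equals N when all weights are equal.
pub fn effective_sample_size(weights: &[f64]) -> f64 {
    let sum: f64 = weights.iter().sum();
    let sum_sq: f64 = weights.iter().map(|w| w * w).sum();
    if sum_sq == 0.0 { 0.0 } else { (sum * sum) / sum_sq }
}

If the ESS is a small fraction of the number of logged trajectories, the IS estimate is dominated by a handful of trajectories and should not be trusted.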

Python Implementation (Reference)

import numpy as np

def estimate_is_python(trajectory, target_policy, gamma=0.99):
    # Ordinary (trajectory-wise) IS: the full importance weight rho multiplies
    # the entire discounted return, exactly as in the V_IS formula above.
    rho = 1.0
    discounted_return = 0.0
    for t, step in enumerate(trajectory):
        target_prob = target_policy.prob(step.state, step.action)
        rho *= target_prob / max(step.behavior_prob, 1e-6)  # guard against zero propensity
        discounted_return += (gamma ** t) * step.reward
    return rho * discounted_return

def estimate_is_dataset(trajectories, target_policy, gamma=0.99):
    # V_IS averages the per-trajectory estimates over the N logged trajectories.
    return np.mean([estimate_is_python(t, target_policy, gamma) for t in trajectories])

Rust Implementation: Robust Estimator (PDIS)

We implement Per-Decision Importance Sampling (PDIS), which exploits the fact that the reward at step $t$ depends only on the actions taken up to step $t$: each reward is weighted only by the importance ratios accumulated so far, which reduces variance relative to ordinary IS.

#![allow(unused)]
fn main() {
use std::f64;

#[derive(Debug, Clone)]
struct Step {
    state: Vec<f64>,
    action: usize,
    reward: f64,
    behavior_prob: f64, // pi_beta(a|s) from logs
}

#[derive(Debug, Clone)]
struct Trajectory {
    steps: Vec<Step>
}

// Target Policy Interface
trait Policy {
    fn prob(&self, state: &[f64], action: usize) -> f64;
}

pub struct PDISEstimator {
    gamma: f64,
    max_weight: f64, // Clipping
}

impl PDISEstimator {
    pub fn estimate(&self, traj: &Trajectory, target_policy: &impl Policy) -> f64 {
        let mut v = 0.0;
        let mut rho = 1.0; // Cumulative Importance Weight

        for (t, step) in traj.steps.iter().enumerate() {
            let target_prob = target_policy.prob(&step.state, step.action);
            
            // Avoid division by zero
            let b_prob = step.behavior_prob.max(1e-6);
            
            let weight = target_prob / b_prob;
            rho *= weight;
            
            // Safety Clipping (Critical for Production)
            if rho > self.max_weight {
                rho = self.max_weight;
            }
            
            v += rho * (self.gamma.powi(t as i32) * step.reward);
            
            // Optimization: If rho is effectively 0, stop trajectory
            if rho < 1e-6 {
                break;
            }
        }
        v
    }
}
}
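
A minimal usage sketch, assuming the types above are in scope in the same module (the estimator's fields are private, so construction must happen module-side); UniformPolicy and the toy trajectory are purely illustrative.

// A trivial target policy that assigns equal probability to each of `n_actions` actions.
struct UniformPolicy { n_actions: usize }

impl Policy for UniformPolicy {
    fn prob(&self, _state: &[f64], _action: usize) -> f64 {
        1.0 / self.n_actions as f64
    }
}

fn demo() {
    let traj = Trajectory {
        steps: vec![
            Step { state: vec![0.1, 0.2], action: 1, reward: 1.0, behavior_prob: 0.8 },
            Step { state: vec![0.3, 0.4], action: 0, reward: 0.5, behavior_prob: 0.6 },
        ],
    };
    let estimator = PDISEstimator { gamma: 0.99, max_weight: 100.0 };
    let value = estimator.estimate(&traj, &UniformPolicy { n_actions: 2 });
    println!("PDIS estimate for this trajectory: {value:.4}");
}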

Conservative Q-Learning (CQL)

Standard Q-Learning (DQN) fails offline because it overestimates the values of Out-Of-Distribution (OOD) actions (the “optimizer’s curse”). It sees a gap in the data and assumes “maybe there’s gold there!”.

Conservative Q-Learning (CQL) adds a penalty term to lower the Q-values of OOD actions.

$$ L(\theta) = L_{DQN}(\theta) + \alpha \cdot \left( \mathbb{E}_{a \sim \pi_\theta}[Q(s,a)] - \mathbb{E}_{a \sim \pi_\beta}[Q(s,a)] \right) $$

  • Interpretation: “If the behavior policy didn’t take action A, assume action A is bad unless proven otherwise.”

Rust CQL Loss Implementation

#![allow(unused)]
fn main() {
// Conservative Q-Learning Loss in Rust (Conceptual)
// Assumes use of a Tensor library like `candle` or `tch`
// Using pseudo-tensor syntax for clarity

pub fn cql_loss(
    q_values: &Tensor, 
    actions: &Tensor, 
    rewards: &Tensor, 
    next_q: &Tensor
) -> Tensor {
    // 1. Standard Bellman Error (DQN)
    // Target = r + gamma * max_a Q(s', a)
    // (Terminal masking and target-network detachment are omitted for brevity.)
    let target = rewards + 0.99 * next_q.max_dim(1).0;
    
    // Pred = Q(s, a)
    let pred_q = q_values.gather(1, actions);
    
    let bellman_error = (pred_q - target).pow(2.0).mean();
    
    // 2. CQL Conservative Penalty
    // Minimize Q for random actions (push down OOD)
    // Maximize Q for data actions (keep true data high)
    
    let log_sum_exp_q = q_values.logsumexp(1); // Soft maximum over the action dimension
    let data_q = pred_q;
    
    // Loss = Bellman + alpha * (logsumexp(Q) - Q_data)
    let cql_penalty = (log_sum_exp_q - data_q).mean();
    
    let alpha = 5.0; // Penalty weight
    bellman_error + alpha * cql_penalty
}
}

Dataset Curation Pipeline

Garbage In, Garbage Out is amplified in Offline RL. We need a robust parser to turn raw logs into Trajectories.

Log Schema (Parquet):

{
  "fields": [
    {"name": "episode_id", "type": "string"},
    {"name": "timestamp", "type": "int64"},
    {"name": "state_json", "type": "string"},
    {"name": "action_id", "type": "int32"},
    {"name": "reward", "type": "float"},
    {"name": "propensity_score", "type": "float"},
    {"name": "is_terminal", "type": "boolean"}
  ]
}

Rust Parser:

#![allow(unused)]
fn main() {
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

pub fn load_dataset(path: &str) -> Vec<Trajectory> {
    let file = File::open(path).expect("Log file not found");
    let reader = SerializedFileReader::new(file).unwrap();
    
    let mut trajectories = Vec::new();
    let mut current_traj = Trajectory { steps: Vec::new() };
    
    // Iterate over rows, group them by episode_id, sort by timestamp,
    // and close out a Trajectory whenever is_terminal is reached.
    // A real implementation also needs error handling and schema validation.
    
    trajectories
}
}
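
To illustrate the curation step independently of the Parquet reader API, here is a sketch of the grouping logic. It assumes the rows have already been decoded into a hypothetical LogRow struct mirroring the schema above, and that the Step and Trajectory types from the PDIS section are in scope; terminal-flag handling is omitted.

use std::collections::BTreeMap;

// Hypothetical decoded form of one Parquet row (field names mirror the schema above).
struct LogRow {
    episode_id: String,
    timestamp: i64,
    state: Vec<f64>,
    action_id: usize,
    reward: f64,
    propensity_score: f64,
}

fn group_into_trajectories(mut rows: Vec<LogRow>) -> Vec<Trajectory> {
    // Stable ordering: by episode, then by time.
    rows.sort_by(|a, b| (&a.episode_id, a.timestamp).cmp(&(&b.episode_id, b.timestamp)));

    let mut by_episode: BTreeMap<String, Vec<Step>> = BTreeMap::new();
    for row in rows {
        // Rows without a valid propensity cannot be used for importance sampling.
        if row.propensity_score <= 0.0 {
            continue;
        }
        by_episode.entry(row.episode_id).or_default().push(Step {
            state: row.state,
            action: row.action_id,
            reward: row.reward,
            behavior_prob: row.propensity_score,
        });
    }
    by_episode.into_values().map(|steps| Trajectory { steps }).collect()
}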

The OPE Dashboard

Your MLOps dashboard for RL shouldn’t just show a training curve. It should show:

  1. ESS (Effective Sample Size): “We effectively have 500 trajectories worth of data for this new policy.” If ESS < 100, do not deploy.
  2. Coverage: “The new policy explores 80% of the state space covered by the historical logs.”
  3. Lower Bound: “With 95% confidence, the new policy is no worse than the baseline.” (A bootstrap sketch for computing such a bound follows this list.)
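
One common way to get such a lower bound is a percentile bootstrap over the per-trajectory estimates. A minimal sketch, assuming you already have one PDIS estimate per trajectory; the tiny xorshift generator is only there to keep the example dependency-free.

/// Percentile-bootstrap lower confidence bound on the mean of per-trajectory
/// OPE estimates. With alpha = 0.05 this is (roughly) a 95% lower bound.
pub fn bootstrap_lower_bound(estimates: &[f64], n_boot: usize, alpha: f64) -> f64 {
    assert!(!estimates.is_empty() && n_boot > 0);
    let n = estimates.len();
    let mut means = Vec::with_capacity(n_boot);
    let mut seed: u64 = 0x9E3779B97F4A7C15; // fixed seed: deterministic, dependency-free

    for _ in 0..n_boot {
        let mut sum = 0.0;
        for _ in 0..n {
            // xorshift64 step, used as a cheap index for resampling with replacement
            seed ^= seed << 13;
            seed ^= seed >> 7;
            seed ^= seed << 17;
            sum += estimates[(seed % n as u64) as usize];
        }
        means.push(sum / n as f64);
    }
    means.sort_by(|a, b| a.partial_cmp(b).unwrap());
    means[((alpha * n_boot as f64) as usize).min(n_boot - 1)]
}

Report this bound next to the raw point estimate, and only consider deployment when the bound clears the baseline’s value.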

Visualizing Propensity Overlap (Python)

# scripts/plot_overlap.py
import matplotlib.pyplot as plt
import numpy as np

def plot_propensity(pi_beta_probs, pi_theta_probs):
    plt.figure(figsize=(10, 6))
    plt.hist(pi_beta_probs, bins=50, alpha=0.5, label='Behavior Policy')
    plt.hist(pi_theta_probs, bins=50, alpha=0.5, label='Target Policy')
    plt.title("Propensity Score Overlap")
    plt.xlabel("Probability of Action")
    plt.ylabel("Count")
    plt.legend()
    plt.grid(True, alpha=0.3)
    # Save
    plt.savefig("overlap.png")

# If the histograms don't overlap, OPE is invalid.
# You are trying to estimate regions where you have no data.

Glossary

  • Behavior Policy: The policy that generated the logs.
  • Target Policy: The policy we want to evaluate.
  • OPE (Off-Policy Evaluation): Estimating value without interacting.
  • Importance Sampling: Weighting samples by $\pi_\theta / \pi_\beta$.
  • CQL (Conservative Q-Learning): Algorithm that penalizes OOD Q-values.
  • ESS (Effective Sample Size): $N / (1 + Var(w))$. Measure of data quality.

Summary Checklist

  1. Log Probabilities: Your logging system MUST log probability_of_action ($\pi_\beta(a|s)$). Without this, you cannot do importance sampling.
  2. Overlap: Ensure $\pi_\theta$ has support where $\pi_\beta$ has support.
  3. Warm Start: Initialize your policy with Behavioral Cloning (BC) on the logs before fine-tuning with RL. This ensures you start within the safe distribution.
  4. Clip Weights: Always use Weighted (self-normalized) Importance Sampling (WIS) or clipped IS to control variance (a WIS sketch follows this checklist).
  5. Reward Model: Train a regressor from (state, action) to expected reward to enable Doubly Robust (DR) estimation.
  6. Negative Sampling: Ensure your dataset includes failures; otherwise the agent will overestimate how safe its actions are.
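
As referenced in item 4, here is a minimal sketch of weighted (self-normalized) importance sampling over trajectory-level weights and returns; the function name is ours.

/// Weighted (self-normalized) importance sampling:
/// V_WIS = sum_i(w_i * G_i) / sum_i(w_i), where w_i is the importance weight of
/// trajectory i and G_i its discounted return. Slightly biased, but far lower
/// variance than plain IS when the weights are heavy-tailed.
pub fn wis_estimate(weights: &[f64], returns: &[f64]) -> f64 {
    assert_eq!(weights.len(), returns.len());
    let weight_sum: f64 = weights.iter().sum();
    if weight_sum == 0.0 {
        return 0.0; // no overlap between pi_theta and pi_beta in this batch
    }
    let weighted_return: f64 = weights.iter().zip(returns).map(|(w, g)| w * g).sum();
    weighted_return / weight_sum
}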