
37.1. Backtesting Frameworks & Temporal Validation

Time series forecasting is the only domain in Machine Learning where you can perform perfect cross-validation and still fail in production with 100% certainty. The reason is simple: K-Fold Cross-Validation, the bread and butter of generic ML, is fundamentally broken for temporal data. It allows the model to “peek” into the future.

This chapter dismantles traditional validation methods and builds a rigorous Backtesting Framework, implemented in Rust for high-throughput performance.

The Cardinal Sin: Look-Ahead Bias

Imagine you are training a model to predict stock prices. If you use standard 5-Fold CV:

  1. Fold 1: Train on [Feb, Mar, Apr, May], Test on [Jan].
  2. Fold 2: Train on [Jan, Mar, Apr, May], Test on [Feb].

In Fold 1, the model learns from May prices to predict January prices. This is impossible in reality. The model effectively learns “if the price in May is high, the price in January was likely rising,” which is a causal violation.

Rule Zero of Time Series MLOps: Validation sets must always chronologically follow training sets.
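A cheap way to enforce Rule Zero mechanically is to assert it before any split reaches a model. A minimal sketch, assuming chrono dates and mirroring the splitter built later in this chapter (the function name is illustrative):

use chrono::NaiveDate;

/// Panics if a candidate split violates Rule Zero.
fn assert_chronological(train_end: NaiveDate, test_start: NaiveDate) {
    assert!(
        train_end <= test_start,
        "Look-ahead bias: test starts at {} but training data runs until {}",
        test_start,
        train_end
    );
}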

Anatomy of a Leak: The Catalog of Shame

Leaks aren’t always obvious. Here are the most common ones found in production audits:

1. The Global Scaler Leak

  • Scenario: Computing a StandardScaler on the entire dataset before splitting into train/test.
  • Mechanism: The mean and variance of the future (Test Set) are embedded in the scaled values of the past (Train Set).
  • Fix: Fit scalers ONLY on the training split of the current fold (see the sketch below).
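A minimal sketch of a leak-free scaler in plain Rust (the Scaler type is illustrative, not a library API): statistics are estimated on the training slice, then frozen and applied unchanged to the test slice.

/// Standardization parameters estimated from the TRAINING slice only.
struct Scaler {
    mean: f64,
    std: f64,
}

impl Scaler {
    /// Fit on training data only, never on the full series.
    fn fit(train: &[f64]) -> Self {
        let n = train.len().max(1) as f64;
        let mean = train.iter().sum::<f64>() / n;
        let var = train.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
        Scaler { mean, std: var.sqrt().max(f64::EPSILON) }
    }

    /// Apply the frozen training statistics to any slice (train or test).
    fn transform(&self, xs: &[f64]) -> Vec<f64> {
        xs.iter().map(|x| (x - self.mean) / self.std).collect()
    }
}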

2. The Lag Leak

  • Scenario: Creating a lag_7 feature before dropping rows or performing a random split.
  • Mechanism: If you index row $t=100$, its lag_7 comes from $t=93$, which is fine. But if you also have forward_7_day_avg (a common “label” feature) and accidentally include it as an input, you destroy the backtest.
  • Fix: Feature engineering pipelines must strictly refuse to look at $t > T_{now}$ (see the sketch below).
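The backward-only discipline is easy to encode. A minimal sketch in plain Rust (function name is illustrative): positions without enough history become None rather than borrowing values from the future.

/// Build a lag-k feature that only ever looks backward in time.
/// The first k positions have no history, so they are None
/// (to be dropped, never imputed from future rows).
fn lag_feature(y: &[f64], k: usize) -> Vec<Option<f64>> {
    (0..y.len())
        .map(|t| if t >= k { Some(y[t - k]) } else { None })
        .collect()
}

Anything like forward_7_day_avg belongs only on the label side of the pipeline, never among the inputs.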

3. The “Corrected Data” Leak (Vintage Leak)

  • Scenario: Data Engineering fixes a January data error in March.
  • Mechanism: You backtest a model for February using the corrected January data.
  • Reality: The model running in February would have seen the erroneous data, so your backtest is optimistic.
  • Fix: Use a Bi-Temporal Feature Store (Transaction Time vs Event Time).

Backtesting Architectures

Instead of K-Fold, we use Rolling Origin Evaluation (also known as Walk-Forward Validation).

1. Expanding Window (The “Cumulative” Approach)

We fix the starting point and move the validation boundary forward.

  • Split 1: Train [Jan-Mar], Test [Apr]
  • Split 2: Train [Jan-Apr], Test [May]
  • Split 3: Train [Jan-May], Test [Jun]

  • Pros: Utilizes all available historical data. Good for “Global Models” (like transformers) that hunger for data.
  • Cons: Training time grows linearly with each split. Older data (Jan) might be irrelevant if the regime has changed (Concept Drift).

2. Sliding Window (The “Forgetful” Approach)

We fix the window size. As we add a new month, we drop the oldest month.

  • Split 1: Train [Jan-Mar], Test [Apr]
  • Split 2: Train [Feb-Apr], Test [May]
  • Split 3: Train [Mar-May], Test [Jun]

  • Pros: Constant training time. Adapts quickly to new regimes by forgetting obsolete history.
  • Cons: May discard valuable long-term seasonality signals.

3. The “Gap” (Simulating Production Latency)

In real pipelines, data doesn’t arrive instantly.

  • Scenario: You forecast T+1 on Monday morning.
  • Reality: The ETL pipeline for Sunday’s data finishes on Monday afternoon.
  • Constraint: You must predict using data up to Saturday, not Sunday.
  • The Gap: You must insert a 1-step (or N-step) buffer between Train and Test sets to simulate this latency. Failing to do so leads to “optimistic” error metrics that vanish in production.
       [Train Data]       [Gap] [Test Data]
       |------------------|-----|---------|
Day:   0                  99    100       107
Event: [History Available] [ETL] [Forecast]

Metrics that Matter

RMSE is not enough. You need metrics that handle scale differences (selling 10 units vs 10,000 units).

MAPE (Mean Absolute Percentage Error)

$$ MAPE = \frac{100\%}{n} \sum \left| \frac{y - \hat{y}}{y} \right| $$

  • Problem: Explodes if $y=0$. Penalizes over-forecasts more heavily than under-forecasts (the percentage error is unbounded above but capped at 100% for under-forecasts), which biases optimization toward under-forecasting.

SMAPE (Symmetric MAPE)

Bounded between 0% and 200%. Handles zeros better but still biased.

MASE (Mean Absolute Scaled Error)

The Gold Standard. It compares your model’s error to the error of a “Naïve Forecast” (predicting the previous value). $$ MASE = \frac{MAE}{MAE_{naive}} $$

  • MASE < 1: Your model is better than guessing “tomorrow = today”.
  • MASE > 1: Your model is worse than a simple heuristic. Throw it away.
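A minimal sketch of the three point-forecast metrics in plain Rust, assuming y and y_hat are equal-length slices and y_train is the training series used to scale MASE (as in the formula above):

fn mape(y: &[f64], y_hat: &[f64]) -> f64 {
    // Undefined when any y == 0; here we simply skip those points.
    let terms: Vec<f64> = y.iter().zip(y_hat)
        .filter(|(a, _)| **a != 0.0)
        .map(|(a, f)| ((a - f) / a).abs())
        .collect();
    100.0 * terms.iter().sum::<f64>() / terms.len().max(1) as f64
}

fn smape(y: &[f64], y_hat: &[f64]) -> f64 {
    let terms: Vec<f64> = y.iter().zip(y_hat)
        .map(|(a, f)| {
            let denom = (a.abs() + f.abs()) / 2.0;
            if denom == 0.0 { 0.0 } else { (a - f).abs() / denom }
        })
        .collect();
    100.0 * terms.iter().sum::<f64>() / terms.len().max(1) as f64
}

fn mase(y: &[f64], y_hat: &[f64], y_train: &[f64]) -> f64 {
    let mae = y.iter().zip(y_hat).map(|(a, f)| (a - f).abs()).sum::<f64>()
        / y.len().max(1) as f64;
    // Denominator: MAE of the one-step naive forecast on the training series.
    let naive_mae = y_train.windows(2).map(|w| (w[1] - w[0]).abs()).sum::<f64>()
        / y_train.len().saturating_sub(1).max(1) as f64;
    mae / naive_mae.max(f64::EPSILON)
}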

Pinball Loss (Quantile Loss)

For probabilistic forecasts (e.g., “sales will be between 10 and 20”), we use Pinball Loss. $$ L_\tau(y, \hat{y}) = \begin{cases} (y - \hat{y})\,\tau & \text{if } y \ge \hat{y} \\ (\hat{y} - y)(1-\tau) & \text{if } y < \hat{y} \end{cases} $$
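A sketch of the per-observation loss; averaging it over the test window and over the quantiles of interest gives the reported score.

/// Pinball (quantile) loss for a single observation at quantile tau.
fn pinball(y: f64, y_hat: f64, tau: f64) -> f64 {
    if y >= y_hat {
        (y - y_hat) * tau
    } else {
        (y_hat - y) * (1.0 - tau)
    }
}

// Example: at tau = 0.9, an under-forecast is punished 9x harder than an over-forecast:
// pinball(20.0, 10.0, 0.9) = 9.0   vs   pinball(10.0, 20.0, 0.9) = 1.0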

Rust Implementation: High-Performance Backtesting Engine

Backtesting involves training and scoring hundreds of models. Python’s overhead (looping over splits) adds up. We can build a vectorized Backtester in Rust using polars.

Project Structure

To productionize this, we structure the Rust project as a proper crate.

# Cargo.toml
[package]
name = "backtest-cli"
version = "0.1.0"
edition = "2021"

[dependencies]
clap = { version = "4.0", features = ["derive"] }
polars = { version = "0.36", features = ["lazy", "parquet", "serde"] }
chrono = "0.4"
rayon = "1.7"
itertools = "0.12"   # iproduct! for the hyperparameter grid
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
anyhow = "1.0"

The CLI Architecture

A robust backtester is a CLI tool, not a script. It should take a config and output a report.

// main.rs
use chrono::Duration;
use clap::Parser;
use polars::prelude::*;

#[derive(Parser, Debug)]
#[command(author, version, about)]
struct Args {
    /// Path to the parquet file containing the series
    #[arg(short, long)]
    input: String,

    /// Number of folds for cross-validation
    #[arg(short, long, default_value_t = 5)]
    folds: usize,

    /// Gap size in days (latency simulation)
    #[arg(short, long, default_value_t = 1)]
    gap: i64,

    /// Output path for the JSON report
    #[arg(short, long)]
    output: String,
}

fn main() -> anyhow::Result<()> {
    let args = Args::parse();

    // 1. Load data with a Polars LazyFrame for efficiency
    let df = LazyFrame::scan_parquet(&args.input, ScanArgsParquet::default())?
        .collect()?;

    println!("Loaded dataframe with {} rows", df.height());

    // 2. Initialize the cross-validator
    // (See impl details below)
    let cv = SlidingWindow {
        train_size: Duration::days(365),
        test_size: Duration::days(7),
        step: Duration::days(7),
        gap: Duration::days(args.gap),
    };

    // 3. Run the backtest
    // run_backtest(&df, &cv);

    Ok(())
}

The Splitter Trait

use chrono::{NaiveDate, Duration};

pub struct TimeSplit {
    pub train_start: NaiveDate,
    pub train_end: NaiveDate,
    pub test_start: NaiveDate,
    pub test_end: NaiveDate,
}

pub trait CrossValidator {
    fn split(&self, start: NaiveDate, end: NaiveDate) -> Vec<TimeSplit>;
}

pub struct SlidingWindow {
    pub train_size: Duration,
    pub test_size: Duration,
    pub step: Duration,
    pub gap: Duration,
}

impl CrossValidator for SlidingWindow {
    fn split(&self, total_start: NaiveDate, total_end: NaiveDate) -> Vec<TimeSplit> {
        let mut splits = Vec::new();
        let mut current_train_start = total_start;
        
        loop {
            let current_train_end = current_train_start + self.train_size;
            let current_test_start = current_train_end + self.gap;
            let current_test_end = current_test_start + self.test_size;

            if current_test_end > total_end {
                break;
            }

            splits.push(TimeSplit {
                train_start: current_train_start,
                train_end: current_train_end,
                test_start: current_test_start,
                test_end: current_test_end,
            });

            current_train_start += self.step;
        }
        splits
    }
}
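The expanding-window strategy described earlier plugs into the same trait. The ExpandingWindow type below is an illustrative addition (same imports as the block above); only the training origin changes: it stays pinned to the start of the series while the training end advances.

/// Expanding window: training always starts at the beginning of the series.
pub struct ExpandingWindow {
    pub min_train_size: Duration,
    pub test_size: Duration,
    pub step: Duration,
    pub gap: Duration,
}

impl CrossValidator for ExpandingWindow {
    fn split(&self, total_start: NaiveDate, total_end: NaiveDate) -> Vec<TimeSplit> {
        let mut splits = Vec::new();
        let mut current_train_end = total_start + self.min_train_size;

        loop {
            let current_test_start = current_train_end + self.gap;
            let current_test_end = current_test_start + self.test_size;

            if current_test_end > total_end {
                break;
            }

            splits.push(TimeSplit {
                train_start: total_start, // fixed origin: training data accumulates
                train_end: current_train_end,
                test_start: current_test_start,
                test_end: current_test_end,
            });

            current_train_end = current_train_end + self.step;
        }
        splits
    }
}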

Vectorized Evaluation with Polars

Instead of iterating rows, we filter the DataFrame using masks. This leverages SIMD instructions.

use chrono::NaiveDate;
use polars::prelude::*;

/// Evaluate one train/test split. Returns (RMSE, MASE) on the test window.
pub fn evaluate_split(df: &DataFrame, split: &TimeSplit) -> PolarsResult<(f64, f64)> {
    // Polars stores a Date column as days since the Unix epoch (i32),
    // so we convert the split boundaries once and compare on the physical values.
    let epoch = NaiveDate::from_ymd_opt(1970, 1, 1).unwrap();
    let to_days = |d: NaiveDate| (d - epoch).num_days() as i32;

    let date_col = df.column("date")?.date()?;

    // 1. Masking. Train mask: train_start <= date < train_end
    let train_mask =
        date_col.gt_eq(to_days(split.train_start)) & date_col.lt(to_days(split.train_end));
    let train_df = df.filter(&train_mask)?;

    // Test mask: test_start <= date < test_end
    let test_mask =
        date_col.gt_eq(to_days(split.test_start)) & date_col.lt(to_days(split.test_end));
    let test_df = df.filter(&test_mask)?;

    // 2. Train model (ARIMA / XGBoost wrapper).
    // For this example the "model" is simply the training mean.
    let y_train = train_df.column("y")?.f64()?;
    let mean_val = y_train.mean().unwrap_or(0.0);

    // 3. Predict: the forecast for every test point is the training mean.
    let y_true = test_df.column("y")?.f64()?;
    let test_vals: Vec<f64> = y_true.into_iter().flatten().collect();
    let n_test = test_vals.len().max(1) as f64;

    // 4. Calculate metrics on the test window.
    let mae = test_vals.iter().map(|v| (v - mean_val).abs()).sum::<f64>() / n_test;
    let rmse =
        (test_vals.iter().map(|v| (v - mean_val).powi(2)).sum::<f64>() / n_test).sqrt();

    // Naive one-step MAE on the training data scales the MASE.
    let train_vals: Vec<f64> = y_train.into_iter().flatten().collect();
    let naive_mae = train_vals.windows(2).map(|w| (w[1] - w[0]).abs()).sum::<f64>()
        / train_vals.len().saturating_sub(1).max(1) as f64;

    let mase = mae / naive_mae.max(f64::EPSILON); // avoid division by zero

    Ok((rmse, mase))
}

Parallelizing Backtests with Rayon

Since each split is independent, backtesting is embarrassingly parallel.

use chrono::NaiveDate;
use polars::prelude::*;
use rayon::prelude::*;

pub fn run_backtest(df: &DataFrame, cv: &impl CrossValidator) {
    let splits = cv.split(
        NaiveDate::from_ymd_opt(2023, 1, 1).unwrap(),
        NaiveDate::from_ymd_opt(2024, 1, 1).unwrap(),
    );

    // Each split is scored on its own Rayon worker thread.
    // The DataFrame is shared as a read-only reference, so no locking is needed.
    let results: Vec<_> = splits
        .par_iter()
        .map(|split| match evaluate_split(df, split) {
            Ok(res) => Some(res),
            Err(e) => {
                eprintln!("Split failed: {}", e);
                None
            }
        })
        .collect();

    // Aggregate results (mean / std of MASE across folds, report generation, ...)
    let _ = results;
}

Backtesting isn’t just about evaluation; it’s about selection. How do we find the best alpha for Exponential Smoothing? We run the backtest for every combination of parameters.

The Grid Search Loop:

use itertools::iproduct;
use polars::prelude::*;
use rayon::prelude::*;

struct Hyperparams {
    alpha: f64,
    beta: f64,
}

fn grid_search(df: &DataFrame) -> Option<(f64, Hyperparams)> {
    let alphas = vec![0.1, 0.5, 0.9];
    let betas = vec![0.1, 0.3];

    // Cartesian product of the hyperparameter grid
    let grid: Vec<Hyperparams> = iproduct!(alphas, betas)
        .map(|(a, b)| Hyperparams { alpha: a, beta: b })
        .collect();

    // Parallel grid search: each candidate runs a full backtest
    grid.into_par_iter()
        .map(|params| {
            // `run_backtest_for_params` is assumed to run the full rolling
            // backtest for one candidate and return its mean error.
            let score = run_backtest_for_params(df, &params);
            (score, params)
        })
        .min_by(|a, b| a.0.partial_cmp(&b.0).unwrap())
}

Reporting and Visualization

A backtest running in the terminal is opaque. We need visual proof. Since we are using Rust, we can generate a JSON artifact compatible with Python plotting libraries or Vega-Lite.

The JSON Report Schema

{
  "summary": {
    "total_folds": 50,
    "mean_mase": 0.85,
    "std_mase": 0.12,
    "p95_error": 1450.20
  },
  "folds": [
    {
      "train_end": "2023-01-01",
      "test_end": "2023-01-08",
      "mase": 0.78,
      "rmse": 120.5
    },
    ...
  ]
}
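Since serde and serde_json are already in Cargo.toml, the CLI can emit this report directly. A sketch with struct and field names mirroring the JSON keys above (the write_report helper is an assumption, not part of the framework so far):

use serde::Serialize;

#[derive(Serialize)]
struct FoldReport {
    train_end: String, // ISO dates serialize cleanly as strings
    test_end: String,
    mase: f64,
    rmse: f64,
}

#[derive(Serialize)]
struct Summary {
    total_folds: usize,
    mean_mase: f64,
    std_mase: f64,
    p95_error: f64,
}

#[derive(Serialize)]
struct BacktestReport {
    summary: Summary,
    folds: Vec<FoldReport>,
}

fn write_report(report: &BacktestReport, path: &str) -> anyhow::Result<()> {
    let json = serde_json::to_string_pretty(report)?;
    std::fs::write(path, json)?;
    Ok(())
}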

Generating Plots (Python Sidecar)

We recommend a small Makefile step to plot this immediately.

# scripts/plot_backtest.py
import json
import pandas as pd
import altair as alt

data = json.load(open("backtest_report.json"))
df = pd.DataFrame(data["folds"])

chart = alt.Chart(df).mark_line().encode(
    x='train_end:T',
    y='mase:Q',
    tooltip=['mase', 'rmse']
).properties(title="Backtest Consistency Over Time")

chart.save("backtest_consistency.html")

The “Retrain vs Update” Dilemma

In production, do you retrain the whole model every day?

  1. Full Retrain: Expensive. Required for Deep Learning models to learn new high-level features or if the causal structure changes significantly.
  2. Incremental Update (Online Learning): Cheap. Just update the weights with the new gradient. Supported by River (Python) or customized Rust implementations.
  3. Refit: Keep hyperparameters fixed, but re-estimate coefficients (e.g., in ARIMA or Linear Regression).

Recommendation:

  • Weekly: Full Retrain (Hyperparameter Search / Grid Search).
  • Daily: Refit (Update coefficients on new data).
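One way to make this schedule explicit is a small policy type the orchestrator can match on; the enum and function below are purely illustrative, not a library API.

use chrono::{Datelike, NaiveDate, Weekday};

/// Illustrative retraining policy: weekly full retrain, daily refit.
enum UpdateAction {
    FullRetrain, // hyperparameter search + refit
    Refit,       // keep hyperparameters, re-estimate coefficients
}

fn plan_update(today: NaiveDate) -> UpdateAction {
    if today.weekday() == Weekday::Mon {
        UpdateAction::FullRetrain
    } else {
        UpdateAction::Refit
    }
}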

Handling Holidays and Special Events

Time series are driven by calendars.

  • Problem: “Easter” moves every year.
  • Solution: Do not rely on day_of_year features alone. You must join with a “Holiday Calendar” feature store.

Rust Crate: holo (Holidays)

// Pseudo-code: the exact API depends on the holiday crate you choose.
// use holo::Calendar;
//
// let cal = Calendar::US;
// if cal.is_holiday(date) {
//     // Add "is_holiday" feature
// }
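If you prefer zero extra dependencies, a holiday calendar can simply be a set of dates loaded from your feature store. A self-contained sketch (types and names are assumptions, not a crate API):

use chrono::NaiveDate;
use std::collections::HashSet;

/// A holiday calendar as a plain set of dates (e.g., loaded from the feature store).
struct HolidayCalendar {
    dates: HashSet<NaiveDate>,
}

impl HolidayCalendar {
    fn is_holiday(&self, date: NaiveDate) -> bool {
        self.dates.contains(&date)
    }

    /// 1.0 / 0.0 indicator column, ready to join onto the feature frame.
    fn feature(&self, dates: &[NaiveDate]) -> Vec<f64> {
        dates.iter().map(|d| if self.is_holiday(*d) { 1.0 } else { 0.0 }).collect()
    }
}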

Case Study: The Billion Dollar Leak

A major hedge fund once deployed a model that predicted stock prices with 99% accuracy. The bug: They normalized the data using Max(Price) for the entire year.

  • On Jan 1st, if the price was 100 and the Max was 200 (in Dec), the input was 0.5.
  • The model learned “If input is 0.5, buy, because it will double by Dec.”
  • In production, the Max was unknown. They used Max(Jan) = 100. The input became 1.0.
  • The model crashed.

Lesson: Never compute global statistics before splitting.

Troubleshooting Guide

| Error | Cause | Fix |
|---|---|---|
| MASE > 1.0 | Model is worse than a random walk | Check for insufficient history or noisy data. Switch to exponential smoothing. |
| Backtest 99% acc., prod 50% | Leakage | Audit features for forward_fill or lead usage. Check timestamp alignment. |
| Polars OOM | Dataset too large | Use LazyFrame and enable the streaming engine (with_streaming(true)) before collect(). |
| Threads stuck | Rayon deadlock | Ensure no mutex locks are held across thread boundaries in evaluate_split. |

Glossary

  • Horizon: How far into the future we predict (e.g., 7 days).
  • Cutoff: The last timestamp of training data.
  • Gap: The time between Cutoff and the first Prediction.
  • Vintage: The version of the data as it existed at a specific time (before corrections).

Advanced: Cutoff Data Generation

For a truly robust backtest, you shouldn’t just “split” the data. You should reconstruct the “State of the World” at each cutoff point. Why? Retroactive corrections. Data Engineering teams often fix data errors weeks later. “Actually, sales on Jan 1st were 50, not 40.” If you backtest using the corrected value (50), you are cheating. The model running on Jan 2nd only saw 40.

The “Vintage” Data Model: Store every row with (valid_time, record_time).

  • valid_time: When the event happened.
  • record_time: When the system knew about it.

Your backtester must filter record_time <= cutoff_date.
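A sketch of that filter over an in-memory bi-temporal record (struct and field names are assumptions): for each valid_time it keeps only the latest revision that was already recorded at the cutoff.

use chrono::NaiveDate;
use std::collections::BTreeMap;

/// Bi-temporal observation: what happened, and when the system learned about it.
#[derive(Clone)]
struct VintageRow {
    valid_time: NaiveDate,  // when the event happened
    record_time: NaiveDate, // when the system knew about it
    value: f64,
}

/// Reconstruct the "state of the world" as it existed at `cutoff`.
fn as_of(rows: &[VintageRow], cutoff: NaiveDate) -> Vec<VintageRow> {
    let mut latest: BTreeMap<NaiveDate, VintageRow> = BTreeMap::new();
    for row in rows.iter().filter(|r| r.record_time <= cutoff) {
        match latest.get(&row.valid_time) {
            // Keep the existing entry if it is a newer revision.
            Some(existing) if existing.record_time >= row.record_time => {}
            _ => {
                latest.insert(row.valid_time, row.clone());
            }
        }
    }
    latest.into_values().collect()
}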

Advanced: Purged K-Fold

Financial Time Series often have “embargo” periods. If predicting stock returns, the label for $T$ might be known at $T+1$, but the impact of the trade lasts until $T+5$.

  • Purging: We delete samples from the training set whose label look-ahead window overlaps the test set.
  • Embargo: We delete samples immediately following the test set to prevent “correlation leakage” if serial correlation is high.
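A sketch of both rules on integer time indices, assuming the label for index t is realized over [t, t + label_horizon); names and signature are illustrative.

/// Remove training indices whose label window overlaps the test block (purge),
/// plus `embargo` indices immediately after the test block.
fn purge_and_embargo(
    train_idx: &[usize],
    test_start: usize,
    test_end: usize,      // exclusive
    label_horizon: usize, // label for t is realized over [t, t + label_horizon)
    embargo: usize,
) -> Vec<usize> {
    train_idx
        .iter()
        .copied()
        .filter(|&t| {
            let label_end = t + label_horizon;
            let overlaps_test = t < test_end && label_end > test_start; // purge
            let in_embargo = t >= test_end && t < test_end + embargo;   // embargo
            !(overlaps_test || in_embargo)
        })
        .collect()
}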

Conclusion

Backtesting is not just a sanity check; it is the Unit Test of your Forecasting capability. If you cannot reliably reproduce your backtest results, you do not have a model; you have a random number generator. By implementing this rigorous framework in Rust, we achieve the speed necessary to run exhaustive tests (Grid Search, Rolling Windows) on every commit, effectively creating a CI/CD pipeline for Finance.

Further Reading

  • Forecasting: Principles and Practice (Rob Hyndman) - The bible of forecasting metrics.
  • Advances in Financial Machine Learning (Marcos Lopez de Prado) - For details on Purged K-Fold and Embargoes.