37.3. Concept Drift in Sequential Data

In static learning (like ImageNet), a cat is always a cat. In Time Series, the rules of the game change constantly.

  • Inflation: $100 in 1990 is not $100 in 2024.
  • Seasonality: Sales in December are always higher than in November. That’s not drift; that’s the calendar.
  • Drift: Suddenly, your “Monday Model” fails because a competitor launched a promo.

This chapter defines rigorous methods to detect real concept drift ($P(y|X)$ changes) while ignoring expected temporal variations ($P(X)$ changes).

The Definition of Sequential Drift

Drift is a change in the joint distribution $P(X, y)$. We decompose it:

  1. Covariate Shift ($P(X)$ changes): The inputs change. (e.g., more users are browsing on mobile). The model might still work if the function hasn’t changed.
  2. Concept Drift ($P(y|X)$ changes): The relationship changes. (e.g., users on mobile used to buy X, now they buy Y). The model is broken.
  3. Label Shift ($P(y)$ changes): The output distribution changes (e.g., suddenly everyone is buying socks).

The Seasonality Trap

If your error rate spikes every weekend, you don’t have drift; you have a missing feature (is_weekend).

  • Rule: Always de-trend and de-seasonalize your metric before checking for drift.
  • Method: Run drift detection on the Residuals ($y - \hat{y}$), not the raw values.
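As a concrete sketch of the de-seasonalizing step (pure Rust, illustrative only): subtract the per-day-of-week mean from the residual stream before feeding it to any detector, so the weekly cycle cannot masquerade as drift.

```rust
/// Remove a weekly cycle from a residual stream by subtracting the
/// per-day-of-week mean (day = index % 7). Illustrative sketch only.
fn deseasonalize_weekly(residuals: &[f64]) -> Vec<f64> {
    let mut sums = [0.0f64; 7];
    let mut counts = [0usize; 7];
    for (i, r) in residuals.iter().enumerate() {
        sums[i % 7] += r;
        counts[i % 7] += 1;
    }
    residuals
        .iter()
        .enumerate()
        .map(|(i, r)| r - sums[i % 7] / counts[i % 7].max(1) as f64)
        .collect()
}

fn main() {
    // Two weeks of residuals with a strong weekend spike (days 5 and 6).
    let resid = vec![1.0, 1.0, 1.0, 1.0, 1.0, 9.0, 9.0,
                     1.0, 1.0, 1.0, 1.0, 1.0, 9.0, 9.0];
    let adjusted = deseasonalize_weekly(&resid);
    // After adjustment the weekend spike is gone: everything is ~0.
    assert!(adjusted.iter().all(|x| x.abs() < 1e-9));
    println!("weekly cycle removed");
}
```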

Detection Algorithms

We need algorithms that process data streams with $O(1)$ time per item and constant memory (or at worst logarithmic memory: ADWIN's bucket structure is $O(\log W)$).

1. Page-Hinkley Test (PHT)

A cumulative-sum (CUSUM) variant that monitors deviations from the running mean. Good for detecting Abrupt Changes in the mean.

Rust Implementation:

#![allow(unused)]
fn main() {
/// The Page-Hinkley Test for detecting abrupt upward mean shifts.
///
/// # Arguments
/// * `threshold` - The alarm threshold (lambda): allowed deviation before triggering.
/// * `alpha` - The magnitude of change tolerated (often written delta in the literature).
pub struct PageHinkley {
    mean: f64,
    sum: f64,
    count: usize,
    cumulative_sum: f64,
    min_cumulative_sum: f64,
    threshold: f64,
    alpha: f64,
}

impl PageHinkley {
    pub fn new(threshold: f64, alpha: f64) -> Self {
        Self {
            mean: 0.0,
            sum: 0.0,
            count: 0,
            cumulative_sum: 0.0,
            min_cumulative_sum: 0.0,
            threshold,
            alpha,
        }
    }

    pub fn update(&mut self, x: f64) -> bool {
        self.count += 1;
        // Update the running mean from the simple running sum
        self.sum += x;
        self.mean = self.sum / self.count as f64;
        
        // Update CUSUM
        // m_T = Sum(x_t - mean - alpha)
        let diff = x - self.mean - self.alpha;
        self.cumulative_sum += diff;
        
        // Track the minimum CUSUM seen so far
        if self.cumulative_sum < self.min_cumulative_sum {
            self.min_cumulative_sum = self.cumulative_sum;
        }

        // Check Trigger
        // PH_T = m_T - M_T
        let drift = self.cumulative_sum - self.min_cumulative_sum;
        if drift > self.threshold {
            // Reset all detector state so monitoring restarts on the new concept
            self.sum = 0.0;
            self.count = 0;
            self.mean = 0.0;
            self.cumulative_sum = 0.0;
            self.min_cumulative_sum = 0.0;
            return true;
        }
        false
    }
}
}
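A minimal usage sketch. The function below restates the same update rule in compact form so the example is self-contained; the `threshold` and `alpha` values are illustrative, not canonical:

```rust
/// Compact Page-Hinkley: returns the index of the first alarm, if any.
/// `thresh` (lambda) and `delta` (tolerated magnitude) are illustrative.
fn first_alarm(stream: &[f64], thresh: f64, delta: f64) -> Option<usize> {
    let (mut sum, mut cum, mut min_cum) = (0.0f64, 0.0f64, 0.0f64);
    for (i, &x) in stream.iter().enumerate() {
        sum += x;
        let mean = sum / (i as f64 + 1.0);
        cum += x - mean - delta;        // m_T = sum(x_t - mean - delta)
        min_cum = min_cum.min(cum);     // M_T = min over t of m_t
        if cum - min_cum > thresh {     // PH_T = m_T - M_T
            return Some(i);
        }
    }
    None
}

fn main() {
    // 50 samples at mean 0.0, then an abrupt jump to 10.0 at t = 50.
    let stream: Vec<f64> = (0..100)
        .map(|i| if i < 50 { 0.0 } else { 10.0 })
        .collect();
    // The alarm fires immediately after the shift, never before it.
    assert_eq!(first_alarm(&stream, 5.0, 0.5), Some(50));
    println!("first alarm at t = 50");
}
```

On a noise-free stream the alarm is deterministic; with real data, `threshold` trades detection delay against false alarms.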

2. ADWIN (Adaptive Windowing)

The gold standard for streaming drift detection.

  • Concept: Maintain a window $W$ of varying length.
  • Cut: If the mean of two sub-windows (Head and Tail) differs significantly (using Hoeffding bounds), drop the Tail (old data).
  • Output: The window size itself is a proxy for stability. If the window shrinks, drift has occurred.

Rust Implementation (Full with Buckets):

#![allow(unused)]
fn main() {
use std::collections::VecDeque;

#[derive(Debug, Clone)]
struct Bucket {
    total: f64,
    variance: f64,
    count: usize, // usually 2^k
}

pub struct Adwin {
    delta: f64, // Confidence parameter (e.g. 0.002)
    width: usize, // Current window width (number of items)
    total: f64, // Sum of items in window
    buckets: VecDeque<Bucket>, // Exponential Histogram structure
    max_buckets: usize, // Max buckets per row of bit-map
}

impl Adwin {
    /// Create a new ADWIN detector with specific confidence delta.
    pub fn new(delta: f64) -> Self {
        Self {
            delta,
            width: 0,
            total: 0.0,
            buckets: VecDeque::new(),
            max_buckets: 5,
        }
    }
    
    /// Insert a new value into the stream and return true if drift is detected.
    pub fn insert(&mut self, value: f64) -> bool {
        self.width += 1;
        self.total += value;
        // Add new bucket of size 1 at the head
        self.buckets.push_front(Bucket { total: value, variance: 0.0, count: 1 });
        
        // Compress buckets: If we have too many small buckets, merge them.
        self.compress_buckets();
        
        // Check for Drift: Do we need to drop the tail?
        self.check_drift()
    }
    
    fn check_drift(&mut self) -> bool {
        let mut drift_detected = false;
        // Iterate through all possible cut points, oldest first.
        // If |mu0 - mu1| > epsilon, cut the tail (old data).
        // Epsilon = sqrt(ln(4/delta) / (2m)), where m is the harmonic mean
        // of the two sub-window sizes (simplified ADWIN0 bound).
        // This effectively auto-sizes the window to the length of the
        // "stable" concept.
        let mut changed = true;
        while changed && self.buckets.len() > 1 {
            changed = false;
            let (mut n0, mut sum0) = (0.0, 0.0);
            for i in (1..self.buckets.len()).rev() {
                // Tail window: buckets[i..]; head window: buckets[..i].
                n0 += self.buckets[i].count as f64;
                sum0 += self.buckets[i].total;
                let n1 = self.width as f64 - n0;
                let sum1 = self.total - sum0;
                let (mu0, mu1) = (sum0 / n0, sum1 / n1);
                let m = 1.0 / (1.0 / n0 + 1.0 / n1);
                let eps = ((4.0 / self.delta).ln() / (2.0 * m)).sqrt();
                if (mu0 - mu1).abs() > eps {
                    // Drop the oldest bucket and re-scan the window.
                    if let Some(old) = self.buckets.pop_back() {
                        self.width -= old.count;
                        self.total -= old.total;
                    }
                    drift_detected = true;
                    changed = true;
                    break;
                }
            }
        }
        drift_detected
    }
    
    fn compress_buckets(&mut self) {
        // Exponential histogram: at most `max_buckets` buckets per size 2^k.
        // When a run of equal-sized buckets overflows, merge its two oldest
        // members into one bucket of size 2^{k+1}.
        let mut i = 0;
        while i < self.buckets.len() {
            let size = self.buckets[i].count;
            let mut j = i;
            while j < self.buckets.len() && self.buckets[j].count == size {
                j += 1;
            }
            if j - i > self.max_buckets {
                // Merge buckets j-2 and j-1 (the two oldest of this size).
                let old = self.buckets.remove(j - 1).unwrap();
                let b = &mut self.buckets[j - 2];
                let (na, nb) = (b.count as f64, old.count as f64);
                let (ma, mb) = (b.total / na, old.total / nb);
                // Parallel-variance merge of the squared-deviation sums.
                b.variance += old.variance + na * nb / (na + nb) * (ma - mb).powi(2);
                b.total += old.total;
                b.count += old.count;
                i = j - 2; // the merged bucket may now extend the next run
            } else {
                i = j;
            }
        }
    }
}
}
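The cut condition itself can be exercised in isolation. A minimal sketch of the simplified (ADWIN0-style) Hoeffding bound, taking sub-window sizes and means directly:

```rust
/// Simplified ADWIN0 cut condition: should the window be split between a
/// sub-window of size n0 / mean mu0 and one of size n1 / mean mu1?
fn hoeffding_cut(n0: f64, mu0: f64, n1: f64, mu1: f64, delta: f64) -> bool {
    let m = 1.0 / (1.0 / n0 + 1.0 / n1); // harmonic mean of the sizes
    let eps = ((4.0 / delta).ln() / (2.0 * m)).sqrt();
    (mu0 - mu1).abs() > eps
}

fn main() {
    // Two 500-sample sub-windows whose means differ by 1.0: cut the tail.
    assert!(hoeffding_cut(500.0, 0.0, 500.0, 1.0, 0.002));
    // Means differ by only 0.05: within the Hoeffding bound, keep the window.
    assert!(!hoeffding_cut(500.0, 0.0, 500.0, 0.05, 0.002));
    println!("cut condition behaves as expected");
}
```

Note how the bound tightens as the sub-windows grow: the same mean gap that is noise at n = 50 becomes a confident cut at n = 5000.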

3. Kolmogorov-Smirnov (KS) Window Test

For distribution drift (not just a mean shift), we compare the empirical CDFs of two windows.

Rust Implementation:

#![allow(unused)]
fn main() {
/// Two-sample Kolmogorov-Smirnov test: returns an approximate p-value.
fn ks_test(window_ref: &[f64], window_curr: &[f64]) -> f64 {
    // 1. Sort both windows
    let mut a = window_ref.to_vec();
    let mut b = window_curr.to_vec();
    a.sort_by(|x, y| x.partial_cmp(y).unwrap());
    b.sort_by(|x, y| x.partial_cmp(y).unwrap());

    // 2. Compute the max distance between the empirical CDFs:
    // D = max |F_ref(x) - F_curr(x)|
    let (n1, n2) = (a.len() as f64, b.len() as f64);
    let (mut i, mut j, mut d) = (0usize, 0usize, 0.0f64);
    while i < a.len() && j < b.len() {
        if a[i] <= b[j] { i += 1 } else { j += 1 }
        d = d.max((i as f64 / n1 - j as f64 / n2).abs());
    }

    // 3. Asymptotic p-value via the Kolmogorov distribution series.
    // If p < 0.05, the distributions are different.
    let en = (n1 * n2 / (n1 + n2)).sqrt();
    let lambda = (en + 0.12 + 0.11 / en) * d;
    (1..=100)
        .map(|k| 2.0 * (-1.0f64).powi(k - 1) * (-2.0 * (k as f64 * lambda).powi(2)).exp())
        .sum::<f64>()
        .clamp(0.0, 1.0)
}
}

Robust Statistics: Filtering Noise

Before you run Page-Hinkley, you must remove outliers: a single outlier (a $1B sale) will falsely trigger the drift logic. Do not use the Mean and StdDev to find them. Use the Median and MAD (Median Absolute Deviation).

$$ MAD = median(|x_i - median(X)|) $$

Rust + Polars Pre-Filter:

#![allow(unused)]
fn main() {
// In your ingest pipeline (sketch; assumes polars' `abs` feature is enabled)
let median = series.median().unwrap();
let mad = (&series - median).abs()?.median().unwrap();

// Bound at median +/- 3 * MAD (filter both tails, not just the upper one)
let upper = median + 3.0 * mad;
let lower = median - 3.0 * mad;

// Filter
let mask = series.lt(upper)? & series.gt(lower)?;
let clean_series = series.filter(&mask)?;
}
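The same pre-filter in dependency-free Rust, to make the MAD logic explicit (names are illustrative):

```rust
/// Median of a slice (sorts a copy; assumes non-empty, finite values).
fn median(data: &[f64]) -> f64 {
    let mut v = data.to_vec();
    v.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = v.len();
    if n % 2 == 1 { v[n / 2] } else { (v[n / 2 - 1] + v[n / 2]) / 2.0 }
}

/// Keep only points within k * MAD of the median (two-sided filter).
fn mad_filter(data: &[f64], k: f64) -> Vec<f64> {
    let med = median(data);
    let deviations: Vec<f64> = data.iter().map(|x| (x - med).abs()).collect();
    let mad = median(&deviations);
    data.iter().copied().filter(|x| (x - med).abs() <= k * mad).collect()
}

fn main() {
    // One "$1B sale"-style outlier among ordinary points.
    let raw = vec![1.0, 2.0, 3.0, 4.0, 100.0];
    // median = 3, deviations = [2, 1, 0, 1, 97], MAD = 1
    assert_eq!(mad_filter(&raw, 3.0), vec![1.0, 2.0, 3.0, 4.0]);
    println!("outlier removed");
}
```

Unlike a mean/StdDev filter, the MAD here is completely unaffected by the outlier's magnitude, which is exactly why it is safe to compute the bounds from contaminated data.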

Operationalizing Drift Detection

Drift isn’t just a metric; it’s a trigger in your Airflow/Dagster pipeline.

The “Drift Protocol”

  1. Monitor: Run ADWIN on the error stream (residual $y - \hat{y}$), not just the raw data.
  2. Trigger: If ADWIN shrinks window significantly (drift detected):
    • Level 1 (Warning): Alert the Slack channel.
    • Level 2 (Critical): Trigger an automated retraining job on the recent window (the data after the cut).
    • Level 3 (Fallback): Switch to a simpler, more robust model (e.g., switch from LSTM to Exponential Smoothing) until retraining completes.
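The escalation ladder above can be sketched as a small state function keyed on how much the ADWIN window shrank. The shrink-ratio thresholds (0.9, 0.5, 0.1) are illustrative assumptions, not canonical values:

```rust
#[derive(Debug, PartialEq)]
enum Action { None, Warn, Retrain, Fallback }

/// Map the ADWIN window shrink ratio to a response level.
/// Thresholds are illustrative assumptions, to be tuned per pipeline.
fn escalate(width_before: usize, width_after: usize) -> Action {
    let ratio = width_after as f64 / width_before.max(1) as f64;
    if ratio >= 0.9 {
        Action::None        // window stable
    } else if ratio >= 0.5 {
        Action::Warn        // Level 1: alert the Slack channel
    } else if ratio >= 0.1 {
        Action::Retrain     // Level 2: retrain on the post-cut window
    } else {
        Action::Fallback    // Level 3: switch to the simpler fallback model
    }
}

fn main() {
    assert_eq!(escalate(1000, 980), Action::None);
    assert_eq!(escalate(1000, 700), Action::Warn);
    assert_eq!(escalate(1000, 200), Action::Retrain);
    assert_eq!(escalate(1000, 20), Action::Fallback);
    println!("escalation ladder ok");
}
```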

Airflow DAG for Drift Checks

# drift_dag.py
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def check_drift_job():
    # Load recent residuals from BigQuery
    # Run Page-Hinkley
    # If drift: Trigger "retrain_dag"
    pass

with DAG(
    "drift_monitor_daily",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),  # required for the scheduler
    catchup=False,
) as dag:
    check = PythonOperator(
        task_id="check_drift",
        python_callable=check_drift_job,
    )

Vector Drift (High-Dimensional Drift)

Drift doesn’t just happen in scalars. In NLP/Embeddings, the mean vector can shift.

  • Scenario: A news site trains on 2020 news, where “Corona” meant the virus. By 2022, usage has drifted back toward the beer. The embedding distribution moves with it.
  • Metric: Cosine Similarity between the “Training Centroid” and “Inference Centroid”.

Rust Implementation using ndarray:

#![allow(unused)]
fn main() {
use ndarray::{Array1, Array2};

pub fn check_vector_drift(ref_embeddings: &Array2<f32>, curr_embeddings: &Array2<f32>) -> f32 {
    // 1. Compute Centroids
    let ref_mean: Array1<f32> = ref_embeddings.mean_axis(ndarray::Axis(0)).unwrap();
    let curr_mean: Array1<f32> = curr_embeddings.mean_axis(ndarray::Axis(0)).unwrap();
    
    // 2. Compute Cosine Similarity
    let dot = ref_mean.dot(&curr_mean);
    let norm_a = ref_mean.dot(&ref_mean).sqrt();
    let norm_b = curr_mean.dot(&curr_mean).sqrt();
    
    dot / (norm_a * norm_b)
}
// If similarity < 0.9, TRIGGER RETRAIN.
}
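The same check without ndarray, as a dependency-free sketch of the centroid-cosine metric:

```rust
/// Column-wise mean (centroid) of a set of embedding vectors.
fn centroid(rows: &[Vec<f64>]) -> Vec<f64> {
    let dim = rows[0].len();
    let mut c = vec![0.0; dim];
    for row in rows {
        for (i, v) in row.iter().enumerate() {
            c[i] += v;
        }
    }
    c.iter().map(|v| v / rows.len() as f64).collect()
}

/// Cosine similarity between two vectors.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f64 = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb: f64 = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

fn main() {
    // Training embeddings point one way; inference embeddings another.
    let train = vec![vec![1.0, 0.0], vec![1.0, 0.2]];
    let drifted = vec![vec![0.0, 1.0], vec![0.2, 1.0]];
    let sim = cosine(&centroid(&train), &centroid(&drifted));
    // Near-orthogonal centroids: similarity far below the 0.9 alert line.
    assert!(sim < 0.9);
    assert!(cosine(&centroid(&train), &centroid(&train)) > 0.999);
    println!("centroid similarity = {:.3}", sim);
}
```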

Drift in Recommender Systems (Special Case)

Recommender systems suffer from a unique type of drift: Model-Induced Drift (Feedback Loop).

  • Model: Shows User X only Sci-Fi movies.
  • User X: Only watches Sci-Fi movies (because that’s all they see).
  • Data: Training data becomes 100% Sci-Fi.
  • Result: Model becomes narrower and narrower (Echo Chamber).

Detection: Monitor the Entropy of the recommended catalog:

$$ H(X) = - \sum_x p(x) \log p(x) $$

If Entropy drops, your model is collapsing.
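The entropy monitor is a few lines of Rust; the genre-count histogram below is an illustrative input:

```rust
/// Shannon entropy (in nats) of a recommendation count histogram.
fn catalog_entropy(counts: &[usize]) -> f64 {
    let total: f64 = counts.iter().sum::<usize>() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0) // 0 * log(0) is taken as 0
        .map(|&c| {
            let p = c as f64 / total;
            -p * p.ln()
        })
        .sum()
}

fn main() {
    // Healthy catalog: recommendations spread evenly over four genres.
    let diverse = catalog_entropy(&[25, 25, 25, 25]);
    // Echo chamber: everything shown is Sci-Fi.
    let collapsed = catalog_entropy(&[100, 0, 0, 0]);
    assert!((diverse - 4.0f64.ln()).abs() < 1e-9); // ln(4), the maximum
    assert!(collapsed.abs() < 1e-9);               // zero: full collapse
    println!("diverse = {:.3} nats, collapsed = {:.3}", diverse, collapsed);
}
```

A steady downward trend in this number, rather than any absolute value, is the collapse signal.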

Simulation Studio

To test our drift detectors, we need a way to generate synthetic drift.

# scripts/simulate_drift.py
import numpy as np
import matplotlib.pyplot as plt

def generate_stream(n_samples=1000, drift_point=500, drift_type='sudden'):
    data = []
    mu = 0.0
    for i in range(n_samples):
        if i > drift_point:
            if drift_type == 'sudden':
                mu = 5.0
            elif drift_type == 'gradual':
                mu += 0.01
        
        data.append(np.random.normal(mu, 1.0))
    return data

# Generate
stream = generate_stream()
plt.plot(stream)
plt.title("Simulated Concept Drift")
plt.savefig("drift_sim.png")

Visualization Dashboard

The most useful plot for debugging drift is the ADWIN Window Size.

  • Stable: Window size grows linearly (accumulation of evidence).
  • Drift: Window size crashes to 0 (forgetting history).
import matplotlib.pyplot as plt

def plot_adwin_debug(events):
    # events: list of (timestep, window_size) tuples
    x, y = zip(*events)
    plt.plot(x, y)
    plt.xlabel("Time")
    plt.ylabel("Window Size (N)")
    plt.title("ADWIN Stability Monitor")

Anatomy of a Drift Event: A Timeline

| Day | Event | Metric (MAPE) | Detector Signal | Action |
|-----|-------|---------------|-----------------|--------|
| 0-30 | Normal Ops | 5% | Stable | None |
| 31 | Competitor Promo | 5% | Stable | None |
| 32 | Impact Begins | 7% | P-Value dropping | Warning |
| 33 | Full Impact | 15% | Drift Detected | Trigger Retrain |
| 34 | Fallback Model | 8% | Stable | Deployment |
| 35 | New Model Live | 5% | Reset | Restore |

Glossary

  • Virtual Drift: The input distribution $P(X)$ changes, but the decision boundary $P(y|X)$ remains the same. (Also called Covariate Shift).
  • Real Drift: The decision boundary changes. The model is effectively wrong.
  • Sudden Drift: A step change (e.g., specific law change).
  • Gradual Drift: A slow evolution (e.g., inflation, aging machinery).
  • Survival Analysis: Estimating “Time to Drift”.
  • Bifurcation: When a single concept splits into two (e.g., “Phone” splits into “Smartphone” and “Feature Phone”).

Literature Review

  • Gama et al. (2004) - Learning with Drift Detection (the DDM detector).
  • Bifet & Gavaldà (2007) - Learning from Time-Changing Data with Adaptive Windowing (ADWIN).
  • Bifet et al. (2018) - Machine Learning for Data Streams with Practical Examples in MOA.
  • Ditzler et al. (2015) - Learning in Nonstationary Environments: A Survey.

Troubleshooting Guide

| Symptom | Diagnosis | Fix |
|---------|-----------|-----|
| High False Positive Rate | Threshold too sensitive | Decrease delta (confidence) in ADWIN (e.g., 0.002 -> 0.0001). |
| Drift Detected Every Day | Seasonality | You are detecting the daily cycle. De-trend data first. |
| Laggy Detection | Window too large | Use Page-Hinkley for faster responses to mean shifts. |
| OOM | Infinite Memory | Ensure ADWIN buckets are merging correctly (logarithmic growth). |

Summary Checklist

  1. De-seasonalize: Never run drift detection on raw data if it has daily/weekly cycles.
  2. Monitor Residuals: The most important signal is “Is the model error increasing?”, not “Is the input mean changing?”.
  3. Automate: Drift detection without automated retraining is just noise. Connect the detected signal to the training API.
  4. Differentiate: Classify alerts as “Data Quality” (upstream fix) vs “Concept Drift” (model fix).
  5. Robustness: Use MAD, not StdDev, to ignore transient outliers.
  6. Windowing: Use ADWIN for auto-sizing windows; do not guess a fixed window size (like 30 days).
  7. Visualization: Dashboard the “ADWIN Window Size” metric. A shrinking window is the earliest warning sign of instability.
  8. Vector Check: For embeddings, check Cosine Similarity Centroid Drift annually.