37.3. Concept Drift in Sequential Data
In static learning (like ImageNet), a cat is always a cat. In Time Series, the rules of the game change constantly.
- Inflation: $100 in 1990 is not $100 in 2024.
- Seasonality: Sales in December are always higher than in November. That’s not drift; that’s the calendar.
- Drift: Suddenly, your “Monday Model” fails because a competitor launched a promo.
This chapter defines rigorous methods to detect real concept drift ($P(y|X)$ changes) while ignoring expected temporal variations ($P(X)$ changes).
The Definition of Sequential Drift
Drift is a change in the joint distribution $P(X, y)$. We decompose it:
- Covariate Shift ($P(X)$ changes): The inputs change. (e.g., more users are browsing on mobile). The model might still work if the function hasn’t changed.
- Concept Drift ($P(y|X)$ changes): The relationship changes. (e.g., users on mobile used to buy X, now they buy Y). The model is broken.
- Label Shift ($P(y)$ changes): The output distribution changes (e.g., suddenly everyone is buying socks).
The Seasonality Trap
If your error rate spikes every weekend, you don’t have drift; you have a missing feature (is_weekend).
Rule: Always de-trend and de-seasonalize your metric before checking for drift.
Method: Run drift detection on the Residuals ($y - \hat{y}$), not the raw values.
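As a minimal sketch of that rule, assume a weekly cycle and use a seasonal-naive baseline, where last week’s value stands in for $\hat{y}$ (the function name and period are illustrative):

```rust
/// Residuals against a seasonal-naive forecast: r_t = x_t - x_{t-period}.
/// Subtracting the value one full season back removes the weekly cycle,
/// so a drift detector downstream sees only genuine level changes.
fn seasonal_residuals(series: &[f64], period: usize) -> Vec<f64> {
    series
        .iter()
        .skip(period)        // current values x_t, t >= period
        .zip(series.iter())  // paired with x_{t-period}
        .map(|(curr, prev)| curr - prev)
        .collect()
}
```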
Detection Algorithms
We need algorithms that process data streams in constant (or at most logarithmic) memory, with $O(1)$ amortized time per item.
1. Page-Hinkley Test (PHT)
A cumulative-sum (CUSUM) style test that monitors deviation from the running mean. Good for detecting Abrupt Changes in the mean.
Rust Implementation:
```rust
/// The Page-Hinkley Test for detecting abrupt mean shifts.
///
/// # Arguments
/// * `threshold` - The allowed deviation (lambda) before triggering.
/// * `alpha` - The magnitude of change tolerated (the "forgetting" factor).
pub struct PageHinkley {
    sum: f64,
    count: usize,
    mean: f64,
    cumulative_sum: f64,
    min_cumulative_sum: f64,
    threshold: f64,
    alpha: f64,
}

impl PageHinkley {
    pub fn new(threshold: f64, alpha: f64) -> Self {
        Self {
            sum: 0.0,
            count: 0,
            mean: 0.0,
            cumulative_sum: 0.0,
            min_cumulative_sum: 0.0,
            threshold,
            alpha,
        }
    }

    /// Feed one observation; returns `true` when drift is detected.
    pub fn update(&mut self, x: f64) -> bool {
        // Update the running mean (simple sum; Welford's method also works)
        self.count += 1;
        self.sum += x;
        self.mean = self.sum / self.count as f64;

        // CUSUM: m_T = Sum_t (x_t - mean_t - alpha)
        self.cumulative_sum += x - self.mean - self.alpha;

        // Track the minimum CUSUM seen so far: M_T = min_t m_t
        if self.cumulative_sum < self.min_cumulative_sum {
            self.min_cumulative_sum = self.cumulative_sum;
        }

        // Trigger when PH_T = m_T - M_T exceeds the threshold
        if self.cumulative_sum - self.min_cumulative_sum > self.threshold {
            // Reset ALL running state so the detector restarts on the new concept
            self.sum = 0.0;
            self.count = 0;
            self.mean = 0.0;
            self.cumulative_sum = 0.0;
            self.min_cumulative_sum = 0.0;
            return true;
        }
        false
    }
}
```
2. ADWIN (Adaptive Windowing)
The gold standard for streaming drift detection.
- Concept: Maintain a window $W$ of varying length.
- Cut: If the mean of two sub-windows (Head and Tail) differs significantly (using Hoeffding bounds), drop the Tail (old data).
- Output: The window size itself is a proxy for stability. If the window shrinks, drift has occurred.
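For intuition about the cut rule, the simplified (mean-only) form of the bound can be computed directly; the function name is illustrative:

```rust
/// Simplified ADWIN cut threshold: eps = sqrt( ln(4/delta) / (2m) ),
/// where m is the harmonic mean of the two sub-window sizes.
fn adwin_epsilon(n0: usize, n1: usize, delta: f64) -> f64 {
    let m = 1.0 / (1.0 / n0 as f64 + 1.0 / n1 as f64);
    ((4.0 / delta).ln() / (2.0 * m)).sqrt()
}
```

With two sub-windows of 100 points each and delta = 0.002, the sub-window means must differ by roughly 0.28 (for values scaled to [0, 1], as the Hoeffding bound assumes) before ADWIN cuts.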
Rust Implementation (Exponential Histogram Buckets):

```rust
use std::collections::VecDeque;

// Simplified version of Bifet & Gavaldà's ADWIN2: the full algorithm also
// tracks per-bucket variance for a tighter bound and re-checks cuts in a loop.
#[derive(Debug, Clone)]
struct Bucket {
    total: f64,
    count: usize, // always a power of two: 1, 2, 4, ...
}

pub struct Adwin {
    delta: f64,                // Confidence parameter (e.g. 0.002)
    width: usize,              // Current window width (number of items)
    total: f64,                // Sum of items in the window
    buckets: VecDeque<Bucket>, // Exponential histogram, newest at the front
    max_buckets: usize,        // Max buckets per size class 2^k
}

impl Adwin {
    /// Create a new ADWIN detector with specific confidence delta.
    pub fn new(delta: f64) -> Self {
        Self { delta, width: 0, total: 0.0, buckets: VecDeque::new(), max_buckets: 5 }
    }

    /// Insert a new value into the stream and return true if drift is detected.
    pub fn insert(&mut self, value: f64) -> bool {
        self.width += 1;
        self.total += value;
        // Add new bucket of size 1 at the head
        self.buckets.push_front(Bucket { total: value, count: 1 });
        // Compress buckets: if we have too many small buckets, merge them.
        self.compress_buckets();
        // Check for drift: do we need to drop the tail?
        self.check_drift()
    }

    fn check_drift(&mut self) -> bool {
        let mut drift = false;
        // Try every bucket boundary as a cut point: W = W0 (new) + W1 (old).
        // Cut if |mu0 - mu1| > eps, with eps = sqrt( ln(4/delta) / (2m) ),
        // m = harmonic mean of the sub-window sizes. This auto-sizes the
        // window to the length of the "stable" concept.
        let (mut n1, mut sum1) = (0.0, 0.0); // old (tail) side
        for i in (1..self.buckets.len()).rev() {
            n1 += self.buckets[i].count as f64;
            sum1 += self.buckets[i].total;
            let n0 = self.width as f64 - n1;
            let sum0 = self.total - sum1;
            let (mu0, mu1) = (sum0 / n0, sum1 / n1);
            let m = 1.0 / (1.0 / n0 + 1.0 / n1);
            let eps = ((4.0 / self.delta).ln() / (2.0 * m)).sqrt();
            if (mu0 - mu1).abs() > eps {
                drift = true;
                break;
            }
        }
        if drift {
            // Drop the oldest bucket: the window "forgets" the stale concept.
            if let Some(old) = self.buckets.pop_back() {
                self.width -= old.count;
                self.total -= old.total;
            }
        }
        drift
    }

    fn compress_buckets(&mut self) {
        // Keep at most `max_buckets` buckets of each size 2^k. On overflow,
        // merge the two OLDEST buckets of that size into one of size 2^{k+1}.
        let mut i = 0;
        while i < self.buckets.len() {
            let size = self.buckets[i].count;
            let mut j = i;
            while j < self.buckets.len() && self.buckets[j].count == size {
                j += 1;
            }
            if j - i > self.max_buckets {
                let old = self.buckets.remove(j - 1).unwrap();
                self.buckets[j - 2].total += old.total;
                self.buckets[j - 2].count += old.count;
                // Stay at `i`: the merged bucket may overflow the next size class.
            } else {
                i = j;
            }
        }
    }
}
```
3. Kolmogorov-Smirnov (KS) Window Test
For distribution drift (not just a shift in the mean), we compare the empirical CDFs of a reference window and the current window.
Rust Implementation:

```rust
/// Two-sample Kolmogorov-Smirnov test: returns (D statistic, approx. p-value).
/// If p < 0.05, the two windows likely come from different distributions.
fn ks_test(window_ref: &[f64], window_curr: &[f64]) -> (f64, f64) {
    // 1. Sort both windows
    let mut a = window_ref.to_vec();
    let mut b = window_curr.to_vec();
    a.sort_by(|x, y| x.partial_cmp(y).unwrap());
    b.sort_by(|x, y| x.partial_cmp(y).unwrap());
    // 2. D = max |F_ref(x) - F_curr(x)|, via a merge walk over both samples
    let (n, m) = (a.len() as f64, b.len() as f64);
    let (mut i, mut j, mut d) = (0usize, 0usize, 0.0f64);
    while i < a.len() && j < b.len() {
        let (x1, x2) = (a[i], b[j]);
        if x1 <= x2 { i += 1; }
        if x2 <= x1 { j += 1; }
        d = d.max((i as f64 / n - j as f64 / m).abs());
    }
    // 3. Asymptotic p-value: Q(l) = 2 * sum_k (-1)^{k-1} * exp(-2 k^2 l^2)
    let en = (n * m / (n + m)).sqrt();
    let lambda = (en + 0.12 + 0.11 / en) * d;
    let p: f64 = (1..=100)
        .map(|k: i32| 2.0 * (-1.0f64).powi(k - 1) * (-2.0 * ((k * k) as f64) * lambda * lambda).exp())
        .sum();
    (d, p.clamp(0.0, 1.0))
}
```
Robust Statistics: Filtering Noise
Before you run Page-Hinkley, you must remove outliers. A single outlier (one $1B sale) will incorrectly trigger the drift logic.
Do not use Mean and StdDev. Use Median and MAD (Median Absolute Deviation):
$$ \mathrm{MAD} = \mathrm{median}(|x_i - \mathrm{median}(X)|) $$
Rust + Polars Pre-Filter:
```rust
// In your ingest pipeline (Polars API sketch -- exact signatures vary by version)
let median = series.median().unwrap();
let mad = (&series - median).abs()?.median().unwrap();
// 1.4826 * MAD approximates the StdDev for Gaussian data, so this is a ~3-sigma band
let upper = median + 3.0 * 1.4826 * mad;
let lower = median - 3.0 * 1.4826 * mad;
// Two-sided filter: clip crashes as well as spikes
let mask = series.gt(lower)? & series.lt(upper)?;
let clean_series = series.filter(&mask)?;
```
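If Polars is not in the stack, the same statistics take a few lines of plain Rust (sort-based medians are fine at monitoring-window sizes); these helper names are illustrative:

```rust
/// Median of a slice (sorts a copy; fine for small monitoring windows).
fn median(xs: &[f64]) -> f64 {
    let mut v = xs.to_vec();
    v.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = v.len();
    if n % 2 == 1 { v[n / 2] } else { (v[n / 2 - 1] + v[n / 2]) / 2.0 }
}

/// Median Absolute Deviation: median(|x_i - median(X)|).
fn mad(xs: &[f64]) -> f64 {
    let med = median(xs);
    let devs: Vec<f64> = xs.iter().map(|x| (x - med).abs()).collect();
    median(&devs)
}
```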
Operationalizing Drift Detection
Drift isn’t just a metric; it’s a trigger in your Airflow/Dagster pipeline.
The “Drift Protocol”
- Monitor: Run ADWIN on the error stream (residual $y - \hat{y}$), not just the raw data.
- Trigger: If ADWIN shrinks window significantly (drift detected):
- Level 1 (Warning): Alert the Slack channel.
- Level 2 (Critical): Trigger an automated retraining job on the recent window (the data after the cut).
- Level 3 (Fallback): Switch to a simpler, more robust model (e.g., switch from LSTM to Exponential Smoothing) until retraining completes.
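The three levels above can be wired as a small severity-to-action mapper; the thresholds, type, and function names here are illustrative placeholders, not prescriptions:

```rust
#[derive(Debug, PartialEq)]
enum DriftAction {
    None,     // Stable: do nothing
    Warn,     // Level 1: alert the channel
    Retrain,  // Level 2: retrain on the post-cut window
    Fallback, // Level 3: swap in the robust model
}

/// Map a severity score (e.g. the fraction by which the ADWIN window
/// shrank) to an escalation level.
fn escalate(shrink_ratio: f64) -> DriftAction {
    match shrink_ratio {
        r if r < 0.2 => DriftAction::None,
        r if r < 0.5 => DriftAction::Warn,
        r if r < 0.8 => DriftAction::Retrain,
        _ => DriftAction::Fallback,
    }
}
```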
Airflow DAG for Drift Checks
```python
# drift_dag.py
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def check_drift_job():
    # Load recent residuals from BigQuery
    # Run Page-Hinkley
    # If drift: Trigger "retrain_dag"
    pass


with DAG(
    "drift_monitor_daily",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),  # required by Airflow; adjust to taste
    catchup=False,
) as dag:
    check = PythonOperator(
        task_id="check_drift",
        python_callable=check_drift_job,
    )
```
Vector Drift (High-Dimensional Drift)
Drift doesn’t just happen in scalars. In NLP/Embeddings, the mean vector can shift.
- Scenario: A news site trains on 2020 news, when “Corona” meant the virus; by 2022, it means the beer again. The embeddings shift.
- Metric: Cosine Similarity between the “Training Centroid” and “Inference Centroid”.
Rust Implementation using ndarray:
```rust
use ndarray::{Array1, Array2, Axis};

/// Cosine similarity between the centroids of the reference (training)
/// and current (inference) embedding populations.
pub fn check_vector_drift(ref_embeddings: &Array2<f32>, curr_embeddings: &Array2<f32>) -> f64 {
    // 1. Compute centroids (mean vector across rows)
    let ref_mean: Array1<f32> = ref_embeddings.mean_axis(Axis(0)).unwrap();
    let curr_mean: Array1<f32> = curr_embeddings.mean_axis(Axis(0)).unwrap();
    // 2. Compute cosine similarity between the centroids
    let dot = ref_mean.dot(&curr_mean);
    let norm_a = ref_mean.dot(&ref_mean).sqrt();
    let norm_b = curr_mean.dot(&curr_mean).sqrt();
    (dot / (norm_a * norm_b)) as f64
}
// If similarity < 0.9, trigger a retrain.
```
Drift in Recommender Systems (Special Case)
Recommender systems suffer from a unique type of drift: Model-Induced Drift (Feedback Loop).
- Model: Shows User X only Sci-Fi movies.
- User X: Only watches Sci-Fi movies (because that’s all they see).
- Data: Training data becomes 100% Sci-Fi.
- Result: Model becomes narrower and narrower (Echo Chamber).
Detection: Monitor the Entropy of the recommended catalog. $$ H(X) = - \sum p(x) \log p(x) $$ If Entropy drops, your model is collapsing.
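The entropy monitor above is a one-pass computation over the items actually served; a sketch (the function name is illustrative):

```rust
use std::collections::HashMap;

/// Shannon entropy (in nats) of the served-recommendation distribution.
/// A collapsing model pushes this toward 0 (everyone sees the same items).
fn catalog_entropy(recommendations: &[u64]) -> f64 {
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for &item in recommendations {
        *counts.entry(item).or_insert(0) += 1;
    }
    let n = recommendations.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.ln()
        })
        .sum()
}
```

Four items served uniformly give ln 4 ≈ 1.386 nats; a single repeated item gives 0.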
Simulation Studio
To test our drift detectors, we need a way to generate synthetic drift.
```python
# scripts/simulate_drift.py
import numpy as np
import matplotlib.pyplot as plt


def generate_stream(n_samples=1000, drift_point=500, drift_type='sudden'):
    data = []
    mu = 0.0
    for i in range(n_samples):
        if i > drift_point:
            if drift_type == 'sudden':
                mu = 5.0
            elif drift_type == 'gradual':
                mu += 0.01
        data.append(np.random.normal(mu, 1.0))
    return data


# Generate and plot
stream = generate_stream()
plt.plot(stream)
plt.title("Simulated Concept Drift")
plt.savefig("drift_sim.png")
```
Visualization Dashboard
The most useful plot for debugging drift is the ADWIN Window Size.
- Stable: Window size grows linearly (accumulation of evidence).
- Drift: Window size crashes to 0 (forgetting history).
```python
import matplotlib.pyplot as plt


def plot_adwin_debug(events):
    """events: list of (timestep, window_size) tuples."""
    x, y = zip(*events)
    plt.plot(x, y)
    plt.xlabel("Time")
    plt.ylabel("Window Size (N)")
    plt.title("ADWIN Stability Monitor")
```
Anatomy of a Drift Event: A Timeline
| Day | Event | Metric (MAPE) | Detector Signal | Action |
|---|---|---|---|---|
| 0-30 | Normal Ops | 5% | Stable | None |
| 31 | Competitor Promo | 5% | Stable | None |
| 32 | Impact Begins | 7% | P-Value dropping | Warning |
| 33 | Full Impact | 15% | Drift Detected | Trigger Retrain |
| 34 | Fallback Model | 8% | Stable | Deployment |
| 35 | New Model Live | 5% | Reset | Restore |
Glossary
- Virtual Drift: The input distribution $P(X)$ changes, but the decision boundary $P(y|X)$ remains the same. (Also called Covariate Shift).
- Real Drift: The decision boundary changes. The model is effectively wrong.
- Sudden Drift: A step change (e.g., specific law change).
- Gradual Drift: A slow evolution (e.g., inflation, aging machinery).
- Survival Analysis: Estimating “Time to Drift”.
- Bifurcation: When a single concept splits into two (e.g., “Phone” splits into “Smartphone” and “Feature Phone”).
Literature Review
- Gama et al. (2004) - Learning with Drift Detection (the DDM detector).
- Bifet & Gavaldà (2007) - Learning from Time-Changing Data with Adaptive Windowing (ADWIN).
- Bifet et al. (2018) - Machine Learning for Data Streams (the MOA book).
- Ditzler et al. (2015) - Learning in Non-Stationary Environments: A Survey.
Troubleshooting Guide
| Symptom | Diagnosis | Fix |
|---|---|---|
| High False Positive Rate | Threshold too sensitive | Decrease delta in ADWIN (e.g., 0.002 -> 0.0001); a smaller delta demands stronger evidence before cutting. |
| Drift Detected Every Day | Seasonality | You are detecting the daily cycle. De-trend data first. |
| Laggy Detection | Window too large | Use Page-Hinkley for faster responses to mean shifts. |
| OOM | Infinite Memory | Ensure ADWIN buckets are merging correctly (logarithmic growth). |
Summary Checklist
- De-seasonalize: Never run drift detection on raw data if it has daily/weekly cycles.
- Monitor Residuals: The most important signal is “Is the model error increasing?”, not “Is the input mean changing?”.
- Automate: Drift detection without automated retraining is just noise. Connect the detected signal to the training API.
- Differentiate: Classify alerts as “Data Quality” (upstream fix) vs “Concept Drift” (model fix).
- Robustness: Use MAD, not StdDev, to ignore transient outliers.
- Windowing: Use ADWIN for auto-sizing windows; do not guess a fixed window size (like 30 days).
- Visualization: Dashboard the “ADWIN Window Size” metric. A shrinking window is the earliest warning sign of instability.
- Vector Check: For embeddings, check Cosine Similarity Centroid Drift annually.