40.3. Temporal GNNs & Dynamic Graphs

Status: Draft | Version: 1.0.0 | Tags: #GNN, #Temporal, #TGN, #Rust, #Streaming | Author: MLOps Team


Table of Contents

  1. The Myth of the Static Graph
  2. Dynamic Graph Types: Discrete vs Continuous
  3. TGN (Temporal Graph Networks) Architecture
  4. Rust Implementation: Temporal Memory Module
  5. Streaming Architecture: Feature Stores for TGNs
  6. Training Strategies: Snapshot vs Event-Based
  7. Handling Late-Arriving Events
  8. Infrastructure: Kafka to Graph Store
  9. Troubleshooting: TGN Training Issues
  10. Future Trends: Causal GNNs
  11. MLOps Interview Questions
  12. Glossary
  13. Summary Checklist

Prerequisites

Before diving into this chapter, ensure you have the following installed:

  • Rust: 1.70+
  • Kafka: For streaming edge events.
  • PyTorch: torch-geometric-temporal.

The Myth of the Static Graph

Most GNN tutorials assume the graph $G$ is fixed. In reality:

  • Users follow new people (Edge Addition).
  • Users unfollow (Edge Deletion).
  • Users change their profile (Node Feature Update).
  • Transactions happen at timestamp $t$ (Temporal Edge).

If you train a static GCN on yesterday’s graph, it will fail to detect fraud happening now. You need Dynamic Graphs.


Dynamic Graph Types: Discrete vs Continuous

1. Discrete Time Dynamic Graphs (DTDG)

Snapshots taken at fixed intervals ($t_0, t_1, t_2$).

  • $G_0$: Graph snapshot at Monday 00:00.
  • $G_1$: Graph snapshot at Tuesday 00:00.
  • Model: stacked snapshots fed to a 3D-GCN, or an RNN over per-snapshot GCN embeddings (e.g., EvolveGCN).
  • Pros: Easy to implement (just stack the adjacency matrices).
  • Cons: Loss of fine-grained timing. Was the transaction at 00:01 or 23:59?

2. Continuous Time Dynamic Graphs (CTDG)

A stream of events: $(u, v, t, feat)$.

  • Event 1: Alice buys Bread (10:00).
  • Event 2: Bob sends Money (10:05).
  • Model: TGN (Temporal Graph Networks).
  • Pros: Exact timing. Immediate updates.
  • Cons: Complex state management.
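
These two representations map onto different data structures. A minimal Rust sketch (type and field names are illustrative, not taken from any particular library):

use std::collections::HashMap;

/// DTDG: one static snapshot per interval, stored as adjacency lists.
struct Snapshot {
    taken_at: f64,                         // e.g. Monday 00:00 as a Unix timestamp
    adjacency: HashMap<usize, Vec<usize>>, // node -> neighbors at that instant
}

/// CTDG: a single, time-ordered stream of edge events.
struct EdgeEvent {
    src: usize,
    dst: usize,
    timestamp: f64,
    features: Vec<f32>, // edge features, e.g. transaction amount
}

/// DTDG model input: a sequence of snapshots.
type DiscreteGraph = Vec<Snapshot>;
/// CTDG model input: a timestamp-sorted event log.
type ContinuousGraph = Vec<EdgeEvent>;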

TGN (Temporal Graph Networks) Architecture

TGN is a state-of-the-art framework for CTDGs. It introduces a Memory Module: a state vector $s_u(t)$ maintained for each node $u$.

Components:

  1. Memory: A vector $s_u$ storing the node’s history.
  2. Message Function: $m_u(t) = \text{MLP}(s_u, s_v, \Delta t, e_{uv})$.
  3. Memory Updater: $s_u(t) = \text{GRU}(m_u(t), s_u(t^-))$, where $t^-$ is the time of the node's previous update.
  4. Embedding Module: $z_u(t) = \text{GNN}(s_u(t), \text{neighbors})$.

Time Encoding: Neural networks cannot consume raw timestamps directly. We map elapsed time to a vector using Harmonic Encoding (like Transformer positional encodings), with frequencies $\omega_i$ that are typically learned: $$ \Phi(t) = [\cos(\omega_1 t), \sin(\omega_1 t), \dots, \cos(\omega_d t), \sin(\omega_d t)] $$
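
A minimal sketch of this encoding in Rust; the frequencies here are fixed on a log scale, whereas TGN-style models typically learn them:

/// Harmonic time encoding: map an elapsed time `dt` to a 2*d vector
/// [cos(w_1*dt), sin(w_1*dt), ..., cos(w_d*dt), sin(w_d*dt)].
fn harmonic_encode(dt: f64, d: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(2 * d);
    for i in 0..d {
        // Log-spaced frequencies, analogous to Transformer positional encodings.
        let omega = 1.0 / 10_000f64.powf(i as f64 / d as f64);
        out.push((omega * dt).cos() as f32);
        out.push((omega * dt).sin() as f32);
    }
    out
}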


Rust Implementation: Temporal Memory Module

In Python, looping over 1 million events to update GRU states is slow ($O(N)$ Python overhead). We implement the Memory Updater in Rust using concurrent hash maps.

Project Structure

tgn-memory/
├── Cargo.toml
└── src/
    └── lib.rs

Cargo.toml:

[package]
name = "tgn-memory"
version = "0.1.0"
edition = "2021"

[dependencies]
dashmap = "5.5"  # Concurrent HashMap
ndarray = "0.15" # Math
rayon = "1.7"
serde = { version = "1.0", features = ["derive"] }

src/lib.rs:

//! Temporal Graph Memory Module.
//! Manages state vectors for millions of nodes with thread-safety.
//! Designed to handle high-throughput event streams from Kafka.

use dashmap::DashMap;
use ndarray::Array1;
use std::sync::Arc;

const MEMORY_DIM: usize = 128;

#[derive(Clone, Debug)]
pub struct NodeMemory {
    /// The hidden state of the node (e.g. from GRU)
    /// Represents the "compressed history" of the node.
    pub state: Array1<f32>,
    /// Last time this node was updated. 
    /// Used to calculate dt for time encoding.
    pub last_update: f64,
}

impl NodeMemory {
    pub fn new() -> Self {
        Self {
            state: Array1::zeros(MEMORY_DIM),
            last_update: 0.0,
        }
    }
}

pub struct TemporalMemory {
    /// Thread-safe Map: NodeID -> MemoryState
    /// DashMap uses sharding to reduce lock contention.
    store: Arc<DashMap<usize, NodeMemory>>,
}

impl TemporalMemory {
    pub fn new() -> Self {
        Self {
            store: Arc::new(DashMap::new()),
        }
    }

    /// Process a batch of events (Source, Dest, Timestamp).
    /// Updates the memory of both the source and the destination node.
    pub fn update_batch(&self, events: Vec<(usize, usize, f64)>) {
        // Parallel updates are tricky because the endpoints of different events
        // may collide (data race). DashMap locks per shard, so concurrent access
        // cannot panic, but a real TGN must process events strictly in time order,
        // which limits parallelism within a batch unless node collisions are excluded.
        for &(src, dst, t) in &events {
            self.touch(src, t);
            self.touch(dst, t);
        }
    }

    /// Apply the (mock) memory update for a single node at time `t`.
    fn touch(&self, node: usize, t: f64) {
        self.store.entry(node)
            .and_modify(|mem| {
                // In a full TGN, dt feeds the harmonic time encoding.
                let _dt = t - mem.last_update;
                // Mock GRU: decay the old state and add a constant input.
                // s(t) = s(t-1) * 0.9 + 0.1
                mem.state.mapv_inplace(|x| x * 0.9 + 0.1);
                mem.last_update = t;
            })
            .or_insert_with(|| {
                let mut mem = NodeMemory::new();
                mem.last_update = t;
                mem
            });
    }

    pub fn get_state(&self, node: usize) -> Option<Array1<f32>> {
        self.store.get(&node).map(|m| m.state.clone())
    }
}
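
A small test sketches the intended usage of this module (assuming events arrive pre-sorted by timestamp):

#[test]
fn memory_updates_on_events() {
    let memory = TemporalMemory::new();
    // (src, dst, timestamp) triples, pre-sorted by timestamp.
    memory.update_batch(vec![(1, 2, 10.0), (2, 3, 10.5), (1, 3, 11.0)]);
    assert!(memory.get_state(1).is_some());
    assert!(memory.get_state(42).is_none()); // unseen node has no memory yet
}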

Streaming Architecture: Feature Stores for TGNs

TGN requires reading both the “Memory” ($S_u$) and the “Raw Features” ($X_u$). Since the Memory changes with every interaction, it must live in memory (Redis or an in-process store such as the DashMap above).

[ Kafka: Edge Stream ]
       |
       v
[ Rust TGN Ingestor ]
       |
       +---> [ Update Memory $S_u$ (In-RAM DashMap) ]
       |
       +---> [ Append to Graph Store (CSR) ]
       |
       +---> [ Publish "Enriched Event" ] ---> [ Inference Service ]

Consistency: The Inference Service must use the exact same Memory state $S_u$ that the model expects. This means the Ingestor is also the State Server.
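
A hedged sketch of that ingestor loop, using a standard-library channel as a stand-in for the Kafka consumer; the graph-store append and downstream publish are placeholders, not real APIs:

use std::sync::mpsc::Receiver;

struct EdgeEvent {
    src: usize,
    dst: usize,
    timestamp: f64,
}

/// Single-threaded ingestor: every event updates the in-RAM memory first,
/// then would be appended to the graph store and republished downstream.
fn run_ingestor(rx: Receiver<EdgeEvent>, memory: &TemporalMemory) {
    for ev in rx {
        memory.update_batch(vec![(ev.src, ev.dst, ev.timestamp)]);
        // append_to_graph_store(&ev);   // placeholder: CSR append
        // publish_enriched(&ev);        // placeholder: produce to the inference topic
    }
}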


Training Strategies: Snapshot vs Event-Based

1. Backprop Through Time (BPTT)

Like training an RNN. Split the event stream into batches of 200 events. Run TGN. Update Weights.

  • Problem: Gradients vanish over long time horizons.

2. Snapshot Training (Discrete Approximation)

Accumulate events for 1 hour. Build a static graph. Train GraphSAGE.

  • Problem: Latency. User A acted 55 minutes ago, but the model only sees it now.

Recommendation: Use TGN for critical “Attack Detection” (milliseconds matter). Use snapshot GraphSAGE for “Friend Recommendation” (daily updates suffice).


Handling Late-Arriving Events

Events in distributed systems arrive out of order. Event A (10:00) can arrive after Event B (10:05). If the TGN updates memory with B and then A arrives, the state is corrupted (a causality violation).

Solutions:

  1. Buffer & Sort: Wait 10 seconds, sort by timestamp, then process.
  2. Optimistic Processing: Process anyway. Accept noise.
  3. Watermarks: Flink-style watermarking. Drop events older than $T_{late}$.
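
A sketch combining options 1 and 3: buffer events in a min-heap keyed by timestamp and release them only once they fall behind a watermark (max observed time minus an allowed lateness). The lateness value and field layout are illustrative:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Buffers out-of-order events and releases them in timestamp order
/// once they fall behind the watermark (max_seen_ts - allowed_lateness).
struct ReorderBuffer {
    heap: BinaryHeap<Reverse<(u64, usize, usize)>>, // (ts_millis, src, dst)
    max_seen_ts: u64,
    allowed_lateness_ms: u64,
}

impl ReorderBuffer {
    fn new(allowed_lateness_ms: u64) -> Self {
        Self { heap: BinaryHeap::new(), max_seen_ts: 0, allowed_lateness_ms }
    }

    /// Push an event; drop it if it is already older than the watermark.
    fn push(&mut self, ts: u64, src: usize, dst: usize) {
        self.max_seen_ts = self.max_seen_ts.max(ts);
        if ts + self.allowed_lateness_ms >= self.max_seen_ts {
            self.heap.push(Reverse((ts, src, dst)));
        }
        // else: too late -> drop (or route to a dead-letter topic).
    }

    /// Pop all events that are now safely behind the watermark, in time order.
    fn drain_ready(&mut self) -> Vec<(u64, usize, usize)> {
        let watermark = self.max_seen_ts.saturating_sub(self.allowed_lateness_ms);
        let mut ready = Vec::new();
        while let Some(Reverse((ts, _, _))) = self.heap.peek().copied() {
            if ts > watermark {
                break;
            }
            let Reverse(event) = self.heap.pop().unwrap();
            ready.push(event);
        }
        ready
    }
}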

Infrastructure: Kafka to Graph Store

A robust setup uses Change Data Capture (CDC) from the core DB to drive the Graph.

# pipeline.yaml
sources:
  - name: transactions_db
    type: postgres-cdc
    
transforms:
  - name: to_edge_format
    query: "SELECT user_id as src, merchant_id as dst, amount as weight, ts FROM stream"

sinks:
  - name: graph_topic
    type: kafka
    topic: edges_v1

The GNN Service consumes edges_v1.
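
The consuming side could look like the following sketch, assuming the rdkafka crate; the broker address, group id, and payload handling are illustrative:

use rdkafka::config::ClientConfig;
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::message::Message;
use std::time::Duration;

fn consume_edges() -> Result<(), Box<dyn std::error::Error>> {
    let consumer: BaseConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092") // illustrative broker address
        .set("group.id", "gnn-service")
        .set("auto.offset.reset", "earliest")
        .create()?;
    consumer.subscribe(&["edges_v1"])?;

    loop {
        // Poll one message at a time; a production ingestor would batch these.
        if let Some(result) = consumer.poll(Duration::from_millis(100)) {
            let msg = result?;
            if let Some(payload) = msg.payload() {
                // Deserialize (src, dst, weight, ts) here and hand it to the
                // TGN ingestor / TemporalMemory from the earlier section.
                println!("edge event: {} bytes", payload.len());
            }
        }
    }
}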


Troubleshooting: TGN Training Issues

Scenario 1: Memory Staleness

  • Symptom: Validation accuracy drops over time.
  • Cause: The “last update time” for many nodes is very old (e.g., inactive users). The TGN behaves erratically for these very large $\Delta t$ values.
  • Fix: Implement a time decay in the Memory Updater. Force the state to zero if $\Delta t > 30$ days (see the sketch below).
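
A sketch of such a decay, written against the NodeMemory state from the earlier module; the 30-day cutoff and the exponential rate are illustrative choices:

const SECONDS_PER_DAY: f64 = 86_400.0;
const STALE_AFTER_DAYS: f64 = 30.0;

/// Apply a time-based decay to a node's memory before using it.
/// `state` is the node's memory vector, `dt_seconds` the time since its last update.
fn decay_memory(state: &mut ndarray::Array1<f32>, dt_seconds: f64) {
    if dt_seconds > STALE_AFTER_DAYS * SECONDS_PER_DAY {
        // Hard reset: treat a long-inactive node like a brand-new node.
        state.fill(0.0);
    } else {
        // Soft exponential decay toward zero as the node goes quiet.
        let decay = (-dt_seconds / (STALE_AFTER_DAYS * SECONDS_PER_DAY)).exp() as f32;
        state.mapv_inplace(|x| x * decay);
    }
}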

Scenario 2: Exploding Gradients

  • Symptom: Loss becomes NaN.
  • Cause: The GRU is unrolled for too many steps (Backprop through 1000 interactions).
  • Fix: Truncated BPTT. Detach gradients after 20 steps.

Scenario 3: Leakage

  • Symptom: Test AUC is 0.99 (suspiciously high).
  • Cause: You are using edges from the future (Target) to update the Memory (Input).
  • Fix: Strict ordering:
    1. Predict Interaction $(u, v, t)$.
    2. Calculate Loss.
    3. Update Memory with $(u, v, t)$. Never update the memory before the prediction and loss have been computed (see the sketch below).
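
A sketch of this ordering against the Rust memory module above; the dot-product score and the hard-coded dimension of 128 (MEMORY_DIM) stand in for the real embedding module and loss:

use ndarray::Array1;

/// Score an interaction using the *pre-event* memory, then update the memory.
/// Updating first would leak the target edge into its own input.
fn predict_then_update(memory: &TemporalMemory, src: usize, dst: usize, t: f64) -> f32 {
    let s_u = memory.get_state(src).unwrap_or_else(|| Array1::zeros(128));
    let s_v = memory.get_state(dst).unwrap_or_else(|| Array1::zeros(128));
    let score = s_u.dot(&s_v);                // 1-2. predict; loss is computed upstream
    memory.update_batch(vec![(src, dst, t)]); // 3. only now fold the event into memory
    score
}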

Future Trends: Causal GNNs

Current GNNs look at correlations: “People who bought X also bought Y.” Causal GNNs ask: “If I recommend X, will they buy Y?” This requires Intervention Modeling (Do-calculus on Graphs). This is the next frontier for “Actionable RecSys”.


MLOps Interview Questions

  1. Q: Why not just put “Time” as a feature in a static GNN? A: A static GNN aggregates all neighbors equally. It cannot distinguish “Neighbor from 2010” vs “Neighbor from 2024”. TGN’s memory module explicitly decays old information.

  2. Q: What is the bottleneck in TGN inference? A: Sequential dependency. To compute $S_u(t)$, you strictly need $S_u(t-1)$. You cannot parallelize processing of a single node’s history. But you can parallelize across different nodes.

  3. Q: How do you evaluate a Dynamic Graph model? A: You cannot use random K-Fold Split. You must use Temporal Split.

    • Train: Jan - Nov.
    • Test: Dec.
    • Eval: Metric (AP/AUC) on future edges.
  4. Q: Explain “Inductive” capability in TGNs. A: New nodes start with empty Memory $S_{new} = \vec{0}$. The model can still process them immediately using their raw features and the interactions they have just participated in. No re-training needed.

  5. Q: What is “Temporal Neighbor Sampling”? A: When aggregating neighbors for a node at time $t$, we only look at interactions in $[t - \delta, t]$. We ignore the future (no leakage) and the very distant past (irrelevant).
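
A minimal sketch of temporal neighbor sampling over a per-node, time-sorted adjacency list, using binary search on the timestamps (function and variable names are illustrative):

/// Return the neighbors of a node whose interactions fall in [t - delta, t],
/// given that `timeline` is sorted by timestamp ascending.
fn temporal_neighbors(timeline: &[(usize, f64)], t: f64, delta: f64) -> &[(usize, f64)] {
    let lo = timeline.partition_point(|&(_, ts)| ts < t - delta);
    let hi = timeline.partition_point(|&(_, ts)| ts <= t);
    &timeline[lo..hi]
}

In practice the returned window is usually also capped to the most recent k interactions before aggregation.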


Glossary

  • TGN (Temporal Graph Network): Architecture combining GNNs and RNNs (Memory).
  • CTDG (Continuous Time Dynamic Graph): Graph defined by a stream of timestamped events.
  • Harmonic Encoding: Using sine/cosine functions to represent continuous time values.
  • Snapshot: A static view of the graph at a specific point in time.
  • BPTT (Backpropagation Through Time): Gradient descent method for recurrent networks.

Summary Checklist

  1. Timestamping: Ensure every edge in your DB has a created_at timestamp.
  2. Sorting: Always sort interaction batches by time before feeding to TGN.
  3. State Persistence: Periodically checkpoint the Rust DashMap (Memory) to disk/S3 so you can recover from crashes.
  4. Latency: Measure the “Event-to-Memory-Update” latency. Should be < 100ms.
  5. Validation: Check for “Future Leakage”. Ensure Test Set start time > Train Set end time.
  6. Baselines: Always compare TGN against a simple “Recent Activity” heuristic or static GNN. TGN adds massive complexity; ensure it beats the baseline.