40.1. Graph Feature Stores (Topology vs Attributes)
Status: Draft | Version: 1.0.0 | Tags: #GNN, #Graph, #Rust, #Databases | Author: MLOps Team
Table of Contents
- The Anatomy of a Graph in MLOps
- The Separation of Concerns: Topology vs Attributes
- Storage Formats: Adjacency Lists vs CSR
- Rust Implementation: MMap Graph Engine
- System Architecture: The Graph Store
- Partitioning: METIS and Distributed IDs
- Infrastructure: RedisGraph vs Neo4j vs Custom
- Troubleshooting: Common Data Engineering Issues
- Future Trends: Hardware Acceleration
- MLOps Interview Questions
- Glossary
- Summary Checklist
Prerequisites
Before diving into this chapter, ensure you have the following installed:
- Rust: 1.70+ (`memmap2`, `flatbuffers` crates)
- Graph Tool: `metis` (for partitioning benchmarks)
- Python: `networkx` for visualizations.
The Anatomy of a Graph in MLOps
A Graph $G = (V, E)$ consists of:
- Nodes (Vertices): Entities (Users, Items, Transactions).
- Edges (Links): Relationships (Bought, Follows, SentMoney).
- Node Features: $X \in \mathbb{R}^{|V| \times d}$ (Dense vectors e.g. User Embeddings).
- Edge Features: $E_{feat} \in \mathbb{R}^{|E| \times k}$ (Transaction Amount, Timestamp).
The Scale Problem
- Small Graph: 10k nodes. Fits in a Python `dgl` or `PyG` object.
- Medium Graph: 10M nodes. Fits in 64GB RAM.
- Giant Graph: 1 billion nodes, 10 billion edges. DOES NOT FIT IN RAM.
MLOps Challenge: How do you train a GNN on a 1TB Graph when your GPU only has 32GB VRAM?
The Separation of Concerns: Topology vs Attributes
You cannot efficiently serve “Neighbors” and “Features” from the same database query. Topology (finding neighbors) requires pointer chasing; features (dense vectors) require contiguous reads.
The Golden Rule of Graph Ops:
“Store the Topology in an optimized Graph Format (CSR). Store the Features in a dense Key-Value Store or Columnar file.”
[ Graph Service ]
|
| 1. Get Neighbors(Node A) -> [B, C, D]
v
[ Topology Engine (CSR in RAM/MMap) ]
|
| 2. Get Features([B, C, D]) -> Matrix 3x128
v
[ Feature Store (Redis / Parquet / LevelDB) ]
Storage Formats: Adjacency Lists vs CSR
Adjacency List (Naive)
Map of NodeID -> List[NodeID]
- Pros: Easy to modify (add edge).
- Cons: Memory fragmentation. Billions of `Vec` allocations. Cache misses.
CSR (Compressed Sparse Row)
Standard format for High-Performance Computing (HPC) and GNN libraries. Three arrays:
- `row_ptr`: Index where the edges for node $i$ start. (Length: $|V| + 1$.)
- `col_indices`: The neighbors. (Length: $|E|$.)
- `values`: Edge weights. (Length: $|E|$.)
Example: Edges: (0->1), (0->3), (1->2), (2->3)
- `row_ptr`: [0, 2, 3, 4, 4] (Node 0's edges start at 0, Node 1's start at 2…)
- `col_indices`: [1, 3, 2, 3]
Accessing Node $u$:
Neighbors = col_indices[row_ptr[u] .. row_ptr[u+1]]
This is a single contiguous memory slice. Extreme CPU cache locality.
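A quick sanity check of the worked example above, as plain Rust (arrays hard-coded for illustration):

```rust
fn main() {
    // Worked example from above: edges (0->1), (0->3), (1->2), (2->3).
    let row_ptr: [usize; 5] = [0, 2, 3, 4, 4];
    let col_indices: [u64; 4] = [1, 3, 2, 3];

    // Neighbors of node u are the slice col_indices[row_ptr[u]..row_ptr[u+1]].
    let u = 0;
    let neighbors = &col_indices[row_ptr[u]..row_ptr[u + 1]];
    assert_eq!(neighbors, &[1, 3]);
    println!("Neighbors of node {u}: {neighbors:?}");
}
```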
Rust Implementation: MMap Graph Engine
We will build a Read-Only CSR engine backed by Memory Mapped files (OS paging). This allows serving a 100GB graph on a laptop with 16GB RAM (letting the OS manage swapping).
Project Structure
graph-store/
├── Cargo.toml
└── src/
└── lib.rs
Cargo.toml:
[package]
name = "graph-store"
version = "0.1.0"
edition = "2021"
[dependencies]
memmap2 = "0.7"
byteorder = "1.4"
serde = { version = "1.0", features = ["derive"] }
src/lib.rs:
//! High-Performance CSR Graph Store using Memory Mapping.
//! Allows zero-copy access to billion-scale graphs.
use memmap2::Mmap;
use std::fs::File;
use std::path::Path;
use std::slice;
pub struct CSRGraph {
/// Number of nodes
num_nodes: usize,
/// Number of edges
num_edges: usize,
/// Memory mapped arrays for Zero-Copy access
row_ptr_mmap: Mmap,
col_indices_mmap: Mmap,
}
impl CSRGraph {
/// Load a CSR graph from disk.
/// Expects two files: row_ptr.bin and col_idx.bin (little-endian u64).
pub fn load(path: &str) -> std::io::Result<Self> {
let row_ptr_file = File::open(Path::new(path).join("row_ptr.bin"))?;
let col_idx_file = File::open(Path::new(path).join("col_idx.bin"))?;
let row_ptr_mmap = unsafe { Mmap::map(&row_ptr_file)? };
let col_indices_mmap = unsafe { Mmap::map(&col_idx_file)? };
// Safety: We assume the files are valid i64 arrays
// In production, you would add a header with Magic Bytes and Version.
let num_nodes = (row_ptr_mmap.len() / 8) - 1;
let num_edges = col_indices_mmap.len() / 8;
println!("Loaded Graph: {} nodes, {} edges", num_nodes, num_edges);
Ok(Self {
num_nodes,
num_edges,
row_ptr_mmap,
col_indices_mmap,
})
}
/// Get raw slice of row pointers (u64)
fn get_row_ptrs(&self) -> &[u64] {
unsafe {
slice::from_raw_parts(
self.row_ptr_mmap.as_ptr() as *const u64,
self.num_nodes + 1,
)
}
}
/// Get raw slice of column indices (u64)
fn get_col_indices(&self) -> &[u64] {
unsafe {
slice::from_raw_parts(
self.col_indices_mmap.as_ptr() as *const u64,
self.num_edges,
)
}
}
/// The core operation: Get Neighbors
/// Zero-copy, Zero-allocation (returns slice)
/// This function is typically called 1M+ times per second during sampling.
pub fn get_neighbors(&self, node_id: usize) -> &[u64] {
if node_id >= self.num_nodes {
// Graceful handling of out-of-bounds
return &[];
}
// Pointers to the start and end of the neighbor list in the massive array
let ptrs = self.get_row_ptrs();
let start = ptrs[node_id] as usize;
let end = ptrs[node_id + 1] as usize;
// Retrieve the slice
let cols = self.get_col_indices();
// Bounds check (should essentially never fail if file is valid CSR)
if start > end || end > cols.len() {
return &[];
}
&cols[start..end]
}
}
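A minimal usage sketch (the `./graph_data` path and node ID 42 are illustrative; the directory is assumed to contain the two binary files described above):

```rust
fn main() -> std::io::Result<()> {
    // Load the memory-mapped CSR snapshot (zero-copy).
    let graph = graph_store::CSRGraph::load("./graph_data")?;

    // Fetch a neighbor slice; no allocation, just a view into the mmap.
    let neighbors = graph.get_neighbors(42);
    println!("Node 42 has {} neighbors: {:?}", neighbors.len(), neighbors);
    Ok(())
}
```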
System Architecture: The Graph Store
How do we construct these binary files?
# storage-layout.yaml
schema:
nodes:
- user: parquet/users/*.parquet
- item: parquet/items/*.parquet
edges:
- clicks: csv/clicks.csv
ETL Pipeline (Spark/Ray):
- Read Edges from Data Lake.
- Map String IDs (“User_A”) to Integer IDs (0).
- Sort edges by Source Node ID.
- Compute the `row_ptr` prefix sum.
- Write `row_ptr.bin` and `col_idx.bin` (a sketch of these two steps follows below).
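A minimal sketch of the prefix-sum and write steps, assuming edges are already deduplicated, mapped to dense integer IDs, and sorted by source (file names match the loader above; `write_csr` is an illustrative name):

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

/// Write CSR binaries from an edge list sorted by source ID.
/// Assumes dense node IDs in 0..num_nodes and pre-validated edges.
fn write_csr(edges: &[(u64, u64)], num_nodes: usize, dir: &str) -> std::io::Result<()> {
    // Count the out-degree of every node.
    let mut degree = vec![0u64; num_nodes];
    for &(src, _) in edges {
        degree[src as usize] += 1;
    }
    // Prefix sum -> row_ptr (length num_nodes + 1).
    let mut row_ptr = Vec::with_capacity(num_nodes + 1);
    let mut acc = 0u64;
    row_ptr.push(0u64);
    for d in degree {
        acc += d;
        row_ptr.push(acc);
    }
    // Write little-endian u64 arrays, matching the loader's layout.
    let mut rp = BufWriter::new(File::create(format!("{dir}/row_ptr.bin"))?);
    for v in &row_ptr {
        rp.write_all(&v.to_le_bytes())?;
    }
    let mut ci = BufWriter::new(File::create(format!("{dir}/col_idx.bin"))?);
    for &(_, dst) in edges {
        ci.write_all(&dst.to_le_bytes())?;
    }
    Ok(())
}
```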
Partitioning: METIS and Distributed IDs
For Multi-GPU training, we use Graph Partitioning. We want to split the graph into $K$ parts such that the “Edge Cut” (edges going between partitions) is minimized. Why? Because every edge cut requires network communication between GPUs.
Tools:
- METIS: The gold standard for partitioning.
- Hash Partitioning: `node_id % K`. Fast, but terrible locality: neighbors are scattered everywhere.
Distributed ID Mapping:
- Global ID: 64-bit integer.
- Partition ID: top 8 bits.
- Local ID: bottom 56 bits.
This allows simple routing: `Owner(NodeID) = NodeID >> 56` (a sketch follows below).
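A minimal sketch of this ID scheme (constant and function names are illustrative):

```rust
const LOCAL_BITS: u64 = 56;
const LOCAL_MASK: u64 = (1u64 << LOCAL_BITS) - 1;

/// Pack a partition ID (top 8 bits) and local ID (bottom 56 bits).
fn encode(partition: u8, local_id: u64) -> u64 {
    debug_assert!(local_id <= LOCAL_MASK);
    ((partition as u64) << LOCAL_BITS) | local_id
}

/// Route a request: which machine owns this node?
fn owner(global_id: u64) -> u8 {
    (global_id >> LOCAL_BITS) as u8
}

/// Recover the partition-local ID.
fn local(global_id: u64) -> u64 {
    global_id & LOCAL_MASK
}
```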
Infrastructure: RedisGraph vs Neo4j vs Custom
| Solution | Type | Pros | Cons |
|---|---|---|---|
| Neo4j | Transactional DB | Cypher Query Language, ACID | Slow for whole-graph ML sampling. |
| RedisGraph | In-Memory (Matrix) | Fast linear algebra ops | Limited memory (RAM only). |
| DGL/PyG | DL Framework | Built for ML | Not a database. Training only. |
| Custom CSR (Rust) | Static File | Maximum Speed, Zero-Copy | Read-Only. Complex ETL. |
Recommendation: Use Neo4j for transactional updates (“User added friend”). Use Spark to dump efficient CSR files nightly for the training cluster. Use a custom Rust service (like the one above) for low-latency inference.
Troubleshooting: Common Data Engineering Issues
Scenario 1: Node ID Mapping Hell
- Symptom: Embeddings look random. `Node[123]` (User A) is retrieving `Node[123]` (User B)’s vector.
- Cause: You re-generated the integer IDs in a different order (e.g. a non-deterministic Spark shuffle).
- Fix: Always persist the `string_id -> int_id` mapping as an artifact (Parquet file). Use it for inference.
Scenario 2: OOM Loading Graph
- Symptom: `Process killed (OOM)` when loading the graph.
- Cause: You are trying to `read()` the file into a `Vec<u64>`. 10 billion edges × 8 bytes = 80GB of RAM.
- Fix: Use `mmap`. It uses virtual memory and only loads the actively accessed pages.
Scenario 3: Dangling Edges
- Symptom: `get_neighbors` returns ID `999999`, but `num_nodes` is `500`. IndexOutOfBounds panic.
- Cause: Your edge list contains IDs that were filtered out of the node list (e.g. deleted users).
- Fix: Run a strict referential-integrity check in ETL: `assert(edge.dst < num_nodes)` (see the sketch below).
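A minimal ETL validation sketch for that fix (function name is illustrative):

```rust
/// Reject dangling edges before writing the CSR binaries.
fn validate_edges(edges: &[(u64, u64)], num_nodes: u64) -> Result<(), String> {
    for (i, &(src, dst)) in edges.iter().enumerate() {
        if src >= num_nodes || dst >= num_nodes {
            return Err(format!(
                "Dangling edge at row {i}: ({src} -> {dst}), num_nodes = {num_nodes}"
            ));
        }
    }
    Ok(())
}
```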
Future Trends: Hardware Acceleration
Graph processing is memory-bound (Random Access). New hardware is emerging to solve this:
- Graphcore IPUs: Processors with massive on-chip SRAM to store the graph topology, avoiding DRAM latency.
- CXL (Compute Express Link): Allows coherent memory sharing between CPU and GPU, enabling massive (TB-scale) unified memory graphs.
- NVMe-over-Fabrics: Remote direct access to SSDs for “Disk-based GNNs” (e.g., Microsoft’s Marius).
MLOps Interview Questions
- Q: Why not use an Adjacency Matrix? A: Complexity is $O(|V|^2)$. A graph with 1B nodes would require $10^{18}$ bytes (1 exabyte) to store a matrix that is 99.999% zeros. CSR is $O(|V| + |E|)$.
- Q: How do you handle “Super Nodes” (celebrities)? A: Justin Bieber has 100M followers. A GNN aggregating all of his neighbors will OOM. Use Neighbor Sampling (e.g. pick 50 random neighbors) instead of full aggregation.
- Q: What is the difference between Transductive and Inductive GNNs? A: Transductive assumes all nodes are present during training (node classification). Inductive (GraphSAGE) learns to generalize to unseen nodes by learning a function of features + neighbors. MLOps loves Inductive.
- Q: Explain the “MMap” advantage. A: `mmap` allows the kernel to page parts of the file into RAM on demand. If we have cold nodes (never accessed), they stay on disk. This is “Virtual Memory” for graphs.
- Q: How do you update a CSR graph? A: You generally don’t; it’s immutable. To update, use a Log-Structured Merge Tree (LSM) approach: one big read-only CSR plus a small mutable adjacency list (MemTable) for recent updates, with weekly compaction (see the sketch below).
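A hypothetical sketch of that LSM-style hybrid, reusing the `CSRGraph` engine from earlier (struct and field names are illustrative):

```rust
use std::collections::HashMap;
use graph_store::CSRGraph; // the mmap engine from lib.rs above

struct HybridGraph {
    base: CSRGraph,                // big, immutable, memory-mapped snapshot
    delta: HashMap<u64, Vec<u64>>, // small mutable "MemTable" of recent edges
}

impl HybridGraph {
    /// Neighbors = base slice + recent additions. Compaction (e.g. weekly)
    /// merges the delta into a fresh CSR snapshot.
    fn neighbors(&self, node_id: u64) -> Vec<u64> {
        let mut out = self.base.get_neighbors(node_id as usize).to_vec();
        if let Some(extra) = self.delta.get(&node_id) {
            out.extend_from_slice(extra);
        }
        out
    }
}
```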
Glossary
- CSR (Compressed Sparse Row): Memory-efficient graph format using 3 arrays.
- Transductive: Learning on a fixed graph structure.
- Inductive: Learning to generate embeddings for new nodes without retraining.
- METIS: Graph partitioning algorithm.
- Egonet: The subgraph consisting of a central node and its immediate neighbors.
Summary Checklist
- Format: Convert raw CSV edge lists to binary CSR format (RowPtr/ColIdx) for 100x speedup.
- ID Mapping: Create a robust, versioned pipeline for the `UUID -> Int64` mapping.
- Attributes: Store node features in a memory-mapped NumPy file (`.npy`) aligned with node IDs.
- Sampling: Ensure your graph engine supports `get_neighbors(random=True)` for efficient sub-sampling (see the sketch after this checklist).
- Partitioning: If the graph exceeds RAM, use METIS to shard it across machines.
- Validation: Check for “Dangling Edges” (Edge pointing to non-existent Node ID).
- Immutability: Treat Graph Snapshots as immutable artifacts. Don’t mutate in place.
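For the Sampling item, a minimal sketch of uniform neighbor sampling over a CSR neighbor slice (assumes the `rand` crate is added as a dependency; the function name is illustrative):

```rust
use rand::seq::SliceRandom;

/// Uniformly sample up to `k` neighbors from a CSR neighbor slice.
/// Caps the fan-out of super nodes during mini-batch construction.
fn sample_neighbors(neighbors: &[u64], k: usize) -> Vec<u64> {
    let mut rng = rand::thread_rng();
    neighbors.choose_multiple(&mut rng, k).copied().collect()
}
```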