40.1. Graph Feature Stores (Topology vs Attributes)

Status: Draft · Version: 1.0.0 · Tags: #GNN, #Graph, #Rust, #Databases · Author: MLOps Team


Table of Contents

  1. The Anatomy of a Graph in MLOps
  2. The Separation of Concerns: Topology vs Attributes
  3. Storage Formats: Adjacency Lists vs CSR
  4. Rust Implementation: MMap Graph Engine
  5. System Architecture: The Graph Store
  6. Partitioning: METIS and Distributed IDs
  7. Infrastructure: RedisGraph vs Neo4j vs Custom
  8. Troubleshooting: Common Data Engineering Issues
  9. Future Trends: Hardware Acceleration
  10. MLOps Interview Questions
  11. Glossary
  12. Summary Checklist

Prerequisites

Before diving into this chapter, ensure you have the following installed:

  • Rust: 1.70+ (memmap2, flatbuffers crates)
  • Graph Tool: metis (for partitioning benchmarks)
  • Python: networkx for visualizations.

The Anatomy of a Graph in MLOps

A Graph $G = (V, E)$ consists of:

  1. Nodes (Vertices): Entities (Users, Items, Transactions).
  2. Edges (Links): Relationships (Bought, Follows, SentMoney).
  3. Node Features: $X \in \mathbb{R}^{|V| \times d}$ (Dense vectors e.g. User Embeddings).
  4. Edge Features: $E_{feat} \in \mathbb{R}^{|E| \times k}$ (Transaction Amount, Timestamp).

The Scale Problem

  • Small Graph: 10k nodes. Fits in Python dgl or PyG Object.
  • Medium Graph: 10M nodes. Fits in 64GB RAM.
  • Giant Graph: 1 Billion nodes. 10 Billion edges. DOES NOT FIT IN RAM.

MLOps Challenge: How do you train a GNN on a 1TB Graph when your GPU only has 32GB VRAM?


The Separation of Concerns: Topology vs Attributes

You cannot efficiently serve “Neighbors” and “Features” from the same database query. Topology (finding neighbors) requires pointer chasing; features (dense vectors) require contiguous reads.

The Golden Rule of Graph Ops:

“Store the Topology in an optimized Graph Format (CSR). Store the Features in a dense Key-Value Store or Columnar file.”

[ Graph Service ]
       |
       |  1. Get Neighbors(Node A) -> [B, C, D]
       v
[ Topology Engine (CSR in RAM/MMap) ] 
       |
       |  2. Get Features([B, C, D]) -> Matrix 3x128
       v
[ Feature Store (Redis / Parquet / LevelDB) ]
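
The two-step flow above can be sketched in Rust. The `TopologyEngine` and `FeatureStore` traits (and the toy in-memory backends) are hypothetical names for illustration, not a real crate's API:

```rust
use std::collections::HashMap;

/// Hypothetical interface for the topology engine (CSR / mmap in practice).
trait TopologyEngine {
    fn get_neighbors(&self, node: u64) -> Vec<u64>;
}

/// Hypothetical interface for the feature store (Redis / Parquet / LevelDB).
trait FeatureStore {
    /// One d-dimensional vector per requested node.
    fn get_features(&self, nodes: &[u64]) -> Vec<Vec<f32>>;
}

/// Toy in-memory backends so the two-step flow is runnable end-to-end.
struct MapTopology(HashMap<u64, Vec<u64>>);
struct MapFeatures(HashMap<u64, Vec<f32>>);

impl TopologyEngine for MapTopology {
    fn get_neighbors(&self, node: u64) -> Vec<u64> {
        self.0.get(&node).cloned().unwrap_or_default()
    }
}

impl FeatureStore for MapFeatures {
    fn get_features(&self, nodes: &[u64]) -> Vec<Vec<f32>> {
        nodes
            .iter()
            .map(|n| self.0.get(n).cloned().unwrap_or_default())
            .collect()
    }
}

/// Step 1: neighbor lookup (pointer chasing). Step 2: batched feature read.
fn fetch_neighborhood(
    topo: &impl TopologyEngine,
    feats: &impl FeatureStore,
    node: u64,
) -> (Vec<u64>, Vec<Vec<f32>>) {
    let neighbors = topo.get_neighbors(node);
    let features = feats.get_features(&neighbors);
    (neighbors, features)
}
```

Keeping the two backends behind separate interfaces is the point: each can be swapped or scaled independently.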

Storage Formats: Adjacency Lists vs CSR

Adjacency List (Naive)

Map of NodeID -> List[NodeID]

  • Pros: Easy to modify (add edge).
  • Cons: Memory fragmentation. Billions of Vec allocations. Cache misses.

CSR (Compressed Sparse Row)

Standard format for High-Performance Computing (HPC) and GNN libraries. Three arrays:

  1. row_ptr: Index where the edges for node $i$ start. (Length: $|V| + 1$).
  2. col_indices: The neighbors. (Length: $|E|$).
  3. values: Edge weights. (Length: $|E|$).

Example: Edges: (0->1), (0->3), (1->2), (2->3)

  • row_ptr: [0, 2, 3, 4, 4] (Node 0 starts at 0, Node 1 starts at 2…)
  • col_indices: [1, 3, 2, 3]

Accessing Node $u$: Neighbors = col_indices[row_ptr[u] .. row_ptr[u+1]] This is a single contiguous memory slice. Extreme CPU cache locality.
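
A minimal sketch of that slicing rule in Rust, checked against the example arrays above (the `neighbors` helper is ours):

```rust
/// Neighbors of node `u` are a single contiguous slice of `col_indices`:
/// no pointer chasing, no allocation, perfect cache locality.
fn neighbors<'a>(row_ptr: &[usize], col_indices: &'a [u64], u: usize) -> &'a [u64] {
    &col_indices[row_ptr[u]..row_ptr[u + 1]]
}
```

With `row_ptr = [0, 2, 3, 4, 4]` and `col_indices = [1, 3, 2, 3]`, node 0's neighbors are `[1, 3]` and node 3's slice is empty.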


Rust Implementation: MMap Graph Engine

We will build a Read-Only CSR engine backed by Memory Mapped files (OS paging). This allows serving a 100GB graph on a laptop with 16GB RAM (letting the OS manage swapping).

Project Structure

graph-store/
├── Cargo.toml
└── src/
    └── lib.rs

Cargo.toml:

[package]
name = "graph-store"
version = "0.1.0"
edition = "2021"

[dependencies]
memmap2 = "0.7"
byteorder = "1.4"
serde = { version = "1.0", features = ["derive"] }

src/lib.rs:

//! High-Performance CSR Graph Store using Memory Mapping.
//! Allows zero-copy access to billion-scale graphs.

use memmap2::Mmap;
use std::fs::File;
use std::path::Path;
use std::slice;

pub struct CSRGraph {
    /// Number of nodes
    num_nodes: usize,
    /// Number of edges
    num_edges: usize,
    /// Memory mapped arrays for Zero-Copy access
    row_ptr_mmap: Mmap,
    col_indices_mmap: Mmap,
}

impl CSRGraph {
    /// Load a CSR graph from disk.
    /// Expects two files: row_ptr.bin and col_idx.bin (little-endian u64).
    pub fn load(path: &str) -> std::io::Result<Self> {
        let row_ptr_file = File::open(Path::new(path).join("row_ptr.bin"))?;
        let col_idx_file = File::open(Path::new(path).join("col_idx.bin"))?;

        // Safety: mmap is unsafe because the file could be truncated or
        // modified by another process while mapped.
        let row_ptr_mmap = unsafe { Mmap::map(&row_ptr_file)? };
        let col_indices_mmap = unsafe { Mmap::map(&col_idx_file)? };

        // We assume the files are valid u64 arrays (8 bytes per entry).
        // In production, you would add a header with Magic Bytes and Version.
        let num_nodes = (row_ptr_mmap.len() / 8) - 1;
        let num_edges = col_indices_mmap.len() / 8;

        println!("Loaded Graph: {} nodes, {} edges", num_nodes, num_edges);

        Ok(Self {
            num_nodes,
            num_edges,
            row_ptr_mmap,
            col_indices_mmap,
        })
    }

    /// Get raw slice of row pointers (u64)
    fn get_row_ptrs(&self) -> &[u64] {
        unsafe {
            slice::from_raw_parts(
                self.row_ptr_mmap.as_ptr() as *const u64,
                self.num_nodes + 1,
            )
        }
    }

    /// Get raw slice of column indices (u64)
    fn get_col_indices(&self) -> &[u64] {
        unsafe {
            slice::from_raw_parts(
                self.col_indices_mmap.as_ptr() as *const u64,
                self.num_edges,
            )
        }
    }

    /// The core operation: Get Neighbors.
    /// Zero-copy, zero-allocation (returns a slice into the mmap).
    /// This function is typically called 1M+ times per second during sampling.
    pub fn get_neighbors(&self, node_id: usize) -> &[u64] {
        if node_id >= self.num_nodes {
            // Graceful handling of out-of-bounds
            return &[];
        }

        // Start and end offsets of the neighbor list in the col_indices array
        let ptrs = self.get_row_ptrs();
        let start = ptrs[node_id] as usize;
        let end = ptrs[node_id + 1] as usize;

        let cols = self.get_col_indices();

        // Bounds check (should never fail if the file is valid CSR)
        if start > end || end > cols.len() {
            return &[];
        }

        &cols[start..end]
    }
}

System Architecture: The Graph Store

How do we construct these binary files?

# storage-layout.yaml
schema:
  nodes: 
    - user: parquet/users/*.parquet
    - item: parquet/items/*.parquet
  edges:
    - clicks: csv/clicks.csv

ETL Pipeline (Spark/Ray):

  1. Read Edges from Data Lake.
  2. Map String IDs (“User_A”) to Integer IDs (0).
  3. Sort edges by Source Node ID.
  4. Compute row_ptr prefix sum.
  5. Write row_ptr.bin and col_idx.bin.
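
Steps 3–5 can be sketched in plain Rust (the `write_csr` helper is hypothetical; a real pipeline would run in Spark/Ray, but the prefix-sum logic is the same). It assumes edges are already sorted by source ID and writes little-endian u64, matching the loader above:

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

/// Sketch of ETL steps 3-5. `sorted_edges` must already be sorted by source.
/// Writes row_ptr.bin and col_idx.bin as little-endian u64.
fn write_csr(dir: &str, num_nodes: usize, sorted_edges: &[(u64, u64)]) -> std::io::Result<()> {
    // Step 4: count out-degrees, then prefix-sum into row_ptr of length |V| + 1.
    let mut row_ptr = vec![0u64; num_nodes + 1];
    for &(src, _) in sorted_edges {
        row_ptr[src as usize + 1] += 1;
    }
    for i in 0..num_nodes {
        row_ptr[i + 1] += row_ptr[i];
    }

    // Step 5: dump both arrays to disk.
    let mut rp = BufWriter::new(File::create(format!("{dir}/row_ptr.bin"))?);
    for v in &row_ptr {
        rp.write_all(&v.to_le_bytes())?;
    }
    let mut ci = BufWriter::new(File::create(format!("{dir}/col_idx.bin"))?);
    for &(_, dst) in sorted_edges {
        ci.write_all(&dst.to_le_bytes())?;
    }
    Ok(())
}
```

For the worked example (edges (0->1), (0->3), (1->2), (2->3) over 4 nodes) this produces a 5-entry row_ptr file and a 4-entry col_idx file.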

Partitioning: METIS and Distributed IDs

For Multi-GPU training, we use Graph Partitioning. We want to split the graph into $K$ parts such that the “Edge Cut” (edges going between partitions) is minimized. Why? Because every edge cut requires network communication between GPUs.

Tools:

  • METIS: The gold standard for partitioning.
  • Hash Partitioning: node_id % K. Fast but terrible locality. Neighbors are scattered everywhere.

Distributed ID Mapping:

  • Global ID: 64-bit Int
  • Partition ID: Top 8 bits.
  • Local ID: Bottom 56 bits. This allows simple routing: Owner(NodeID) = NodeID >> 56.
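
The bit packing above is a few shifts and masks (function names are ours, for illustration):

```rust
/// Pack a 64-bit global ID: top 8 bits = partition, bottom 56 bits = local ID.
fn make_global_id(partition: u8, local: u64) -> u64 {
    debug_assert!(local < (1u64 << 56));
    ((partition as u64) << 56) | local
}

/// Routing: recover the owning partition with a single shift.
fn owner(global_id: u64) -> u8 {
    (global_id >> 56) as u8
}

/// Strip the partition bits to get the ID within that partition.
fn local_id(global_id: u64) -> u64 {
    global_id & ((1u64 << 56) - 1)
}
```

With 8 partition bits you get up to 256 partitions and ~7.2 * 10^16 local IDs each, comfortably above a billion nodes per shard.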

Infrastructure: RedisGraph vs Neo4j vs Custom

| Solution | Type | Pros | Cons |
|---|---|---|---|
| Neo4j | Transactional DB | Cypher Query Language, ACID | Slow for whole-graph ML sampling |
| RedisGraph | In-Memory (Matrix) | Fast linear algebra ops | Limited memory (RAM only) |
| DGL/PyG | DL Framework | Built for ML | Not a database; training only |
| Custom CSR (Rust) | Static File | Maximum speed, zero-copy | Read-only; complex ETL |

Recommendation: Use Neo4j for transactional updates (“User added friend”). Use Spark to dump efficient CSR files nightly for the training cluster. Use Custom Rust Service (like above) for low-latency inference.


Troubleshooting: Common Data Engineering Issues

Scenario 1: Node ID Mapping Hell

  • Symptom: Embeddings look random. Node ID 123 meant User A during training, but at inference it retrieves User B’s vector.
  • Cause: You re-generated the Integer IDs in a different order (e.g. non-deterministic Spark shuffle).
  • Fix: Always persist the string_id -> int_id mapping as an artifact (Parquet file). Use it for inference.

Scenario 2: OOM Loading Graph

  • Symptom: Process killed (OOM) when loading the graph.
  • Cause: You are trying to read() the file into a Vec<u64>. 10 billion edges * 8 bytes = 80GB RAM.
  • Fix: Use mmap. This uses Virtual Memory and only loads active pages.

Scenario 3: Dangling Edges

  • Symptom: get_neighbors returns ID 999999, but num_nodes is 500. IndexOutOfBounds Panic.
  • Cause: Your edge list contains IDs that were filtered out of the Node list (e.g. deleted users).
  • Fix: Run a strict Referential Integrity check step in ETL: assert(edge.dst < num_nodes).
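
One way to sketch that check (the helper name is ours). Returning indices rather than panicking lets the ETL job log and drop offenders:

```rust
/// ETL validation: every edge destination must reference an existing node.
/// Returns the positions of dangling edges in the col_indices array.
fn find_dangling_edges(col_indices: &[u64], num_nodes: u64) -> Vec<usize> {
    col_indices
        .iter()
        .enumerate()
        .filter(|(_, &dst)| dst >= num_nodes)
        .map(|(i, _)| i)
        .collect()
}
```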

Future Trends: Hardware Acceleration

Graph processing is memory-bound (Random Access). New hardware is emerging to solve this:

  1. Graphcore IPUs: Processors with massive on-chip SRAM to store the graph topology, avoiding DRAM latency.
  2. CXL (Compute Express Link): Allows coherent memory sharing between CPU and GPU, enabling massive (TB-scale) unified memory graphs.
  3. NVMe-over-Fabrics: Remote direct access to SSDs for “Disk-based GNNs” (e.g., Marius).

MLOps Interview Questions

  1. Q: Why not use an Adjacency Matrix? A: Complexity $O(V^2)$. A graph with 1B nodes would require $10^{18}$ bytes (1 Exabyte) to store a matrix that is 99.999% zeros. CSR is $O(V + E)$.

  2. Q: How do you handle “Super Nodes” (Celebrities)? A: Justin Bieber has 100M followers. A GNN aggregating neighbors for him will OOM. We must use Neighbor Sampling (pick random 50 neighbors) instead of full aggregation.
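
Neighbor sampling over a slice can be sketched as a single reservoir pass, which gives a uniform sample without replacement in O(k) memory. To keep the sketch dependency-free it uses a tiny LCG instead of the `rand` crate:

```rust
/// Cap a neighbor list at `k` samples to tame super nodes.
fn sample_neighbors(neighbors: &[u64], k: usize, seed: u64) -> Vec<u64> {
    if neighbors.len() <= k {
        return neighbors.to_vec();
    }
    // Toy LCG standing in for a real RNG (do not use for anything security-sensitive).
    let mut state = seed.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    let mut next = |bound: usize| {
        state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (state >> 33) as usize % bound
    };
    // Reservoir sampling (Algorithm R): fill with the first k, then replace
    // element j with probability k / (i + 1).
    let mut reservoir: Vec<u64> = neighbors[..k].to_vec();
    for (i, &n) in neighbors.iter().enumerate().skip(k) {
        let j = next(i + 1);
        if j < k {
            reservoir[j] = n;
        }
    }
    reservoir
}
```

A 100M-follower node thus contributes at most `k` neighbors per aggregation step, bounding both memory and compute.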

  3. Q: What is the difference between Transductive and Inductive GNNs? A: Transductive assumes all nodes are present during training (Node Classification). Inductive (GraphSAGE) learns to generalize to unseen nodes by learning a function of features + neighbors. MLOps loves Inductive.

  4. Q: Explain the “MMap” advantage. A: mmap allows the kernel to page parts of the file into RAM on demand. If we have cold nodes (never accessed), they stay on disk. This is “Virtual Memory” for Graphs.

  5. Q: How do you update a CSR graph? A: You generally don’t. It’s immutable. To update, you use a Log-Structured Merge Tree (LSM) approach: One big read-only CSR + a small mutable Adjacency List (MemTable) for recent updates. Weekly compaction.


Glossary

  • CSR (Compressed Sparse Row): Memory-efficient graph format using 3 arrays.
  • Transductive: Learning on a fixed graph structure.
  • Inductive: Learning to generate embeddings for new nodes without retraining.
  • Metis: Graph partitioning algorithm.
  • Egonet: The subgraph consisting of a central node and its immediate neighbors.

Summary Checklist

  1. Format: Convert raw CSV edge lists to binary CSR format (RowPtr/ColIdx) for 100x speedup.
  2. ID Mapping: Create a robust, versioned pipeline for UUID -> Int64 mapping.
  3. Attributes: Store node features in a memory-mapped Numpy file (.npy) aligned with Node IDs.
  4. Sampling: Ensure your graph engine supports get_neighbors(random=True) for efficient sub-sampling.
  5. Partitioning: If Graph > RAM, use METIS to shard graph across machines.
  6. Validation: Check for “Dangling Edges” (Edge pointing to non-existent Node ID).
  7. Immutability: Treat Graph Snapshots as immutable artifacts. Don’t mutate in place.