46.1. Federated Learning at Scale
The Decentralized Training Paradigm
As privacy regulations tighten (GDPR, CCPA, EU AI Act) and data gravity becomes a more significant bottleneck, the centralized training paradigm—where all data is moved to a central data lake for model training—is becoming increasingly untenable for certain classes of problems. Federated Learning (FL) represents a fundamental shift in MLOps, moving the compute to the data rather than the data to the compute.
In a Federated Learning system, a global model is trained across multiple decentralized edge devices or servers holding local data samples, without exchanging them. This approach addresses critical challenges in privacy, data security, and access rights, but it introduces a new set of massive operational complexities that MLOps engineers must solve.
The Core Architectural Components
A production-grade Federated Learning system consists of four primary architectural layers:
- The Orchestration Server (The Coordinator): This is the central nervous system of the FL topology. It manages the training lifecycle, selects clients for participation, aggregates model updates, and manages the global model versioning.
- The Client Runtime (The Edge): This is the software stack running on the remote device (smartphone, IoT gateway, hospital server, or cross-silo enterprise server). It is responsible for local training, validation, and communication with the coordinator.
- The Aggregation Engine: The mathematical core that combines local model weights or gradients into a global update. This often involves complex secure multi-party computation (SMPC) protocols.
- The Governance & Trust Layer: The security framework that ensures malicious clients cannot poison the model and that the coordinator cannot infer private data from the updates (Differential Privacy).
Federated Learning Topologies
There are two distinct topologies in FL, each requiring different MLOps strategies:
1. Cross-Silo Federated Learning
- Context: A consortium of organizations (typically 2-100 banks or hospitals) collaborating to train a shared model.
- Compute Resources: High-performance servers with GPUs/TPUs.
- Connectivity: High bandwidth, reliable, always-on.
- Data Partitioning: Often non-IID (not independent and identically distributed), but relatively stable.
- State: Stateful clients.
- MLOps Focus: Security, governance, auditability, and precise version control.
2. Cross-Device Federated Learning
- Context: Training on millions of consumer devices (e.g., Android phone keyboards, smart home assistants).
- Compute Resources: Severely constrained (mobile CPUs/NPUs), battery-limited.
- Connectivity: Flaky, intermittent, WiFi-only constraints.
- Data Partitioning: Highly non-IID, unbalanced.
- State: Stateless clients (devices drop in and out).
- MLOps Focus: Scalability, fault tolerance, device profiling, and over-the-air (OTA) efficiency.
Operational Challenges in FL
- Communication Efficiency: Sending full model weights (e.g., a 7B LLM) to millions of devices is infeasible. We need compression, federated dropout, and LoRA adapters.
- System Heterogeneity: Clients have vastly different hardware. Stragglers (slow devices) can stall the entire training round.
- Statistical Heterogeneity: Data on one user’s phone is not representative of the population. This “client drift” causes the optimization to diverge.
- Privacy Attacks: “Model Inversion” attacks can reconstruct training data from gradients. “Membership Inference” can determine if a specific user was in the training set.
46.1.1. Feature Engineering in a Federated World
In centralized ML, feature engineering is a batch process on a data lake. In FL, feature engineering must happen on the device, often in a streaming fashion, using only local context. This creates a “Feature Engineering Consistency” problem.
The Problem of Feature Skew
If the Android team implements a feature extraction logic for “time of day” differently than the iOS team, or differently than the server-side validator, the model will fail silently.
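The skew is easy to reproduce. The sketch below (plain Python, with illustrative function names) shows how a client that derives "time of day" from device-local time silently disagrees with a server-side validator that derives it from the epoch:

```python
from datetime import datetime, timezone, timedelta

def hour_of_day_utc(timestamp_ms: int) -> int:
    # Deterministic: derived from the epoch, independent of device locale
    return (timestamp_ms // 3_600_000) % 24

def hour_of_day_local(timestamp_ms: int, utc_offset_hours: int) -> int:
    # What a client using device-local time would compute
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=tz).hour

ts = 1_700_000_000_000  # the same event, observed by two implementations
print(hour_of_day_utc(ts))       # 22
print(hour_of_day_local(ts, 9))  # 7 -- a client in UTC+9 disagrees silently
```

Both implementations are individually defensible, which is exactly why the divergence never surfaces as an error.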
Solution: Portable Feature Definitions
We need a way to define features as code that can compile to multiple targets (Python for server, Java/Kotlin for Android, Swift for iOS, C++ for embedded).
Implementation Pattern: WASM-based Feature Stores
WebAssembly (WASM) is emerging as the standard for portable feature logic in FL.
// Rust implementation of a portable feature extractor compiling to WASM
// src/lib.rs
use wasm_bindgen::prelude::*;
use serde::{Serialize, Deserialize};

#[derive(Serialize, Deserialize)]
pub struct RawInput {
    pub timestamp_ms: u64,
    pub location_lat: f64,
    pub location_lon: f64,
    pub battery_level: f32,
}

#[derive(Serialize, Deserialize)]
pub struct FeatureVector {
    pub hour_of_day: u8,
    pub is_weekend: bool,
    pub battery_bucket: u8,
    pub location_hash: String,
}

#[wasm_bindgen]
pub fn extract_features(input_json: &str) -> String {
    let input: RawInput = serde_json::from_str(input_json).unwrap();
    // Feature logic is strictly versioned here
    let features = FeatureVector {
        hour_of_day: ((input.timestamp_ms / 3_600_000) % 24) as u8,
        is_weekend: is_weekend(input.timestamp_ms),
        battery_bucket: (input.battery_level * 10.0) as u8,
        location_hash: geohash::encode(
            geohash::Coord { x: input.location_lon, y: input.location_lat },
            5,
        ).unwrap(),
    };
    serde_json::to_string(&features).unwrap()
}

fn is_weekend(ts: u64) -> bool {
    // Deterministic logic independent of device locale.
    // The Unix epoch (day 0) was a Thursday, so Saturday = 2 and Sunday = 3.
    let day = (ts / 86_400_000) % 7;
    day == 2 || day == 3
}
This WASM binary is versioned in the Model Registry and deployed to all clients alongside the model weights. This guarantees that hour_of_day is calculated exactly the same way on a Samsung fridge as it is on an iPhone.
Federated Preprocessing Pipelines
Data normalization (e.g., Z-score scaling) requires global statistics (mean, variance) which no single client possesses.
The Two-Pass Approach:
- Pass 1 (Statistics): The coordinator requests summary statistics (sum, sum_of_squares, count) from a random sample of clients. These are aggregated using Secure Aggregation to produce global mean and variance.
- Pass 2 (Training): The coordinator broadcasts the global scaler (mean, std_dev) to clients. Clients use this to normalize local data before computing gradients.
# Conceptual flow for Federated Statistics with TensorFlow Federated (TFF).
# local_sum_and_count, run_statistics_round, etc. are assumed defined elsewhere.
import tensorflow as tf
import tensorflow_federated as tff

@tff.federated_computation(tff.type_at_clients(tf.float32))
def get_global_statistics(client_data):
    # Each client computes local sum and count
    local_stats = tff.federated_map(local_sum_and_count, client_data)
    # Securely aggregate to get global values
    global_stats = tff.federated_sum(local_stats)
    return global_stats

# The coordinator runs this round first
global_mean, global_std = run_statistics_round(coordinator, client_selector)

# Then broadcasts for training
run_training_round(coordinator, client_selector, preprocessing_metadata={
    'mean': global_mean,
    'std': global_std
})
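For readers not using TFF, the same two-pass arithmetic can be sketched framework-free with NumPy. Clients report only (sum, sum_of_squares, count), and the coordinator recovers exact global statistics; function names here are illustrative:

```python
import numpy as np

# Each client reduces its local feature column to privacy-friendly summaries.
def local_summary(values: np.ndarray) -> tuple:
    return float(values.sum()), float((values ** 2).sum()), len(values)

# The coordinator combines summaries (in production, via Secure Aggregation).
def global_scaler(summaries):
    s = sum(t[0] for t in summaries)
    ss = sum(t[1] for t in summaries)
    n = sum(t[2] for t in summaries)
    mean = s / n
    var = ss / n - mean ** 2  # E[x^2] - E[x]^2
    return mean, var ** 0.5

clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0, 5.0])]
mean, std = global_scaler([local_summary(c) for c in clients])

# Matches centralized statistics exactly:
assert np.isclose(mean, np.mean(np.concatenate(clients)))
assert np.isclose(std, np.std(np.concatenate(clients)))
```

The key property is that the summaries are additive, so they can be combined in any order and pass through a secure summation unchanged.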
46.1.2. Cross-Silo Governance and Architecture
In Cross-Silo FL (e.g., between competing banks for fraud detection), trust is zero. The architecture must enforce that no raw data ever leaves the silo.
The “Sidecar” Architecture for FL Containers
A robust pattern for Cross-Silo FL is deploying a “Federated Sidecar” container into the partner’s Kubernetes cluster. This sidecar has limited egress permissions—it can only talk to the Aggregation Server, and only transmit encrypted gradients.
Reference Architecture: KubeFed for FL
# Kubernetes Deployment for a Federated Client Node
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fl-client-bank-a
  namespace: federated-learning
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fl-client-bank-a
  template:
    metadata:
      labels:
        app: fl-client-bank-a
    spec:
      containers:
        # The actual Training Container (The Worker)
        - name: trainer
          image: bank-a/fraud-model:v1.2
          volumeMounts:
            - name: local-data
              mountPath: /data
              readOnly: true
          # Network isolated - no egress (enforced via NetworkPolicy)
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
        # The FL Sidecar (The Communicator)
        - name: fl-sidecar
          image: federated-platform/sidecar:v2.0
          env:
            - name: AGGREGATOR_URL
              value: "grpcs://aggregator.federated-consortium.com:443"
            - name: CLIENT_ID
              value: "bank-a-node-1"
          # Only this container has egress
      volumes:
        - name: local-data
          persistentVolumeClaim:
            claimName: sensitive-financial-data
The Governance Policy Registry
We need a policy engine (like Open Policy Agent - OPA) to enforce rules on the updates.
Example Policy: Gradient Norm Clipping
To prevent a malicious actor from overwhelming the global model with massive weights (a "model poisoning" attack), we enforce strict clipping norms.
# OPA Policy for FL Updates
package fl.governance

default allow = false

# Allow update if...
allow {
    valid_signature
    gradient_norm_acceptable
    differential_privacy_budget_ok
}

valid_signature {
    # Cryptographic check of the client's identity
    # (crypto.verify is a placeholder for a real signature-verification builtin)
    input.signature == crypto.verify(input.payload, input.cert)
}

gradient_norm_acceptable {
    # Prevent model poisoning by capping the L2 norm of the update
    input.metadata.l2_norm < 5.0
}

differential_privacy_budget_ok {
    # Check if this client has exhausted their "privacy budget" (epsilon)
    input.client_stats.current_epsilon < input.policy.max_epsilon
}
46.1.3. Secure Aggregation Protocols
Secure Aggregation ensures that the server never sees an individual client’s update in the clear. It only sees the sum of the updates.
One-Time Pad Masking (The Google Protocol)
The most common protocol (Bonawitz et al.) works by having pairs of clients exchange Diffie-Hellman keys to generate shared masking values.
- Each client $u$ generates a random self-mask $r_u$ and adds it to their weights $w_u$ (this self-mask is later removed via secret shares, which is what makes dropout recovery possible).
- For every pair of clients $(u, v)$, they agree on a random seed $s_{uv}$ via a Diffie-Hellman key exchange.
- If $u < v$, client $u$ adds $PRG(s_{uv})$ to its masked update; otherwise it subtracts it.
- When the server sums over all clients, the pairwise $PRG(s_{uv})$ terms cancel out, leaving $\sum_u w_u$.
MLOps Implication: If a client drops out during the protocol (which happens 20% of the time in mobile), the sum cannot be reconstructed. Recovery requires complex “secret sharing” (Shamir’s Secret Sharing) to reconstruct the masks of dropped users without revealing their data.
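A toy simulation (insecure seeds, no Diffie-Hellman, purely illustrative) makes both properties concrete: the pairwise masks cancel exactly in the full sum, and a single dropout breaks reconstruction:

```python
import numpy as np

def pair_seed(u: int, v: int) -> int:
    # Toy stand-in for the Diffie-Hellman shared seed s_uv -- NOT secure
    return min(u, v) * 1000 + max(u, v)

def prg(seed: int, dim: int) -> np.ndarray:
    return np.random.default_rng(seed).normal(size=dim)

def masked_update(u: int, clients: list, x: np.ndarray) -> np.ndarray:
    y = x.copy()
    for v in clients:
        if v == u:
            continue
        mask = prg(pair_seed(u, v), len(x))
        y += mask if u < v else -mask  # add if u < v, subtract otherwise
    return y

clients = [0, 1, 2]
xs = {u: np.full(4, float(u + 1)) for u in clients}   # private updates

# Full participation: all pairwise masks cancel, the server sees only the sum.
server_sum = sum(masked_update(u, clients, xs[u]) for u in clients)
assert np.allclose(server_sum, sum(xs.values()))

# Client 2 drops out mid-protocol: the surviving masks no longer cancel.
partial = sum(masked_update(u, clients, xs[u]) for u in [0, 1])
assert not np.allclose(partial, xs[0] + xs[1])
```

The second assertion is precisely the failure mode that Shamir-based mask recovery exists to repair.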
Homomorphic Encryption (HE)
A more robust but computationally expensive approach is Fully Homomorphic Encryption (FHE). The clients encrypt their weights $Enc(w_u)$. The server computes $Enc(W) = \sum Enc(w_u)$ directly on the ciphertexts. The server cannot decrypt the result; only a trusted key holder (or a committee holding key shares) can.
Hardware Acceleration for FHE: Running HE on CPUs is notoriously slow (1000x overhead). We are seeing the rise of “FHE Accelerators” (ASICs and FPGA implementations) specifically for this.
Integration with NVIDIA Flare: NVIDIA Flare offers a pluggable aggregation strategy.
# Custom Aggregator in NVIDIA Flare (sketch; the payload keys are illustrative,
# Shareable is dict-like in the real API)
from nvflare.apis.shareable import Shareable
from nvflare.app_common.abstract.aggregator import Aggregator

class HomomorphicAggregator(Aggregator):
    def __init__(self, he_context):
        super().__init__()
        self.he_context = he_context
        self.encrypted_sum = None

    def accept(self, shareable: Shareable, fl_ctx) -> bool:
        # Received encrypted weights
        enc_weights = shareable["encrypted_weights"]
        if self.encrypted_sum is None:
            self.encrypted_sum = enc_weights
        else:
            # Homomorphic addition: '+' operation on ciphertexts
            self.encrypted_sum = self.he_context.add(
                self.encrypted_sum,
                enc_weights
            )
        return True

    def aggregate(self, fl_ctx) -> Shareable:
        # Return the encrypted sum to the clients for distributed decryption
        result = Shareable()
        result["encrypted_sum"] = self.encrypted_sum
        return result
46.1.4. Update Compression and Bandwidth Optimization
In cross-device FL, bandwidth is the bottleneck. Uploading a 500MB ResNet model update from a phone over 4G is unacceptable.
Techniques for Bandwidth Reduction
- Federated Dropout: Randomly remove 20-40% of neurons for each client. They train a sub-network and upload a smaller sparse vector.
- Ternary Quantization: Quantize gradients to {-1, 0, 1}. This creates extreme compression (from 32-bit float to ~1.6 bits per parameter).
- Golomb Coding: Entropy coding optimized for sparse updates.
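As a sketch of the quantization step (a minimal per-tensor scheme; production systems additionally pack the ternary values into roughly 1.6 bits each and apply entropy coding):

```python
import numpy as np

def ternarize(grad: np.ndarray, threshold: float = 0.5):
    # Quantize a gradient to {-1, 0, +1} times a per-tensor scale
    scale = float(np.abs(grad).mean())
    q = np.zeros_like(grad, dtype=np.int8)
    q[grad > threshold * scale] = 1
    q[grad < -threshold * scale] = -1
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Server-side reconstruction before aggregation
    return q.astype(np.float32) * scale

g = np.array([0.9, -0.7, 0.01, 0.0, -0.02], dtype=np.float32)
q, s = ternarize(g)
print(q)  # [ 1 -1  0  0  0] -- only large-magnitude components survive
```

The client uploads only `q` and the single float `s`; everything below the threshold is dropped, which is also where the sparsity exploited by Golomb coding comes from.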
Differential Privacy (DP) as a Service
DP adds noise to the gradients to mask individual contributions. This is often parameterized by $\epsilon$ (epsilon).
- Local DP: Noise added on the device. High privacy, high utility loss.
- Central DP: Noise added by the trusted aggregator.
- Distributed DP: Noise added by shuffling or secure aggregation so the aggregator never sees raw values.
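The noise calibration is the same in all three placements; only who adds the noise differs. A minimal Gaussian-mechanism sketch, using the standard analytic bound $\sigma \ge \Delta \sqrt{2 \ln(1.25/\delta)} / \epsilon$ for an L2-clipped update of sensitivity $\Delta$ (variable names are illustrative):

```python
import numpy as np

def gaussian_mechanism(value: np.ndarray, sensitivity: float,
                       epsilon: float, delta: float,
                       seed: int = 0) -> np.ndarray:
    # Calibrate sigma to the privacy parameters, then add Gaussian noise
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    rng = np.random.default_rng(seed)
    return value + rng.normal(0.0, sigma, size=value.shape)

# Because the update was L2-clipped on the device, its sensitivity is known.
clipped_update = np.array([0.3, -0.1, 0.7])
noisy = gaussian_mechanism(clipped_update, sensitivity=1.0,
                           epsilon=1.0, delta=1e-5)
```

In Local DP each client runs this on its own update; in Central DP the aggregator runs it once on the sum, which is why the utility loss is so much lower there.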
Managing the Privacy Budget
In MLOps, $\epsilon$ is a resource like CPU or RAM. Each query to the data consumes budget. When the budget is exhausted, the data "locks."
Tracking Epsilon in MLflow:
import mlflow

def log_privacy_metrics(round_id, used_epsilon, total_delta):
    mlflow.log_metric("privacy_epsilon", used_epsilon, step=round_id)
    mlflow.log_metric("privacy_delta", total_delta, step=round_id)
    if used_epsilon > MAX_EPSILON:
        alert_governance_team("Privacy Budget Exceeded")
        stop_training()
46.1.5. Tools of the Trade: The FL Ecosystem
Open Source Frameworks
| Framework | Backer | Strength | Best For |
|---|---|---|---|
| TensorFlow Federated (TFF) | Google | Research, Simulation | Research verification of algorithms |
| PySyft | OpenMined | Privacy, Encryption | Heavy privacy requirements, healthcare |
| Flower (Flwr) | Independent | Mobile, Heterogeneous | Production deployment to iOS/Android |
| NVIDIA Flare | NVIDIA | Hospital/Medical Imaging | Cross-silo, HPC integration |
| FATE | WeBank | Fintech | Financial institution interconnects |
Implementing a Flower Client on Android
Flower is becoming the de facto standard for mobile deployment because it is ML-framework agnostic (supports TFLite, PyTorch Mobile, etc.).
Android (Kotlin) Client Stub:
class MyFlowerClient(
    private val tflite: Interpreter,
    private val data: List<FloatArray>,
    private val testData: List<FloatArray>
) : Client {

    override fun getParameters(): Array<ByteBuffer> {
        // Extract weights from the TFLite model
        return tflite.getWeights()
    }

    override fun fit(
        parameters: Array<ByteBuffer>,
        config: Config
    ): FitRes {
        // 1. Update local model with global parameters
        tflite.updateWeights(parameters)
        // 2. Train on local data (On-Device Training)
        val loss = trainOneEpoch(tflite, data)
        // 3. Return updated weights to the server
        return FitRes(
            tflite.getWeights(),
            data.size,
            mapOf("loss" to loss)
        )
    }

    override fun evaluate(
        parameters: Array<ByteBuffer>,
        config: Config
    ): EvaluateRes {
        // Validation step on held-out local data
        tflite.updateWeights(parameters)
        val (loss, accuracy) = runInference(tflite, testData)
        return EvaluateRes(loss, testData.size, mapOf("acc" to accuracy))
    }
}
46.1.6. Over-the-Air (OTA) Management for FL
Managing the lifecycle of FL binaries is closer to MDM (Mobile Device Management) than standard Kubernetes deployments.
Versioning Matrix
You must track:
- App Version: The version of the binary (APK/IPA) installed on the phone.
- Runtime Version: The version of the FL library (e.g., Flower v1.2.0).
- Model Architecture Version: “MobileNetV2_Quantized_v3”.
- Global Model Checkpoint: “Round_452_Weights”.
If a client has an incompatible App Version (e.g., an old feature extractor), it must be rejected from the training round to prevent polluting the global model.
The Client Registry
A DynamoDB table usually serves as the state store for millions of clients.
{
  "client_id": "uuid-5521...",
  "device_class": "high-end-android",
  "battery_status": "charging",
  "wifi_status": "connected",
  "app_version": "2.4.1",
  "last_seen": "2024-03-20T10:00:00Z",
  "eligibility": {
    "can_train": true,
    "rejection_reason": null
  }
}
The Selector Service queries this table:
“Give me 1000 clients that are charging, on WiFi, running app version > 2.4, and have at least 2GB of RAM.”
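An in-memory sketch of that predicate (field names are illustrative; a production Selector Service would push these filters into the DynamoDB query itself):

```python
def select_cohort(clients, n=1000, min_app_version=(2, 4), min_ram_gb=2):
    # Filter the registry down to clients eligible for this round
    def eligible(c):
        version = tuple(int(x) for x in c["app_version"].split(".")[:2])
        return (c["battery_status"] == "charging"
                and c["wifi_status"] == "connected"
                and version >= min_app_version
                and c.get("ram_gb", 0) >= min_ram_gb)
    return [c["client_id"] for c in clients if eligible(c)][:n]

fleet = [
    {"client_id": "a", "battery_status": "charging",
     "wifi_status": "connected", "app_version": "2.4.1", "ram_gb": 4},
    {"client_id": "b", "battery_status": "discharging",
     "wifi_status": "connected", "app_version": "2.5.0", "ram_gb": 4},
]
print(select_cohort(fleet))  # ['a'] -- client b is not charging
```

In practice the charging/WiFi fields go stale quickly, so the selection is re-validated by the client itself before training starts.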
46.1.7. FL-Specific Monitoring
Standard metrics (latency, error rate) are insufficient. We need FL Telemetry.
- Client Drop Rate: What % of clients disconnect mid-round? High drop rates indicate the training job is too heavy for the device.
- Straggler Index: The distribution of training times. The “tail latency” (p99) determines the speed of global convergence.
- Model Divergence: The distance (Euclidean or Cosine) between a client’s update and the global average. A sudden spike indicates “Model Poisoning” or a corrupted client.
- Cohort Fairness: Are we only training on high-end iPhones? We must monitor the distribution of participating device types to ensure the model works on budget Android phones too.
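Model divergence is the cheapest of these signals to compute. A minimal sketch using cosine distance between a client update and the running global average (the thresholds are illustrative):

```python
import numpy as np

def cosine_divergence(update: np.ndarray, global_avg: np.ndarray) -> float:
    # 1 - cosine similarity: near 0 = aligned, near 2 = opposite/adversarial
    num = float(update @ global_avg)
    denom = float(np.linalg.norm(update) * np.linalg.norm(global_avg)) + 1e-12
    return 1.0 - num / denom

avg = np.array([1.0, 1.0, 0.0])          # running global average update
honest = np.array([0.9, 1.1, 0.1])       # typical client: small deviation
poisoned = -10.0 * avg                   # sign-flipped, scaled-up attack

assert cosine_divergence(honest, avg) < 0.1
assert cosine_divergence(poisoned, avg) > 1.9   # flag for exclusion
```

A sudden fleet-wide spike in this metric usually means a bad binary rollout; a spike from a handful of clients looks like poisoning.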
Visualizing Client Drift
We often use dimensionality reduction (t-SNE or PCA) on the updates (gradients) sent by clients.
- Cluster Analysis: If clients cluster tightly into 2 or 3 distinct groups, it suggests we have distinct data distributions (e.g., “Day Users” vs “Night Users”, or “bimodal usage patterns”).
- Action: This signals the need for Personalized Federated Learning, where we might train separate models for each cluster rather than forcing a single global average.
46.1.8. Checklist for Production Readiness
- Client Selection: Implemented logic to only select devices on WiFi/Charging.
- Versioning: Host/Client compatibility checks in place.
- Bandwidth: Gradient compression (quantization/sparsification) active.
- Privacy: Differential Privacy budget tracking active.
- Security: Secure Aggregation enabled; model updates signed.
- Fallbacks: Strategy for when >50% of clients drop out of a round.
- Evaluation: Federated evaluation rounds separate from training rounds.
46.1.9. Deep Dive: Mathematical Foundations of Secure Aggregation
To truly understand why FL is “secure,” we must prove the mathematical guarantees of the aggregation protocols.
The Bonawitz Algorithm (2017) Detailed
Let $U$ be the set of users. For each pair of users $(u, v)$, they agree on a symmetric key $s_{uv}$. The value $u$ adds to their update $x_u$ is: $$ y_u = x_u + \sum_{v > u} PRG(s_{uv}) - \sum_{v < u} PRG(s_{uv}) $$
When the server sums $y_u$: $$ \sum_u y_u = \sum_u x_u + \sum_u (\sum_{v > u} PRG(s_{uv}) - \sum_{v < u} PRG(s_{uv})) $$
The double summation terms cancel out exactly.
- Proof: For every pair $\{i, j\}$ with $i < j$, the term $PRG(s_{ij})$ is added exactly once (by $i$, the smaller index) and subtracted exactly once (by $j$, the larger index).
- Result: The server sees $\sum x_u$ but sees nothing about an individual $x_u$, provided that at least one honest participant exists in the summation who keeps their $s_{uv}$ secret.
Differential Privacy: The Moments Accountant
Standard Composition theorems for DP are too loose for deep learning (where we might do 10,000 steps). The Moments Accountant method tracks the specific privacy loss random variable and bounds its moments.
Code Implementation: DP-SGD Optimizer from Scratch
import torch
from torch.optim import Optimizer

class DPSGD(Optimizer):
    def __init__(self, params, lr=0.1, noise_multiplier=1.0, max_grad_norm=1.0):
        defaults = dict(lr=lr, noise_multiplier=noise_multiplier,
                        max_grad_norm=max_grad_norm)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        """
        Performs a single optimization step with Differential Privacy.
        1. Clip gradients (per sample).
        2. Add Gaussian noise.
        3. Apply the update.
        """
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                # 1. Gradient clipping
                # Note: in vanilla PyTorch, p.grad is already the MEAN over the
                # batch. True DP-SGD needs per-sample gradients ("ghost clipping"
                # or the Opacus library). This is a simplified batch-level view
                # for illustration.
                grad_norm = p.grad.norm(2)
                clip_coef = group['max_grad_norm'] / (grad_norm + 1e-6)
                clip_coef = torch.clamp(clip_coef, max=1.0)
                p.grad.mul_(clip_coef)
                # 2. Add noise calibrated to the clipping norm
                # (per-sample DP-SGD would further divide the std by batch size)
                noise = torch.normal(
                    mean=0.0,
                    std=group['noise_multiplier'] * group['max_grad_norm'],
                    size=p.grad.shape,
                    device=p.grad.device
                )
                # 3. Apply the update
                p.data.add_(p.grad + noise, alpha=-group['lr'])
46.1.10. Operational Playbook: Handling Failures
In a fleet of 10 million devices, “rare” errors happen every second.
Scenario A: The “Poisoned Model” Rollback
Symptoms:
- Global model accuracy drops by 20% in one round.
- Validation loss spikes to NaN.
Root Cause:
- A malicious actor injected gradients to maximize error (Byzantine attack).
- OR: a software bug in `ExtractFeatures` caused integer overflow on a specific Android version.

Recovery Protocol:
1. Stop the Coordinator: `systemctl stop fl-server`.
2. Identify the Bad Round: Look at the "Model Divergence" metric in Grafana.
3. Rollback: `git checkout models/global_v451.pt` (the last good state).
4. Device Ban: Identify the Client IDs that participated in Round 452. Mark them as `SUSPENDED` in DynamoDB.
5. Resume: Restart the coordinator with the old weights.
Scenario B: The “Straggler” Gridlock
Symptoms:
- Round 105 has been running for 4 hours (average is 5 mins).
- Waiting on 3 clients out of 1000.
Root Cause:
- Clients are on weak WiFi or have gone offline without sending `FIN`.

Recovery Protocol:
1. Timeouts: Set a strict `round_timeout_seconds = 600`.
2. Partial Aggregation: If >80% of clients have reported, close the round and ignore the stragglers.
3. Trade-off: This biases the model towards "fast devices" (new iPhones), potentially hurting performance on "slow devices" (old Androids). This is a fairness issue.
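The timeout and partial-aggregation rules above reduce to a small predicate the coordinator can evaluate on every heartbeat (a sketch; names and defaults are illustrative):

```python
import time

def should_close_round(reported: int, cohort_size: int, started_at: float,
                       quorum: float = 0.8,
                       round_timeout_seconds: float = 600.0) -> bool:
    # Close early once a quorum has reported...
    if reported >= quorum * cohort_size:
        return True
    # ...or when the hard timeout expires, whichever comes first.
    return (time.monotonic() - started_at) >= round_timeout_seconds

now = time.monotonic()
print(should_close_round(801, 1000, started_at=now))  # True  (quorum met)
print(should_close_round(500, 1000, started_at=now))  # False (keep waiting)
```

The fairness trade-off lives entirely in the `quorum` parameter: the lower it is, the faster rounds close and the more the model tilts toward fast devices.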
46.1.11. Reference Architecture: Terraform for Cross-Silo FL
Setting up a secure aggregation server on AWS with enclave support.
# main.tf
provider "aws" {
  region = "us-east-1"
}

# 1. The Coordinator Enclave (Nitro Enclaves)
resource "aws_instance" "fl_coordinator" {
  ami           = "ami-0c55b159cbfafe1f0" # Amazon Linux 2 with Nitro Enclave support
  instance_type = "m5.xlarge"             # Nitro supported

  enclave_options {
    enabled = true
  }

  iam_instance_profile   = aws_iam_instance_profile.fl_coordinator_profile.name
  vpc_security_group_ids = [aws_security_group.fl_sg.id]

  user_data = <<-EOF
    #!/bin/bash
    yum install -y aws-nitro-enclaves-cli aws-nitro-enclaves-cli-devel
    systemctl enable nitro-enclaves-allocator.service
    systemctl start nitro-enclaves-allocator.service
    # Allocate hugepages for the enclave: 2 vCPUs, 6GB RAM
    nitro-cli run-enclave --cpu-count 2 --memory 6144 \
      --eif-path /home/ec2-user/server.eif \
      --enclave-cid 10
  EOF
}

# 2. The Client Registration Table
resource "aws_dynamodb_table" "fl_clients" {
  name         = "fl-client-registry"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "client_id"
  range_key    = "last_seen_timestamp"

  attribute {
    name = "client_id"
    type = "S"
  }

  attribute {
    name = "last_seen_timestamp"
    type = "N"
  }

  ttl {
    attribute_name = "ttl"
    enabled        = true
  }
}

# 3. Model Storage (Checkpointing)
resource "aws_s3_bucket" "fl_models" {
  bucket = "enterprise-fl-checkpoints-v1"
}

resource "aws_s3_bucket_versioning" "fl_models_ver" {
  bucket = aws_s3_bucket.fl_models.id
  versioning_configuration {
    status = "Enabled"
  }
}
46.1.12. Vendor Landscape Analysis (2025)
| Vendor | Product | Primary Use Case | Deployment Model | Pricing |
|---|---|---|---|---|
| NVIDIA | Flare (NVFlare) | Medical Imaging, Financial Services | Self-Hosted, sidecar container | Open Source / Enterprise Support |
| HPE | Swarm Learning | Blockchain-based FL (Decentralized Coordinator) | On-Prem / Edge | Licensing |
| Google | Gboard FL | Mobile Keyboards (internal tech, now public via TFF) | Mobile (Android) | Free (OSS) |
| Sherpa.ai | Sherpa | Privacy-Preserving AI | SaaS / Hybrid | Enterprise |
| OpenMined | PyGrid | Research & Healthcare | Self-Hosted | Open Source |
Feature Comparison: NVFlare vs. Flower
NVIDIA Flare:
- Architecture: Hub-and-Spoke with strict “Site” definitions.
- Security: Built-in support for HA (High Availability) and Root-of-Trust.
- Simulators: Accurate simulation of multi-threaded clients on a single GPU.
- Best For: When you control the nodes (e.g., 5 hospitals).
Flower:
- Architecture: Extremely lightweight client (just a callback function).
- Mobile: First-class support for iOS/Android/C++.
- Scaling: Tested up to 10M concurrent clients.
- Best For: When you don’t control the nodes (Consumer devices).
46.1.13. Future Trends: Federated LLMs
Evaluating the feasibility of training Llama-3 (70B) via FL.
The Bottleneck:
- Parameter size: 140GB (BF16).
- Upload speed: 20Mbps (Consumer Uplink).
- Time to upload one update: $140,000 \text{ MB} / 2.5 \text{ MB/s} \approx 56,000 \text{ seconds} \approx 15 \text{ hours}$.
- Conclusion: Full fine-tuning of LLMs on consumer edge is impossible today.
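The arithmetic is worth sanity-checking in code:

```python
# Back-of-envelope check of the upload-time estimate above
model_mb = 70e9 * 2 / 1e6   # 70B params x 2 bytes (BF16) = 140,000 MB
uplink_mb_s = 20 / 8        # 20 Mbps consumer uplink ~= 2.5 MB/s
seconds = model_mb / uplink_mb_s
print(f"{seconds:,.0f} s ~= {seconds / 3600:.1f} h")  # 56,000 s ~= 15.6 h
```

And that is for a single round, per client, ignoring retries on flaky connections.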
The Solution: PEFT + QLoRA
- Instead of updating 70B params, we update LoRA Adapters (Rank 8).
- Adapter Size: ~10MB.
- Upload time: 4 seconds.
- Architecture:
- Frozen Backbone: The 70B weights are pre-loaded on the device (or streamed).
- Trainable Parts: Only the Adapter matrices $A$ and $B$.
- Aggregation: The server aggregates only the adapters.
# Federated PEFT Configuration (concept)
from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1
)

# On the client
def train_step(model, batch):
    # Only gradients for 'lora_A' and 'lora_B' are computed
    loss = model(**batch).loss  # HF causal-LM style forward pass
    loss.backward()
    # Extract only the adapter gradients for transmission
    adapter_grads = {k: v.grad for k, v in model.named_parameters() if "lora" in k}
    return adapter_grads
46.1.14. Case Study: Predictive Maintenance in Manufacturing
- The User: Global Heavy Industry Corp (GHIC).
- The Assets: 5,000 Wind Turbines across 30 countries.
- The Problem: Turbine vibration data is 10TB/day. Satellite internet is expensive ($10/GB).
The FL Solution:
- Edge Compute: NVIDIA Jetson mounted on every turbine.
- Local Training: An Autoencoder learns the “Normal Vibration Pattern” for that specific turbine.
- Federated Round: Every night, turbines send updates to a global “Anomaly Detector” model.
- Bandwidth Savings:
- Raw Data: 10TB/day.
- Model Updates: 50MB/day.
- Cost Reduction: 99.9995%.
Outcome: GHIC detected a gearbox failure signature in the North Sea turbines (high wind) and propagated the learned pattern to the Brazil turbines (low wind) before the Brazil turbines experienced the failure conditions.
46.1.15. Anti-Patterns in Federated Learning
1. “Just use the Centralized Hyperparameters”
- Mistake: Using `lr=0.001` because it worked on the data lake.
- Reality: FL optimization landscapes are "bumpy" due to non-IID data. You often need a Server Learning Rate (for applying the aggregated update to the global model) separate from the Client Learning Rate.
2. “Assuming Client Availability”
- Mistake: Waiting for specific high-value clients to report.
- Reality: Clients die. Batteries die. WiFi drops. Your system must be statistically robust to any subset of clients disappearing.
3. “Ignoring System Heterogeneity”
- Mistake: Sending the same model to a standard iPhone 15 and a budget Android.
- Reality: The Android runs out of RAM (OOM) and crashes. You have biased your model towards rich users.
- Fix: Ordered Dropout. Structure the model so that “first 50% layers” is a valid sub-model for weak devices, and “100% layers” is for strong devices.
4. “Leakage via Metadata”
- Mistake: Encrypting the gradients but leaving the `client_id` and `timestamp` visible.
- Reality: Side-channel attack. "This client sends updates at 3 AM" -> "User is an insomniac."
46.1.16. Checklist: The Zero-Trust FL Deployment
Security Audit
- Attestation: Does the server verify the client runs a signed binary? (Android SafetyNet / iOS DeviceCheck).
- Man-in-the-Middle: Is TLS 1.3 pinned?
- Model Signing: Are global weights signed by the server private key?
Data Governance
- Right to be Forgotten: If User X deletes their account, can we “unlearn” their contribution? (Machine Unlearning is an active research field; typical answer: “Re-train from checkpoint before User X joined”).
- Purpose Limitation: Are we ensuring the model learns "Keyboard Prediction" and not "Credit Card Numbers"?
Performance
- Quantization: Are we using INT8 transfer?
- Caching: Do clients cache the dataset locally to avoid re-reading from flash storage every epoch?
Federated Learning allows us to unlock the “Dark Matter” of data—the petabytes of private, sensitive data living on edges that will never see a cloud data lake. It is the ultimate frontier of decentralized MLOps.