41.4. Reality Gap Measurement

Status: Draft Version: 1.0.0 Tags: #Sim2Real, #Evaluation, #Metrics, #Python, #PyTorch Author: MLOps Team


Table of Contents

  1. The Silent Killer: Overfitting to Simulation
  2. Quantifying the Gap: $KL(P_{sim} || P_{real})$
  3. Visual Metrics: FID and KID
  4. Dynamics Metrics: Trajectory Divergence
  5. Python Implementation: SimGap Evaluator
  6. Closing the Gap: System Identification (SysID)
  7. Infrastructure: The Evaluation Loop
  8. Troubleshooting: “My Simulator is Perfect” (It is not)
  9. Future Trends: Real-to-Sim GAN
  10. MLOps Interview Questions
  11. Glossary
  12. Summary Checklist

Prerequisites

Before diving into this chapter, ensure you have the following installed:

  • Python: torch, torchvision, scipy.
  • Data: A folder of Real Images and a folder of Sim Images.

The Silent Killer: Overfitting to Simulation

You train a robot to walk in Unity. It walks perfectly. You deploy it. It falls immediately. Why? The simulation floor was perfectly flat; the real floor has bumps. The simulation friction was a constant 0.8; the real floor has dust.

The Reality Gap is the statistical distance between the distribution of states in Simulation $P_{sim}(s)$ and Reality $P_{real}(s)$. We cannot optimize what we cannot measure. We need a “Gap Score”.


Quantifying the Gap: $KL(P_{sim} || P_{real})$

Ideally, we want the Kullback-Leibler (KL) Divergence. But we don’t have the probability density functions. We only have samples (Images, Trajectories).

Two Axes of Divergence:

  1. Visual Gap: The images look different (Lighting, Texture).
  2. Dynamics Gap: The physics feel different (Mass, Friction, Latency).

Advanced Math: Wasserstein Distance (Earth Mover’s)

KL Divergence fails when the supports of the two distributions do not overlap: the divergence becomes infinite and provides no useful gradient. The Wasserstein Metric ($W_1$) is robust to non-overlapping support. It measures the "work" needed to transport the probability mass of $P_{sim}$ to match $P_{real}$:

$$ W_1(P_r, P_s) = \inf_{\gamma \in \Pi(P_r, P_s)} \mathbb{E}_{(x,y) \sim \gamma} [\,||x - y||\,] $$
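
For one-dimensional summaries of the dynamics (say, peak joint torque per episode), $W_1$ can be estimated directly from samples. A minimal sketch using SciPy; the torque arrays here are synthetic stand-ins, not real measurements:

import numpy as np
from scipy.stats import wasserstein_distance

# Hypothetical 1D samples: peak joint torque per episode
torque_sim = np.random.normal(loc=2.0, scale=0.3, size=1000)   # simulated episodes
torque_real = np.random.normal(loc=2.4, scale=0.5, size=200)   # real-robot episodes

# W1 for 1D empirical distributions = area between the two empirical CDFs
gap = wasserstein_distance(torque_real, torque_sim)
print(f"W1 dynamics gap (torque): {gap:.3f}")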


Visual Metrics: FID and KID

FID (Fréchet Inception Distance): the standard metric for comparing image distributions, originally introduced to evaluate GANs.

  1. Feed Real Images into InceptionV3. Get Activations $A_{real}$.
  2. Feed Sim Images into InceptionV3. Get Activations $A_{sim}$.
  3. Compute Mean ($\mu$) and Covariance ($\Sigma$) of activations.
  4. $FID = ||\mu_r - \mu_s||^2 + Tr(\Sigma_r + \Sigma_s - 2(\Sigma_r \Sigma_s)^{1/2})$.

Interpretation:

  • FID = 0: Perfect Match.
  • FID < 50: Good Domain Randomization coverage.
  • FID > 100: Huge Gap. Robot will fail.

Python Implementation: SimGap Evaluator

We write a tool to compute FID between two folders.

Project Structure

simgap/
├── main.py
└── metrics.py

metrics.py:

import torch
import torch.nn as nn
from torchvision.models import inception_v3
from scipy import linalg
import numpy as np
from torch.utils.data import DataLoader, TensorDataset

class FIDEvaluator:
    def __init__(self, device='cuda'):
        self.device = device
        # Load InceptionV3, remove classification head
        # We use the standard pre-trained weights from ImageNet
        self.model = inception_v3(pretrained=True, transform_input=False).to(device)
        self.model.fc = nn.Identity() # Replace last layer with Identity to get features
        self.model.eval()

    def get_activations(self, dataloader):
        acts = []
        with torch.no_grad():
            for batch in dataloader:
                # DataLoaders over (image, label) datasets yield tuples; keep only the image tensor
                if isinstance(batch, (list, tuple)):
                    batch = batch[0]
                batch = batch.to(self.device)
                # Inception expects 299x299 images normalized with ImageNet statistics
                pred = self.model(batch)  # (B, 2048) pooled features, since fc is Identity
                acts.append(pred.cpu().numpy())
        return np.concatenate(acts, axis=0)

    def calculate_fid(self, real_loader, sim_loader):
        print("Computing Real Activations...")
        act_real = self.get_activations(real_loader)
        mu_real, sigma_real = np.mean(act_real, axis=0), np.cov(act_real, rowvar=False)

        print("Computing Sim Activations...")
        act_sim = self.get_activations(sim_loader)
        mu_sim, sigma_sim = np.mean(act_sim, axis=0), np.cov(act_sim, rowvar=False)

        # Calculate FID Equation
        # ||mu_1 - mu_2||^2 + Tr(C_1 + C_2 - 2*sqrt(C_1*C_2))
        diff = mu_real - mu_sim
        covmean, _ = linalg.sqrtm(sigma_real.dot(sigma_sim), disp=False)
        
        # Numerical instability fix
        if np.iscomplexobj(covmean):
            covmean = covmean.real

        fid = diff.dot(diff) + np.trace(sigma_real + sigma_sim - 2 * covmean)
        return fid

main.py:

from metrics import FIDEvaluator
import torch
# ... data loading logic ...

def evaluate_pipeline(sim_folder, real_folder):
    evaluator = FIDEvaluator()
    # Assume get_loaders creates normalized tensors (3, 299, 299)
    # real_loader = ...
    # sim_loader = ...
    
    # fid_score = evaluator.calculate_fid(real_loader, sim_loader)
    # print(f"Reality Gap (FID): {fid_score:.2f}")
    
    # if fid_score > 50.0:
    #    print("FAIL: Visual Gap is too large. Increase Domain Randomization.")
    #    exit(1)
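
The data loading is deliberately left as a stub above. One possible way to build such loaders, assuming flat folders of RGB image files; the helper names and the ImageNet-style preprocessing are illustrative choices, not part of the chapter's pipeline:

import glob
import os
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms

# InceptionV3 preprocessing: 299x299 resize, ImageNet normalization
PREPROCESS = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

class ImageFolderDataset(Dataset):
    """Loads every *.png / *.jpg in a flat folder as a normalized RGB tensor."""
    def __init__(self, folder):
        self.paths = sorted(glob.glob(os.path.join(folder, "*.png"))
                            + glob.glob(os.path.join(folder, "*.jpg")))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        return PREPROCESS(img)

def get_loader(folder, batch_size=32):
    return DataLoader(ImageFolderDataset(folder), batch_size=batch_size, shuffle=False)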

Dynamics Metrics: Trajectory Divergence

Visuals aren’t everything. Metric: NRMSE (Normalized Root Mean Square Error) of 3D position.

  1. Record Real Trajectory: $T_{real} = [(x_0, y_0, z_0), \dots, (x_N, y_N, z_N)]$.
  2. Replay the same Controls in Sim.
  3. Record Sim Trajectory: $T_{sim}$.
  4. $NRMSE = \frac{1}{L} \sqrt{\frac{1}{N} \sum_i ||T_{real}[i] - T_{sim}[i]||^2}$, where $L$ is a normalization scale (e.g. the workspace diameter).

Challenge: Alignment. Both logs nominally start at $t=0.0$, but the real world has roughly 20ms of sensing and actuation lag. You must time-align the signals using cross-correlation before computing the error, as in the sketch below.
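
One possible alignment-plus-scoring routine, assuming both trajectories are (N, 3) NumPy position arrays sampled at the same rate dt and that the normalization scale (e.g. workspace diameter) is supplied by the caller:

import numpy as np
from scipy import signal

def align_and_score(real_traj, sim_traj, dt=0.01, norm_scale=1.0):
    """Estimate the lag between two (N, 3) position trajectories via
    cross-correlation of their speed profiles, align them, then compute NRMSE."""
    real_speed = np.linalg.norm(np.diff(real_traj, axis=0), axis=1) / dt
    sim_speed = np.linalg.norm(np.diff(sim_traj, axis=0), axis=1) / dt

    corr = signal.correlate(real_speed - real_speed.mean(),
                            sim_speed - sim_speed.mean(), mode="full")
    lags = signal.correlation_lags(len(real_speed), len(sim_speed), mode="full")
    lag = lags[np.argmax(corr)]  # positive lag: the real signal is delayed vs. sim

    # Shift the delayed signal and crop both to the overlapping window
    if lag >= 0:
        real_c, sim_c = real_traj[lag:], sim_traj
    else:
        real_c, sim_c = real_traj, sim_traj[-lag:]
    n = min(len(real_c), len(sim_c))
    real_c, sim_c = real_c[:n], sim_c[:n]

    rmse = np.sqrt(np.mean(np.sum((real_c - sim_c) ** 2, axis=1)))
    return rmse / norm_scale, lag * dt  # NRMSE and estimated lag in seconds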


Closing the Gap: System Identification (SysID)

If the Gap is large (Error > 10cm), we must tune the simulator. We treat the Simulator Parameters ($\theta = [mass, friction, drag]$) as hyperparameters to optimize.

Algorithm: CMA-ES for SysID

  1. Goal: Find $\theta$ that minimizes $Error(T_{real}, T_{sim}(\theta))$.
  2. Sample population of $\theta$ (e.g. friction=0.5, 0.6, 0.7).
  3. Run Sim for each.
  4. Compute Error against Real Logs.
  5. Update distribution of $\theta$ towards the best candidates.
  6. Repeat until the error converges (a minimal sketch follows below).
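
A compact sketch of this loop. For brevity it uses a cross-entropy-method-style update (sample, keep elites, refit the Gaussian) rather than full CMA-ES, and the helpers run_sim_trajectory and trajectory_error are hypothetical wrappers around your simulator and the NRMSE metric above:

import numpy as np

def sysid_search(real_traj, controls, n_iters=20, pop_size=16, elite_frac=0.25):
    """Population search over sim parameters theta = [mass, friction, drag].
    run_sim_trajectory() and trajectory_error() are assumed helpers."""
    mu = np.array([1.0, 0.6, 0.05])       # initial guess for [mass, friction, drag]
    sigma = np.array([0.3, 0.2, 0.02])    # initial search spread
    n_elite = max(1, int(pop_size * elite_frac))

    for it in range(n_iters):
        # 2. Sample a population of candidate parameters
        thetas = mu + sigma * np.random.randn(pop_size, 3)
        # 3-4. Run the sim for each candidate and score it against the real log
        errors = np.array([
            trajectory_error(real_traj, run_sim_trajectory(theta, controls))
            for theta in thetas
        ])
        # 5. Move the sampling distribution toward the best candidates
        elite = thetas[np.argsort(errors)[:n_elite]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-4
        print(f"iter {it}: best error {errors.min():.4f}, mu {mu}")

    return mu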

Residual Physics Networks

Sometimes Sim can never match Real (e.g. complex aerodynamics). We learn a residual term:

$$ s_{t+1} = Sim(s_t, a_t; \theta) + \delta(s_t, a_t; \phi) $$

$\delta$ is a small Neural Network trained on Real Data residuals.

import torch
import torch.nn as nn

class ResidualPhysics(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64),
            nn.ReLU(),
            nn.Linear(64, state_dim) # Output: Delta s
        )
    
    def forward(self, state, action):
        return self.fc(torch.cat([state, action], dim=1))
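
Fitting $\delta$ is plain supervised regression on logged real transitions. A sketch, assuming a DataLoader of (state, action, real next state) tensors and a sim_step function that returns the nominal simulator prediction (both assumed here):

def train_residual(model, loader, sim_step, epochs=10, lr=1e-3, weight_decay=1e-4):
    """Fit delta(s, a) to the gap between real next-states and the simulator's
    prediction. weight_decay keeps the residual small (see Scenario 3 below)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = nn.MSELoss()
    for epoch in range(epochs):
        for state, action, next_state_real in loader:
            target = next_state_real - sim_step(state, action)  # the error the sim made
            pred = model(state, action)
            loss = loss_fn(pred, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model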

Infrastructure: The Evaluation Loop

We need a continuous loop that runs every night.

[ Real Robot Lab ]
       |
       +---> [ Daily Log Upload ] ---> [ S3: /logs/real ]
                                         |
[ Eval Cluster ] <-----------------------+
       |
       +---> [ Run Replay in Sim ] ---> [ S3: /logs/sim ]
       |
       +---> [ Compute FID / NRMSE ]
       |
       v
[ Dashboard (Grafana) ]
   "Reality Gap: 12.5 (Good)"
   "Friction Est: 0.65"

Troubleshooting: “My Simulator is Perfect” (It is not)

Scenario 1: The “Uncanny Valley” of Physics

  • Symptom: You modeled every gear and screw. SysID error is still high.
  • Cause: Unmodeled dynamics. e.g., Cable drag (wires pulling the arm), grease viscosity changing with temperature.
  • Fix: Add a “Residual Network” Term. $NextState = Sim(s, a) + NN(s, a)$. The NN learns the unmodeled physics.

Scenario 2: Sensor Noise Mismatch

  • Symptom: Sim perfectly tracks Real robot position, but Policy fails.
  • Cause: Real sensors have Gaussian Noise. Sim sensors are perfect.
  • Fix: Inject noise in Sim, e.g. obs = obs + np.random.normal(0, 0.01, size=obs.shape). Tune the noise magnitude to match the real sensor datasheets.

Scenario 3: The Overfitted Residual

  • Symptom: Residual Network fixes the error on training set, but robot goes crazy in new poses.
  • Cause: The NN learned to memorize the trajectory errors rather than the physics.
  • Fix: Regularize the Residual Network. Keep it small (2 layers). Add dropout.

Future Trends: Real-to-Sim GAN

Instead of tuning Sim to match Real manually, train a CycleGAN to translate Sim images into “Real-style” images.

  • Train Policy on “Sim-Translated-to-Real” images.
  • This closes the Visual Gap automatically.

MLOps Interview Questions

  1. Q: Why use InceptionV3 for FID? Why not ResNet? A: Convention. InceptionV3 was trained on ImageNet and captures high-level semantics well. Changing the backbone breaks comparability with literature.

  2. Q: What is “Domain Adaptation” vs “Domain Randomization”? A: Randomization: Make Sim diverse so Real is a subset. Adaptation: Make Sim look like Real (Sim2Real GAN) or make Real look like Sim (Canonicalization).

  3. Q: Can you do SysID Online? A: Yes. “Adaptive Control”. The robot estimates mass/friction while moving. If it feels heavy, it updates its internal model $\hat{m}$ and increases torque.

  4. Q: How do you handle “Soft deformable objects” in Sim? A: Extremely hard. Cloth/Fluids are computationally expensive. Usually we don’t Sim them; we learn a Policy that is robust to their deformation (by randomizing visual shape).

  5. Q: What is a “Golden Run”? A: A verified Real World trajectory that we treat as Ground Truth. We replay this exact sequence against every new Simulator version as a regression test (see the sketch below).
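
A golden run naturally becomes a CI regression test. A pytest-style sketch, reusing the hypothetical run_sim_trajectory wrapper and the align_and_score routine sketched earlier; the log path and its keys are assumptions:

import numpy as np

GOLDEN_RUN = "logs/golden/pick_place_v1.npz"   # hypothetical verified real-world log
NRMSE_BUDGET = 0.10                            # example regression budget

def test_golden_run_regression():
    """Replay the golden run's controls in the current sim build and assert
    the trajectory error stays within budget."""
    log = np.load(GOLDEN_RUN)                                      # assumed keys: params, controls, positions
    sim_traj = run_sim_trajectory(log["params"], log["controls"])  # hypothetical sim wrapper (see SysID sketch)
    nrmse, _ = align_and_score(log["positions"], sim_traj)         # from the alignment sketch above
    assert nrmse < NRMSE_BUDGET, f"Simulator regression: NRMSE={nrmse:.3f}"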


Glossary

  • FID (Frechet Inception Distance): Metric for distance between image distributions.
  • SysID (System Identification): Determining physical parameters (mass, friction) from observed data.
  • CMA-ES: Covariance Matrix Adaptation Evolution Strategy. A derivative-free optimization algorithm.
  • Residual Physics: Using ML to predict the error of a physics engine.

Summary Checklist

  1. Data: Collect at least 1000 real-world images for a stable FID baseline.
  2. Alignment: Implement cross-correlation time alignment for trajectory comparison.
  3. Baselines: Measure the “Gap” of a random policy. Your trained policy gap should be significantly lower.
  4. Thresholds: Set a “Red Light” CI threshold. If $Gap > 15\%$, block the deployment.
  5. Calibration: Calibrate your Real Robot sensors (Camera intrinsics) monthly. Bad calibration = Artificial Gap.