41.2. Domain Randomization & Synthetic Data
Status: Draft Version: 1.0.0 Tags: #Sim2Real, #DataGen, #Python, #ZeroMQ, #ComputerVision Author: MLOps Team
Table of Contents
- The “Reality Gap” Dilemma
- Taxonomy of Randomization
- Configuration as Code: The DR Schema
- Python Implementation: Remote Control DataGen
- Unity Side: The Command Listener
- Visual vs Dynamics Randomization
- Infrastructure: Massive Parallel Data Generation
- Troubleshooting: Common Artifacts
- Future Trends: Differentiable Simulation
- MLOps Interview Questions
- Glossary
- Summary Checklist
Prerequisites
Before diving into this chapter, ensure you have the following installed:
- Python: pyzmq (ZeroMQ), pydantic
- Unity: A scene with a movable object.
The “Reality Gap” Dilemma
If you train a Robot Arm to pick up a Red Cube in a White Room, and then deploy it to a Red Cube in a Beige Room, it fails. Neural Networks overfit to the simulator’s specific rendering artifacts and physics biases.
Solution: Domain Randomization (DR). Instead of trying to make the simulation perfect (Photorealism), we make it diverse. We randomize textures, lighting, camera angles, friction, and mass. If the model sees 10,000 variations, the “Real World” just becomes the 10,001st variation.
Taxonomy of Randomization
- Visual Randomization: Changing colors, textures, lighting intensity, glare.
- Goal: Invariance to lighting conditions.
- Dynamics Randomization: Changing mass, friction, damping, joint limits.
- Goal: Robustness to hardware wear and tear.
- Procedural Generation: Changing the topology of the world (Room dimensions, Obstacle placement).
- Goal: Generalization to new environments.
Configuration as Code: The DR Schema
We define the randomization distribution in a JSON/YAML file. This is our “Dataset Definition”.
from pydantic import BaseModel, Field
from typing import List, Tuple
class LightConfig(BaseModel):
    # Tuple[min, max]
    intensity_range: Tuple[float, float] = (0.5, 2.0)
    # Hue jitter amount (0.0 = no color change, 1.0 = full rainbow)
    color_hsv_jitter: float = 0.1

class ObjectConfig(BaseModel):
    # Dynamic properties are critical for contact-rich tasks
    mass_range: Tuple[float, float] = (0.1, 5.0)
    friction_range: Tuple[float, float] = (0.5, 0.9)
    # Visual properties
    scale_range: Tuple[float, float] = (0.8, 1.2)
    # How many distractor objects to spawn
    distractor_count: int = 5

class ScenarioConfig(BaseModel):
    version: str = "1.0.0"
    seed: int = 42
    lighting: LightConfig
    objects: ObjectConfig
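As a quick usage sketch (assuming Pydantic v2; the file name scenario.json is illustrative), the schema can be dumped to disk as the dataset definition and re-validated on load:

# Illustrative round-trip of the dataset definition (assumes Pydantic v2).
from schema import ScenarioConfig, LightConfig, ObjectConfig

config = ScenarioConfig(lighting=LightConfig(), objects=ObjectConfig())

# Serialize the "Dataset Definition" so it can be versioned alongside the data.
with open("scenario.json", "w") as f:
    f.write(config.model_dump_json(indent=2))

# Reload and validate. Pydantic rejects malformed or out-of-schema files.
with open("scenario.json") as f:
    loaded = ScenarioConfig.model_validate_json(f.read())

assert loaded.objects.mass_range == (0.1, 5.0)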
Python Implementation: Remote Control DataGen
We don’t want to write C# logic for MLOps. We want to control Unity from Python. We use ZeroMQ (Request-Reply pattern).
Project Structure
datagen/
├── main.py
├── schema.py
└── client.py
client.py:
import json
import random

import zmq

from schema import ScenarioConfig

class SimClient:
    """
    SimClient acts as the 'God Mode' controller for the simulation.
    It tells Unity exactly what to spawn and where.
    """
    def __init__(self, port: int = 5555):
        self.context = zmq.Context()
        self.socket = self.context.socket(zmq.REQ)
        # Unity runs inside Docker, mapped to localhost:5555
        self.socket.connect(f"tcp://localhost:{port}")

    def send_command(self, cmd: str, data: dict) -> dict:
        payload = json.dumps({"command": cmd, "data": data})
        self.socket.send_string(payload)
        # Blocking wait for Unity to confirm.
        # This ensures frame-perfect synchronization.
        reply = self.socket.recv_string()
        return json.loads(reply)

    def randomize_scene(self, config: ScenarioConfig):
        rng = random.Random(config.seed)

        # 1. Randomize lights: sample intensity from the configured range
        self.send_command("set_lighting", {
            "intensity": rng.uniform(*config.lighting.intensity_range),
            "color": [1.0, 0.9, 0.8]
        })

        # 2. Spawn objects with randomized mass and position
        for i in range(config.objects.distractor_count):
            self.send_command("spawn_object", {
                "id": i,
                "type": "cube",
                "mass": rng.uniform(*config.objects.mass_range),
                # Illustrative random placement; tune ranges per scene.
                "position": [rng.uniform(-1, 1), 0.5, rng.uniform(-1, 1)]
            })

        # 3. Capture frame
        # After randomization is applied, we take the photo.
        return self.send_command("capture_frame", {})
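The project layout above lists a main.py driver that is not shown; a minimal sketch of what it might look like, assuming the schema and client modules above (the episode count is arbitrary):

# main.py -- hypothetical driver script tying the schema and client together.
from schema import ScenarioConfig, LightConfig, ObjectConfig
from client import SimClient

def main(episodes: int = 100):
    client = SimClient(port=5555)
    for episode in range(episodes):
        # One config per episode; the seed makes the run reproducible.
        config = ScenarioConfig(
            seed=episode,
            lighting=LightConfig(),
            objects=ObjectConfig(),
        )
        reply = client.randomize_scene(config)
        print(f"Episode {episode}: captured {reply.get('path')}")

if __name__ == "__main__":
    main()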
Unity Side: The Command Listener
In Unity, we attach a C# script to a GameObject that listens on port 5555.
using UnityEngine;
using NetMQ;
using NetMQ.Sockets;
using Newtonsoft.Json.Linq;
using System.IO;
// Requires AsyncIO and NetMQ DLLs in the Plugins folder
public class ZeroMQListener : MonoBehaviour
{
    private ResponseSocket server;
    public Light sceneLight;
    private bool running = true;

    void Start()
    {
        // Required for NetMQ initialization on some platforms
        AsyncIO.ForceDotNet.Force();
        server = new ResponseSocket("@tcp://*:5555");
        Debug.Log("ZeroMQ Listener started on port 5555");
    }

    void Update()
    {
        if (!running) return;

        // Non-blocking poll in the game loop.
        // We handle one request per frame to ensure stability.
        string message = null;
        if (server.TryReceiveFrameString(out message))
        {
            var json = JObject.Parse(message);
            string cmd = (string)json["command"];

            if (cmd == "set_lighting")
            {
                float intensity = (float)json["data"]["intensity"];
                sceneLight.intensity = intensity;
                // Acknowledge receipt
                server.SendFrame("{\"status\": \"ok\"}");
            }
            else if (cmd == "capture_frame")
            {
                // Trigger ScreenCapture
                // Note: Capturing usually takes 1 frame to render
                string path = Path.Combine(Application.persistentDataPath, "img_0.png");
                ScreenCapture.CaptureScreenshot(path);
                server.SendFrame($"{{\"path\": \"{path}\"}}");
            }
            else
            {
                // Handlers for the remaining commands (e.g. spawn_object) go here.
                server.SendFrame("{\"error\": \"unknown_command\"}");
            }
        }
    }

    void OnDestroy()
    {
        running = false;
        server?.Dispose();
        NetMQConfig.Cleanup();
    }
}
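It can be useful to smoke-test the Python client before involving Unity at all. A minimal sketch of a stand-in REP server that speaks the same JSON protocol (the file name fake_unity.py and the fake image path are assumptions, not part of the project layout):

# fake_unity.py -- hypothetical stub that mimics the Unity listener for local testing.
import json
import zmq

def serve(port: int = 5555):
    context = zmq.Context()
    socket = context.socket(zmq.REP)
    socket.bind(f"tcp://*:{port}")
    while True:
        request = json.loads(socket.recv_string())
        cmd = request["command"]
        if cmd == "capture_frame":
            # Pretend a frame was written to disk.
            socket.send_string(json.dumps({"path": "/tmp/img_0.png"}))
        else:
            # Accept set_lighting / spawn_object without doing anything.
            socket.send_string(json.dumps({"status": "ok"}))

if __name__ == "__main__":
    serve()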
Visual vs Dynamics Randomization
Visual (Texture Swapping)
- Technique: Use MaterialPropertyBlock in Unity to change colors without creating new materials (avoids GC).
- Advanced: Use “Triplanar Mapping” shaders so textures don’t stretch when we scale objects.
Dynamics (Physics Fuzzing)
- Technique: Modify Rigidbody.mass and PhysicMaterial.dynamicFriction at the start of every episode.
- Danger: If you randomize gravity to be negative, the robot flies away.
- Bounds: Always sanity-check random values: Mass > 0, Friction in [0, 1]. See the sampling sketch below.
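A small sketch of episode-level sampling with explicit bounds checking, tying the visual and dynamics ranges back to the config above (the clamp limits and the warm-white base hue are illustrative choices):

# Hypothetical episode-level sampler with sanity bounds.
import colorsys
import random

def clamp(value: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, value))

def sample_dynamics(rng: random.Random, mass_range, friction_range) -> dict:
    # Mass must stay strictly positive; friction is clamped to [0, 1].
    return {
        "mass": max(rng.uniform(*mass_range), 0.001),
        "friction": clamp(rng.uniform(*friction_range), 0.0, 1.0),
    }

def sample_light_color(rng: random.Random, hsv_jitter: float) -> list:
    # Jitter the hue around a warm white; saturation and value stay fixed.
    hue = (0.1 + rng.uniform(-hsv_jitter, hsv_jitter)) % 1.0
    return [round(c, 3) for c in colorsys.hsv_to_rgb(hue, 0.2, 1.0)]

rng = random.Random(42)
print(sample_dynamics(rng, mass_range=(0.1, 5.0), friction_range=(0.5, 0.9)))
print(sample_light_color(rng, hsv_jitter=0.1))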
Infrastructure: Massive Parallel Data Generation
Generating 1 Million synthetic images on a laptop takes forever. We scale out using Kubernetes Jobs.
[ Orchestrator (Python) ]
|
+---> [ Job 1: Seed 0-1000 ] --> [ Unity Pod ] --> [ S3 Bucket /batch_1 ]
|
+---> [ Job 2: Seed 1000-2000 ] --> [ Unity Pod ] --> [ S3 Bucket /batch_2 ]
|
...
+---> [ Job N ]
Key Requirement: Deterministic Seeding.
Job 2 MUST produce data that is distinct from Job 1's.
Seed = JobIndex * 1000 + EpisodeIndex.
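A sketch of how a worker pod might derive its seeds, assuming the orchestrator injects a JOB_INDEX environment variable (the variable name and batch size are assumptions):

# Hypothetical seeding logic inside a Kubernetes Job pod.
import os

from schema import ScenarioConfig, LightConfig, ObjectConfig

job_index = int(os.environ.get("JOB_INDEX", "0"))  # injected by the orchestrator
episodes_per_job = 1000

for episode_index in range(episodes_per_job):
    # Seeds never collide across jobs, so batches are disjoint and reproducible.
    seed = job_index * episodes_per_job + episode_index
    config = ScenarioConfig(seed=seed, lighting=LightConfig(), objects=ObjectConfig())
    # randomize_scene(config) and the upload to the batch's S3 prefix would go here.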
Troubleshooting: Common Artifacts
Scenario 1: The “Disco Effect” (Epilepsy)
- Symptom: The robot sees a world that changes colors every frame.
- Cause: You are randomizing visuals every timestep (Update()) instead of every episode (OnEpisodeStart()).
- Fix: Only randomize visuals when the environment resets. Dynamics can be randomized continually (to simulate wind), but visuals usually shouldn’t flicker.
Scenario 2: Physics Explosion
- Symptom: Objects fly violently apart at $t=0$.
- Cause: You spawned objects overlapping each other. The Physics Engine resolves the collision by applying infinite force.
- Fix: Use “Poisson Disk Sampling” to place objects with a guaranteed minimum distance (see the placement sketch below), or set Physics.autoSimulation = false until placement is verified.
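A lightweight placement sketch using rejection sampling with a minimum-distance check, a simpler stand-in for full Poisson disk sampling (bounds, radius, and retry budget are illustrative):

# Hypothetical overlap-free placement via rejection sampling.
import math
import random

def place_objects(rng: random.Random, count: int, min_dist: float = 0.3,
                  bound: float = 2.0, max_tries: int = 1000) -> list:
    positions = []
    tries = 0
    while len(positions) < count and tries < max_tries:
        tries += 1
        candidate = (rng.uniform(-bound, bound), rng.uniform(-bound, bound))
        # Reject candidates that would overlap an already-placed object.
        if all(math.dist(candidate, p) >= min_dist for p in positions):
            positions.append(candidate)
    return positions

print(place_objects(random.Random(0), count=5))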
Scenario 3: The Material Leak
- Symptom: Memory usage grows by 100MB per episode. OOM after 1 hour.
- Cause: GetComponent<Renderer>().material.color = Random.ColorHSV(). Accessing .material creates a copy of the material, and Unity does not garbage collect materials automatically.
- Fix: Use GetComponent<Renderer>().SetPropertyBlock(mpb) instead of modifying materials directly, or call Resources.UnloadUnusedAssets() periodically.
Scenario 4: Z-Fighting
- Symptom: Flickering textures where the floor meets the wall.
- Cause: Two planes occupy the exact same coordinate.
- Fix: Randomize positions with a small epsilon (0.001). Add “jitter” to everything.
Future Trends: Differentiable Simulation
DR is a “Black Box” approach: we guess the distributions. Differentiable Physics (Brax, Dojo) lets us backpropagate through the physics engine itself: with $Loss = (RealWorld - SimWorld)^2$, the gradient $\nabla_{friction} Loss$ tells us exactly how to tune the simulator friction to match reality.
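As a toy illustration of the idea, consider an analytically differentiable “simulator” (a block sliding to rest, with stopping distance $d = v_0^2 / (2 \mu g)$) instead of a real engine like Brax or Dojo; the measured distance and learning rate are made up:

# Toy system identification: fit friction mu by gradient descent on (d_real - d_sim)^2.
G = 9.81
V0 = 2.0            # initial slide velocity (m/s), illustrative
D_REAL = 0.68       # "measured" stopping distance (m), illustrative

def sim_distance(mu: float) -> float:
    # Closed-form stopping distance under Coulomb friction: d = v0^2 / (2 * mu * g)
    return V0 ** 2 / (2.0 * mu * G)

def grad_loss(mu: float) -> float:
    # dL/dmu with L = (sim - real)^2 and d(sim)/dmu = -v0^2 / (2 * g * mu^2)
    d_sim = sim_distance(mu)
    return 2.0 * (d_sim - D_REAL) * (-V0 ** 2 / (2.0 * G * mu ** 2))

mu = 0.2  # initial guess
for step in range(200):
    mu -= 0.05 * grad_loss(mu)

print(f"fitted mu = {mu:.3f}, sim distance = {sim_distance(mu):.3f} m (target {D_REAL} m)")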
MLOps Interview Questions
- Q: What is “Curriculum Learning” in DR? A: Start with easy randomization (gravity = 9.8, friction = 0.5). Once the robot learns, expand the ranges to [5.0, 15.0] and [0.1, 0.9]. This prevents the agent from failing early and learning nothing (see the sketch after this list).
- Q: How do you validate Synthetic Data? A: Train a model on Synthetic. Test it on Real (a small validation set). If performance correlates, your data is good. If not, you have a “Sim2Real Gap”.
- Q: Explain “Automatic Domain Randomization” (ADR). A: An RL algorithm (like OpenAI used for the Rubik’s Cube) that automatically expands the randomization bounds as the agent gets better. It removes the need for manual tuning.
- Q: Why ZeroMQ over HTTP? A: Latency and overhead. HTTP (JSON/REST) creates a new connection per request. ZeroMQ keeps a persistent TCP connection and packs binary frames. For 60 Hz control, HTTP is too slow.
- Q: How do you handle “Transparent Objects”? A: Depth sensors fail on glass; simulation renders glass perfectly. To match reality, we must introduce “Sensor Noise” models that simulate the failure modes of RealSense cameras on transparent surfaces.
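A minimal sketch of the curriculum idea from the first question, assuming an evaluation hook reports a success rate; the threshold and widening step sizes are arbitrary:

# Hypothetical curriculum controller that widens randomization bounds over time.
def widen(bounds, delta, hard_min, hard_max):
    lo, hi = bounds
    return (max(hard_min, lo - delta), min(hard_max, hi + delta))

gravity_range = (9.8, 9.8)      # start with no randomization
friction_range = (0.5, 0.5)

def on_evaluation(success_rate: float):
    global gravity_range, friction_range
    # Only widen the ranges once the agent is doing well on the current ones.
    if success_rate > 0.8:
        gravity_range = widen(gravity_range, delta=1.0, hard_min=5.0, hard_max=15.0)
        friction_range = widen(friction_range, delta=0.1, hard_min=0.1, hard_max=0.9)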
Glossary
- DR (Domain Randomization): Varying simulation parameters to improve generalization.
- Sim2Real Gap: The drop in performance when moving from Sim to Physical world.
- ZeroMQ: High-performance asynchronous messaging library.
- MaterialPropertyBlock: Unity API for efficient per-object material overrides.
- Differentiable Physics: A physics engine where every operation is differentiable (like PyTorch).
Summary Checklist
- Protocol: Use Protobuf or Flatbuffers over ZeroMQ for type safety, not raw JSON.
- Halt Physics: Pause simulation (Time.timeScale = 0) while applying randomization to prevent physics glitches during setup.
- Metadata: Save the JSON config alongside the image: img_0.png + img_0.json (contains pose, mass, lighting). See the sketch below.
- Distribution: Use Beta distributions instead of Uniform for randomization. Reality is rarely Uniform.
- Sanity Check: Always render a “Human View” occasionally to verify the randomization doesn’t look broken (e.g. a black sky).
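A sketch of the metadata sidecar mentioned in the checklist, assuming Pydantic v2 and that the capture reply supplies the image path (the extras dict and pose values are illustrative):

# Hypothetical sidecar writer: img_0.png gets a matching img_0.json.
import json
from pathlib import Path

def save_sidecar(image_path: str, config, extras: dict) -> Path:
    sidecar = Path(image_path).with_suffix(".json")
    record = {
        "image": Path(image_path).name,
        "scenario": config.model_dump(),   # the full randomization config (Pydantic v2)
        **extras,                          # e.g. object poses reported by the simulator
    }
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar

# save_sidecar("/data/img_0.png", config, {"pose": [0.0, 0.5, 0.0]})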