
38.1. Environment Versioning & Sim2Real: The Foundation of RLOps

Status: Draft Version: 1.0.0 Tags: #RLOps, #Sim2Real, #Docker, #Rust Author: MLOps Team


Table of Contents

  1. The Theory of Sim2Real
  2. The Dependency Hell of RL
  3. Determinism: The “Seeding” Problem
  4. Sim2Real: Crossing the Gap
  5. Advanced: MuJoCo XML Templating
  6. Regression Testing for Simulators
  7. Infrastructure: Headless Rendering with EGL
  8. Glossary
  9. Summary Checklist

Prerequisites

Before diving into this chapter, ensure you have the following installed:

  • Rust (Cargo/Rustc): 1.70+
  • Docker: 20.10+ with NVIDIA Container Runtime
  • Python: 3.10+ (for glue code)
  • MuJoCo Key: (No longer required as of 2022!)

The Theory of Sim2Real

In Supervised Learning, the dataset is static: a fixed folder of JPEGs. In Reinforcement Learning, the dataset is dynamic: it is generated on-the-fly by a Simulator. Therefore, The Simulator IS The Dataset.

If you change the simulator (physics engine update, friction coefficient), you have technically changed the dataset. Models trained on Sim v1.0 will fail on Sim v1.1. This is the First Law of RLOps.

The Mathematical Formulation

We aim to train a policy $\pi_\theta$ that maximizes reward under randomness $\xi$ (the simulator parameters).

$$ J(\theta) = \mathbb{E}_{\xi \sim P(\xi)} \left[ \mathbb{E}_{\tau \sim \pi_\theta, \mathcal{E}(\xi)} \left[ \sum_t \gamma^t r_t \right] \right] $$

Where:

  • $\xi$: Physics parameters (mass, friction, damping).
  • $P(\xi)$: The distribution of these parameters (Domain Randomization).
  • $\mathcal{E}(\xi)$: The environment configured with parameters $\xi$.

If $P(\xi)$ is wide enough to cover the true parameters $\xi_{real}$, then the policy should transfer zero-shot. This is the Robustness Hypothesis.
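
To make this concrete, here is a minimal Rust sketch of estimating $J(\theta)$ by Monte Carlo: sample parameters $\xi \sim P(\xi)$, roll out the policy in $\mathcal{E}(\xi)$, and average the discounted returns. The sample_params and rollout functions are hypothetical stand-ins for your own simulator bindings.

use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

/// Physics parameters xi (mass, friction).
struct Params { mass: f64, friction: f64 }

/// Hypothetical: sample xi ~ P(xi) for Domain Randomization.
fn sample_params(rng: &mut StdRng) -> Params {
    Params {
        mass: rng.gen_range(0.8..1.2),
        friction: rng.gen_range(0.5..1.5),
    }
}

/// Hypothetical: roll out pi_theta in E(xi) and return the per-step rewards.
fn rollout(_params: &Params) -> Vec<f64> {
    vec![1.0; 100] // placeholder trajectory
}

/// Monte Carlo estimate of J(theta) over n sampled environments.
fn estimate_return(n: usize, gamma: f64, seed: u64) -> f64 {
    let mut rng = StdRng::seed_from_u64(seed);
    let mut total = 0.0;
    for _ in 0..n {
        let params = sample_params(&mut rng);
        let rewards = rollout(&params);
        // Discounted return: sum_t gamma^t * r_t
        total += rewards
            .iter()
            .enumerate()
            .map(|(t, r)| gamma.powi(t as i32) * r)
            .sum::<f64>();
    }
    total / n as f64
}

fn main() {
    println!("J(theta) ~ {:.2}", estimate_return(32, 0.99, 42));
}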


The Dependency Hell of RL

Simulators are notoriously fragile. They depend on a precarious stack of binaries.

+---------------------------------------------------+
|               RL Agent (Python/Rust)              |
+---------------------------------------------------+
|            Gym / DM_Control Bindings              |
+---------------------------------------------------+
|               Physics Engine (C++)                |
|           (MuJoCo, Bullet, PhysX)                 |
+---------------------------------------------------+
|              Rendering (OpenGL/EGL)               |
+---------------------------------------------------+
|              GPU Driver (Nvidia)                  |
+---------------------------------------------------+
|               Operating System                    |
+---------------------------------------------------+

If you update your Nvidia driver, your RL agent might stop learning because the rendering of the “State” changed slightly (e.g., a shadow moved by 1 pixel).

Solution: Dockerize the Environment

Do not rely on pip install gym. Build a monolithic container that freezes the physics engine and rendering stack.

# Dockerfile.rl_env
# Use a specific hash for reproducibility to prevent "latest" breaking things
FROM nvidia/opengl:1.2-glvnd-runtime-ubuntu20.04@sha256:d83d...

# -----------------------------------------------------------------------------
# 1. Install System Dependencies (The "Hidden" State)
# -----------------------------------------------------------------------------
RUN apt-get update && apt-get install -y \
    libosmesa6-dev \
    libgl1-mesa-glx \
    libglfw3 \
    patchelf \
    git \
    python3-pip \
    unzip \
    wget \
    ffmpeg

# -----------------------------------------------------------------------------
# 2. Install Physics Engine (MuJoCo 2.1.0)
# Note: Use a locked version. Do not download "latest".
# -----------------------------------------------------------------------------
WORKDIR /root
RUN mkdir -p .mujoco \
    && wget https://github.com/deepmind/mujoco/releases/download/2.1.0/mujoco210-linux-x86_64.tar.gz \
    && tar -xzf mujoco210-linux-x86_64.tar.gz -C .mujoco \
    && rm mujoco210-linux-x86_64.tar.gz

# Add to path
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/.mujoco/mujoco210/bin

# -----------------------------------------------------------------------------
# 3. Install Python Deps
# -----------------------------------------------------------------------------
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# -----------------------------------------------------------------------------
# 4. Copy Environment Code
# -----------------------------------------------------------------------------
COPY ./param_env /app/param_env
WORKDIR /app

# -----------------------------------------------------------------------------
# 5. Entrypoint
# Ensure EGL is used for headless rendering (Servers have no Monitor)
# -----------------------------------------------------------------------------
ENV MUJOCO_GL="egl"
CMD ["python3", "evaluate_policy.py"]

Determinism: The “Seeding” Problem

An RL experiment must be reproducible. If I run the same seed twice, I must get the exact same sequence of states.

Rust Wrapper for Deterministic Gym

Standard OpenAI Gym env.seed(42) is often insufficient because of floating point non-determinism in parallel physics solves. We create a robust Rust crate for this.

Project Structure:

rl-sim/
├── Cargo.toml
├── src/
│   ├── lib.rs
│   ├── env.rs
│   └── main.rs
└── Dockerfile

Cargo.toml:

[package]
name = "rl-sim"
version = "0.1.0"
edition = "2021"

[dependencies]
rand = "0.8"
rand_chacha = "0.3" # CSPRNG for reproducibility
rand_distr = "0.4" # Distributions for Domain Randomization
sha2 = "0.10" # For state hashing
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0" # Canonical state serialization for hashing
serde_yaml = "0.9" # For Configs
tera = "1" # Jinja2-style templating for MJCF
anyhow = "1.0"

src/env.rs:

use rand::{Rng, SeedableRng};
use rand_chacha::ChaCha8Rng;
use sha2::{Digest, Sha256};
use serde::Serialize;

#[derive(Debug, Default, Serialize, Clone)]
pub struct State {
    qpos: Vec<f64>,
    qvel: Vec<f64>,
    observation: Vec<f64>,
}

pub struct DeterministicEnv {
    seed: u64,
    rng: ChaCha8Rng,
    step_count: u64,
}

impl DeterministicEnv {
    pub fn new(seed: u64) -> Self {
        Self {
            seed,
            rng: ChaCha8Rng::seed_from_u64(seed),
            step_count: 0,
        }
    }

    pub fn reset(&mut self) -> State {
        println!("Resetting Env with seed {}", self.seed);
        // Reset the internal simulator and re-seed the RNG so that
        // repeated resets replay the exact same episode.
        // In a real binding, you would call: mujoco_rs::reset(self.seed);
        self.rng = ChaCha8Rng::seed_from_u64(self.seed);
        self.step_count = 0;
        State::default()
    }

    // Advance the simulation by one step and return (state, reward, done).
    // In a real binding, this would forward the action to the physics engine.
    pub fn step(&mut self, action: u64) -> (State, f64, bool) {
        self.step_count += 1;
        let noise: f64 = self.rng.gen();
        let next_state = State {
            qpos: vec![noise],
            qvel: vec![noise * 0.1],
            observation: vec![action as f64, noise],
        };
        let reward = 0.0;
        let done = self.step_count >= 1_000;
        (next_state, reward, done)
    }

    // Hash check to verify synchronization
    // This is the "Merkle Tree" of your RL Episode
    pub fn state_hash(&self, state: &State) -> String {
        let mut hasher = Sha256::new();
        // Canonical serialization is crucial here!
        // Floating point variations must be handled.
        let serialized = serde_json::to_string(state).unwrap();
        hasher.update(serialized);
        format!("{:x}", hasher.finalize())
    }
}
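
A quick usage sketch (assuming lib.rs exposes the module as pub mod env): two environments constructed with the same seed must produce identical state hashes after replaying the same action sequence.

use rl_sim::env::DeterministicEnv;

fn main() {
    let mut env_a = DeterministicEnv::new(42);
    let mut env_b = DeterministicEnv::new(42);

    let mut state_a = env_a.reset();
    let mut state_b = env_b.reset();

    // Replay the same actions in both environments.
    for action in [0u64, 1, 1, 0, 1] {
        state_a = env_a.step(action).0;
        state_b = env_b.step(action).0;
    }

    // Same seed + same actions => same hash. If this ever fails,
    // something in the stack (driver, engine, RNG) has changed.
    assert_eq!(env_a.state_hash(&state_a), env_b.state_hash(&state_b));
    println!("Determinism check passed");
}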

Sim2Real: Crossing the Gap

A policy trained in a perfect simulation (Friction=1.0, Mass=1.0) will often fail on a real robot (Friction=0.9, Mass=1.05). This performance drop, caused by the policy overfitting to the simulation's idealizations, is the Sim2Real Gap.

Strategy 1: Domain Randomization (DR)

Instead of letting the agent master a single world, we force it to learn across many worlds. If the agent learns to walk with $\text{Mass} \in [0.8, 1.2]$, it will likely also walk on the real robot (Mass=1.05).

Configuration Schema for DR

Manage DR distributions as config files, not hardcoded numbers.

# domain_rand_v3.yaml
randomization:
  physics:
    friction:
      distribution: "uniform"
      range: [0.5, 1.5]
    gravity:
      distribution: "normal"
      mean: -9.81
      std: 0.1
    mass_scaling:
      distribution: "log_uniform"
      range: [0.8, 1.5]
  visual:
    lighting_noise: 0.1
    camera_position_perturbation: [0.01, 0.01, 0.01]
    texture_swap_prob: 0.5
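
A minimal sketch of loading this schema with the serde_yaml dependency from Cargo.toml. The struct and field names below simply mirror the YAML above; only the physics block is typed, and the visual block is kept opaque for brevity.

use serde::Deserialize;
use std::collections::HashMap;

#[derive(Deserialize, Debug)]
struct ParamSpec {
    distribution: String,
    range: Option<(f64, f64)>,
    mean: Option<f64>,
    std: Option<f64>,
}

#[derive(Deserialize, Debug)]
struct Randomization {
    physics: HashMap<String, ParamSpec>,
    visual: serde_yaml::Value, // not typed yet
}

#[derive(Deserialize, Debug)]
struct DomainRandConfig {
    randomization: Randomization,
}

fn load_config(path: &str) -> anyhow::Result<DomainRandConfig> {
    let text = std::fs::read_to_string(path)?;
    Ok(serde_yaml::from_str(&text)?)
}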

Rust Implementation: The Randomizer

use serde::Deserialize;
use rand_distr::{Normal, Uniform, Distribution};

// Simplified subset of the YAML schema above.
#[derive(Deserialize, Debug)]
pub struct PhysicsConfig {
    pub friction_range: (f64, f64),
    pub gravity_std: f64,
}

pub struct EnvironmentRandomizer {
    pub config: PhysicsConfig,
}

impl EnvironmentRandomizer {
    pub fn randomize(&self, sim: &mut SimulatorStub) {
        let mut rng = rand::thread_rng();

        // 1. Sample Friction
        let fric_dist = Uniform::new(self.config.friction_range.0, self.config.friction_range.1);
        let friction = fric_dist.sample(&mut rng);
        sim.set_friction(friction);

        // 2. Sample Gravity
        let grav_dist = Normal::new(-9.81, self.config.gravity_std).unwrap();
        let gravity = grav_dist.sample(&mut rng);
        sim.set_gravity(gravity);

        println!("Randomized Sim: Fric={:.2}, Grav={:.2}", friction, gravity);
    }
}

// Stub for the actual physics engine binding
pub struct SimulatorStub;
impl SimulatorStub {
    pub fn set_friction(&mut self, _friction: f64) {}
    pub fn set_gravity(&mut self, _gravity: f64) {}
}
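
A short usage sketch, assuming the types above are in scope: build the config (in practice deserialized from domain_rand_v3.yaml) and re-randomize the simulator at the start of every episode.

fn main() {
    // In practice this would be loaded from domain_rand_v3.yaml.
    let config = PhysicsConfig {
        friction_range: (0.5, 1.5),
        gravity_std: 0.1,
    };
    let randomizer = EnvironmentRandomizer { config };
    let mut sim = SimulatorStub;

    // Re-sample physics parameters at the start of every episode.
    for _episode in 0..3 {
        randomizer.randomize(&mut sim);
    }
}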

Advanced: MuJoCo XML Templating

Usually, robots are defined in MJCF (XML). To randomize “Arm Length”, you must modify the XML at runtime.

Base Template (robot.xml.j2):

<mujoco model="robot">
  <compiler angle="radian" />
  <worldbody>
    <body name="torso" pos="0 0 {{ torso_height }}">
      <geom type="capsule" size="0.1" />
      <joint name="root" type="free" />
    </body>
  </worldbody>
</mujoco>

Rust XML Processor:

use tera::{Tera, Context};

pub fn generate_mjcf(torso_height: f64) -> String {
    let mut tera = Tera::default();
    tera.add_raw_template("robot", include_str!("robot.xml.j2")).unwrap();
    
    let mut context = Context::new();
    context.insert("torso_height", &torso_height);
    
    tera.render("robot", &context).unwrap()
}
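
Putting the pieces together, the sketch below samples a torso height, renders the template, and writes the generated MJCF to disk for the physics engine to load (the output path is illustrative).

use rand::Rng;

fn randomized_model() -> std::io::Result<()> {
    // Sample a morphology parameter, then render the MJCF template.
    let torso_height = rand::thread_rng().gen_range(0.8..1.2);
    let mjcf = generate_mjcf(torso_height);

    // Hand the generated XML to the physics engine (path is illustrative).
    std::fs::write("/tmp/robot_randomized.xml", mjcf)?;
    Ok(())
}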

Regression Testing for Simulators

Before you start a 1000-GPU training run, verify the simulator hasn’t broken.

  1. Golden Run: Store a trajectory (actions, states) from a known good version.
  2. Regression Test: Replay actions on the new version. Assert states_new == states_old.
#[test]
fn test_simulator_determinism() {
    let mut env = DeterministicEnv::new(42);
    let mut obs = env.reset();
    
    // Load golden actions
    let golden_actions = vec![0, 1, 0, 0, 1]; 
    let expected_final_obs_checksum = "a1b2c3d4...";
    
    for action in golden_actions {
        let (next_obs, ..) = env.step(action);
        obs = next_obs;
    }
    
    let checksum = env.state_hash(&obs);
    assert_eq!(checksum, expected_final_obs_checksum, "Simulator Divergence Detected!");
}
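
The placeholder checksum above has to come from somewhere: record it once from a golden run on a known-good simulator build and commit it next to the actions. A minimal sketch of that recording step (the file layout is illustrative):

use serde::Serialize;

#[derive(Serialize)]
struct GoldenRun {
    seed: u64,
    actions: Vec<u64>,
    final_state_checksum: String,
}

fn record_golden_run(path: &str, actions: Vec<u64>) -> anyhow::Result<()> {
    let mut env = DeterministicEnv::new(42);
    let mut state = env.reset();
    for &action in &actions {
        state = env.step(action).0;
    }
    let golden = GoldenRun {
        seed: 42,
        actions,
        final_state_checksum: env.state_hash(&state),
    };
    // Commit this file; the regression test asserts against it.
    std::fs::write(path, serde_json::to_string_pretty(&golden)?)?;
    Ok(())
}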

Infrastructure: Headless Rendering with EGL

For Vision-based RL, you must render pixels on the server.

  • X11: Hard to manage on servers.
  • EGL: The way to go. Offscreen rendering without a display.

Setup Check:

# Verify EGL visible
nvidia-smi
ls /usr/share/glvnd/egl_vendor.d/
# Should see 10_nvidia.json

Troubleshooting EGL: If you see gladLoadGL error: 0, it often means:

  1. Variable MESA_GL_VERSION_OVERRIDE is missing.
  2. libgl1 is trying to load Software Rasterizer (llvmpipe) instead of Nvidia driver.
  3. Fix: Ensure LD_LIBRARY_PATH points to /usr/lib/nvidia.

Glossary

  • Sim2Real: The process of transferring a simulation-trained policy to the real world.
  • Domain Randomization (DR): Training on a distribution of environments to improve robustness.
  • Dynamics Randomization: Changing Mass, Friction, Damping.
  • Visual Randomization: Changing Textures, Lights, Camera Pose.
  • Curriculum Learning: Gradually increasing the difficulty of the environment (e.g., drift range) during training.
  • MuJoCo: Multi-Joint dynamics with Contact. A popular physics engine.
  • Gym: The standard API (reset/step).

Summary Checklist

  1. Dockerize: Never run RL training on bare metal. Containerize the simulator.
  2. Seed Everything: Simulators, Random Number Generators, and Python Hash seeds.
  3. Golden Tests: Run a regression test on your environment before every training job.
  4. Configurable DR: Move randomization ranges to YAML files.
  5. Headless EGL: Ensure your render pipeline works without a monitor (X11 forwarding is brittle).
  6. Log Versions: When logging to WandB/MLflow, log the docker_image_sha of the environment.
  7. XML Templating: Use Jinja2/Tera to procedurally generate robot morphologies.