
38.1. Environment Versioning & Sim2Real: The Foundation of RLOps

Status: Draft Version: 1.0.0 Tags: #RLOps, #Sim2Real, #Docker, #Rust Author: MLOps Team


Table of Contents

  1. The Theory of Sim2Real
  2. The Dependency Hell of RL
  3. Determinism: The “Seeding” Problem
  4. Sim2Real: Crossing the Gap
  5. Advanced: MuJoCo XML Templating
  6. Regression Testing for Simulators
  7. Infrastructure: Headless Rendering with EGL
  8. Glossary
  9. Summary Checklist

Prerequisites

Before diving into this chapter, ensure you have the following installed:

  • Rust (Cargo/Rustc): 1.70+
  • Docker: 20.10+ with NVIDIA Container Runtime
  • Python: 3.10+ (for glue code)
  • MuJoCo Key: (No longer required as of 2022!)

The Theory of Sim2Real

In Supervised Learning, the dataset is static: a fixed folder of JPEGs. In Reinforcement Learning, the dataset is dynamic: it is generated on-the-fly by a Simulator. Therefore, The Simulator IS The Dataset.

If you change the simulator (physics engine update, friction coefficient), you have technically changed the dataset. Models trained on Sim v1.0 will fail on Sim v1.1. This is the First Law of RLOps.

The Mathematical Formulation

We aim to train a policy $\pi_\theta$ that maximizes reward under randomness $\xi$ (the simulator parameters).

$$ J(\theta) = \mathbb{E}_{\xi \sim P(\xi)} \left[ \mathbb{E}_{\tau \sim \pi_\theta, \mathcal{E}(\xi)} \left[ \sum_t \gamma^t r_t \right] \right] $$

Where:

  • $\xi$: Physics parameters (mass, friction, damping).
  • $P(\xi)$: The distribution of these parameters (Domain Randomization).
  • $\mathcal{E}(\xi)$: The environment configured with parameters $\xi$.

If $P(\xi)$ is wide enough to cover the true parameters $\xi_{real}$, then the policy should transfer zero-shot. This is the Robustness Hypothesis.
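
To make this concrete, here is a minimal Rust sketch of estimating $J(\theta)$ by Monte Carlo: sample parameters $\xi \sim P(\xi)$, roll out the policy in $\mathcal{E}(\xi)$, and average the discounted returns. The sample_params and rollout functions are hypothetical stand-ins for your own simulator bindings.

use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

/// Physics parameters xi (mass, friction).
struct Params { mass: f64, friction: f64 }

/// Hypothetical: sample xi ~ P(xi) for Domain Randomization.
fn sample_params(rng: &mut StdRng) -> Params {
    Params {
        mass: rng.gen_range(0.8..1.2),
        friction: rng.gen_range(0.5..1.5),
    }
}

/// Hypothetical: roll out pi_theta in E(xi) and return the per-step rewards.
fn rollout(_params: &Params) -> Vec<f64> {
    vec![1.0; 100] // placeholder trajectory
}

/// Monte Carlo estimate of J(theta) over n sampled environments.
fn estimate_return(n: usize, gamma: f64, seed: u64) -> f64 {
    let mut rng = StdRng::seed_from_u64(seed);
    let mut total = 0.0;
    for _ in 0..n {
        let params = sample_params(&mut rng);
        let rewards = rollout(&params);
        // Discounted return: sum_t gamma^t * r_t
        total += rewards
            .iter()
            .enumerate()
            .map(|(t, r)| gamma.powi(t as i32) * r)
            .sum::<f64>();
    }
    total / n as f64
}

fn main() {
    println!("J(theta) ~ {:.2}", estimate_return(32, 0.99, 42));
}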


The Dependency Hell of RL

Simulators are notoriously fragile. They depend on a precarious stack of binaries.

+---------------------------------------------------+
|               RL Agent (Python/Rust)              |
+---------------------------------------------------+
|            Gym / DM_Control Bindings              |
+---------------------------------------------------+
|               Physics Engine (C++)                |
|           (MuJoCo, Bullet, PhysX)                 |
+---------------------------------------------------+
|              Rendering (OpenGL/EGL)               |
+---------------------------------------------------+
|              GPU Driver (Nvidia)                  |
+---------------------------------------------------+
|               Operating System                    |
+---------------------------------------------------+

If you update your Nvidia driver, your RL agent might stop learning because the rendering of the “State” changed slightly (e.g., a shadow moved by 1 pixel).

Solution: Dockerize the Environment

Do not rely on pip install gym. Build a monolithic container that freezes the physics engine and rendering stack.

# Dockerfile.rl_env
# Use a specific hash for reproducibility to prevent "latest" breaking things
FROM nvidia/opengl:1.2-glvnd-runtime-ubuntu20.04@sha256:d83d...

# -----------------------------------------------------------------------------
# 1. Install System Dependencies (The "Hidden" State)
# -----------------------------------------------------------------------------
RUN apt-get update && apt-get install -y \
    libosmesa6-dev \
    libgl1-mesa-glx \
    libglfw3 \
    patchelf \
    git \
    python3-pip \
    unzip \
    wget \
    ffmpeg

# -----------------------------------------------------------------------------
# 2. Install Physics Engine (MuJoCo 2.1.0)
# Note: Use a locked version. Do not download "latest".
# -----------------------------------------------------------------------------
WORKDIR /root
RUN mkdir -p .mujoco \
    && wget https://github.com/deepmind/mujoco/releases/download/2.1.0/mujoco210-linux-x86_64.tar.gz \
    && tar -xzf mujoco210-linux-x86_64.tar.gz -C .mujoco \
    && rm mujoco210-linux-x86_64.tar.gz

# Add to path
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/.mujoco/mujoco210/bin

# -----------------------------------------------------------------------------
# 3. Install Python Deps
# -----------------------------------------------------------------------------
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# -----------------------------------------------------------------------------
# 4. Copy Environment Code
# -----------------------------------------------------------------------------
COPY ./param_env /app/param_env
WORKDIR /app

# -----------------------------------------------------------------------------
# 5. Entrypoint
# Ensure EGL is used for headless rendering (Servers have no Monitor)
# -----------------------------------------------------------------------------
ENV MUJOCO_GL="egl"
CMD ["python3", "evaluate_policy.py"]

Determinism: The “Seeding” Problem

An RL experiment must be reproducible. If I run the same seed twice, I must get the exact same sequence of states.

Rust Wrapper for Deterministic Gym

Standard OpenAI Gym env.seed(42) is often insufficient because of floating point non-determinism in parallel physics solves. We create a robust Rust crate for this.

Project Structure:

rl-sim/
├── Cargo.toml
├── src/
│   ├── lib.rs
│   ├── env.rs
│   └── main.rs
└── Dockerfile

Cargo.toml:

[package]
name = "rl-sim"
version = "0.1.0"
edition = "2021"

[dependencies]
rand = "0.8"
rand_chacha = "0.3" # CSPRNG for reproducibility
rand_distr = "0.4" # Distributions for Domain Randomization
sha2 = "0.10" # For state hashing
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0" # Canonical state serialization for hashing
serde_yaml = "0.9" # For Configs
tera = "1" # Jinja2-style templating for MJCF
anyhow = "1.0"

src/env.rs:

use rand::{Rng, SeedableRng};
use rand_chacha::ChaCha8Rng;
use sha2::{Digest, Sha256};
use serde::Serialize;

#[derive(Debug, Default, Serialize, Clone)]
pub struct State {
    qpos: Vec<f64>,
    qvel: Vec<f64>,
    observation: Vec<f64>,
}

pub struct DeterministicEnv {
    seed: u64,
    rng: ChaCha8Rng,
    step_count: u64,
}

impl DeterministicEnv {
    pub fn new(seed: u64) -> Self {
        Self {
            seed,
            rng: ChaCha8Rng::seed_from_u64(seed),
            step_count: 0,
        }
    }

    pub fn reset(&mut self) -> State {
        println!("Resetting Env with seed {}", self.seed);
        // Reset the internal simulator and re-seed the RNG so that
        // repeated resets replay the exact same episode.
        // In a real binding, you would call: mujoco_rs::reset(self.seed);
        self.rng = ChaCha8Rng::seed_from_u64(self.seed);
        self.step_count = 0;
        State::default()
    }

    // Advance the simulation by one step and return (state, reward, done).
    // In a real binding, this would forward the action to the physics engine.
    pub fn step(&mut self, action: u64) -> (State, f64, bool) {
        self.step_count += 1;
        let noise: f64 = self.rng.gen();
        let next_state = State {
            qpos: vec![noise],
            qvel: vec![noise * 0.1],
            observation: vec![action as f64, noise],
        };
        let reward = 0.0;
        let done = self.step_count >= 1_000;
        (next_state, reward, done)
    }

    // Hash check to verify synchronization
    // This is the "Merkle Tree" of your RL Episode
    pub fn state_hash(&self, state: &State) -> String {
        let mut hasher = Sha256::new();
        // Canonical serialization is crucial here!
        // Floating point variations must be handled.
        let serialized = serde_json::to_string(state).unwrap();
        hasher.update(serialized);
        format!("{:x}", hasher.finalize())
    }
}
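
A quick usage sketch (assuming lib.rs exposes the module as pub mod env): two environments constructed with the same seed must produce identical state hashes after replaying the same action sequence.

use rl_sim::env::DeterministicEnv;

fn main() {
    let mut env_a = DeterministicEnv::new(42);
    let mut env_b = DeterministicEnv::new(42);

    let mut state_a = env_a.reset();
    let mut state_b = env_b.reset();

    // Replay the same actions in both environments.
    for action in [0u64, 1, 1, 0, 1] {
        state_a = env_a.step(action).0;
        state_b = env_b.step(action).0;
    }

    // Same seed + same actions => same hash. If this ever fails,
    // something in the stack (driver, engine, RNG) has changed.
    assert_eq!(env_a.state_hash(&state_a), env_b.state_hash(&state_b));
    println!("Determinism check passed");
}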

Sim2Real: Crossing the Gap

A policy trained in a perfect simulation (Friction=1.0, Mass=1.0) will often fail on a real robot (Friction=0.9, Mass=1.05). This performance drop, caused by the policy overfitting to the simulation's idealizations, is the Sim2Real Gap.

Strategy 1: Domain Randomization (DR)

Instead of letting the agent master a single world, we force it to learn across many worlds. If the agent learns to walk with $\text{Mass} \in [0.8, 1.2]$, it will likely also walk on the real robot (Mass=1.05).

Configuration Schema for DR

Manage DR distributions as config files, not hardcoded numbers.

# domain_rand_v3.yaml
randomization:
  physics:
    friction:
      distribution: "uniform"
      range: [0.5, 1.5]
    gravity:
      distribution: "normal"
      mean: -9.81
      std: 0.1
    mass_scaling:
      distribution: "log_uniform"
      range: [0.8, 1.5]
  visual:
    lighting_noise: 0.1
    camera_position_perturbation: [0.01, 0.01, 0.01]
    texture_swap_prob: 0.5
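
A minimal sketch of loading this schema with the serde_yaml dependency from Cargo.toml. The struct and field names below simply mirror the YAML above; only the physics block is typed, and the visual block is kept opaque for brevity.

use serde::Deserialize;
use std::collections::HashMap;

#[derive(Deserialize, Debug)]
struct ParamSpec {
    distribution: String,
    range: Option<(f64, f64)>,
    mean: Option<f64>,
    std: Option<f64>,
}

#[derive(Deserialize, Debug)]
struct Randomization {
    physics: HashMap<String, ParamSpec>,
    visual: serde_yaml::Value, // not typed yet
}

#[derive(Deserialize, Debug)]
struct DomainRandConfig {
    randomization: Randomization,
}

fn load_config(path: &str) -> anyhow::Result<DomainRandConfig> {
    let text = std::fs::read_to_string(path)?;
    Ok(serde_yaml::from_str(&text)?)
}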

Rust Implementation: The Randomizer

use serde::Deserialize;
use rand_distr::{Normal, Uniform, Distribution};

// Simplified subset of the YAML schema above.
#[derive(Deserialize, Debug)]
pub struct PhysicsConfig {
    pub friction_range: (f64, f64),
    pub gravity_std: f64,
}

pub struct EnvironmentRandomizer {
    pub config: PhysicsConfig,
}

impl EnvironmentRandomizer {
    pub fn randomize(&self, sim: &mut SimulatorStub) {
        let mut rng = rand::thread_rng();

        // 1. Sample Friction
        let fric_dist = Uniform::new(self.config.friction_range.0, self.config.friction_range.1);
        let friction = fric_dist.sample(&mut rng);
        sim.set_friction(friction);

        // 2. Sample Gravity
        let grav_dist = Normal::new(-9.81, self.config.gravity_std).unwrap();
        let gravity = grav_dist.sample(&mut rng);
        sim.set_gravity(gravity);

        println!("Randomized Sim: Fric={:.2}, Grav={:.2}", friction, gravity);
    }
}

// Stub for the actual physics engine binding
pub struct SimulatorStub;
impl SimulatorStub {
    pub fn set_friction(&mut self, _friction: f64) {}
    pub fn set_gravity(&mut self, _gravity: f64) {}
}
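
A short usage sketch, assuming the types above are in scope: build the config (in practice deserialized from domain_rand_v3.yaml) and re-randomize the simulator at the start of every episode.

fn main() {
    // In practice this would be loaded from domain_rand_v3.yaml.
    let config = PhysicsConfig {
        friction_range: (0.5, 1.5),
        gravity_std: 0.1,
    };
    let randomizer = EnvironmentRandomizer { config };
    let mut sim = SimulatorStub;

    // Re-sample physics parameters at the start of every episode.
    for _episode in 0..3 {
        randomizer.randomize(&mut sim);
    }
}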

Advanced: MuJoCo XML Templating

Usually, robots are defined in MJCF (XML). To randomize “Arm Length”, you must modify the XML at runtime.

Base Template (robot.xml.j2):

<mujoco model="robot">
  <compiler angle="radian" />
  <worldbody>
    <body name="torso" pos="0 0 {{ torso_height }}">
      <geom type="capsule" size="0.1" />
      <joint name="root" type="free" />
    </body>
  </worldbody>
</mujoco>

Rust XML Processor:

use tera::{Tera, Context};

pub fn generate_mjcf(torso_height: f64) -> String {
    let mut tera = Tera::default();
    tera.add_raw_template("robot", include_str!("robot.xml.j2")).unwrap();
    
    let mut context = Context::new();
    context.insert("torso_height", &torso_height);
    
    tera.render("robot", &context).unwrap()
}
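
Putting the pieces together, the sketch below samples a torso height, renders the template, and writes the generated MJCF to disk for the physics engine to load (the output path is illustrative).

use rand::Rng;

fn randomized_model() -> std::io::Result<()> {
    // Sample a morphology parameter, then render the MJCF template.
    let torso_height = rand::thread_rng().gen_range(0.8..1.2);
    let mjcf = generate_mjcf(torso_height);

    // Hand the generated XML to the physics engine (path is illustrative).
    std::fs::write("/tmp/robot_randomized.xml", mjcf)?;
    Ok(())
}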

Regression Testing for Simulators

Before you start a 1000-GPU training run, verify the simulator hasn’t broken.

  1. Golden Run: Store a trajectory (actions, states) from a known good version.
  2. Regression Test: Replay actions on the new version. Assert states_new == states_old.
#[test]
fn test_simulator_determinism() {
    let mut env = DeterministicEnv::new(42);
    let mut obs = env.reset();
    
    // Load golden actions
    let golden_actions = vec![0, 1, 0, 0, 1]; 
    let expected_final_obs_checksum = "a1b2c3d4...";
    
    for action in golden_actions {
        let (next_obs, ..) = env.step(action);
        obs = next_obs;
    }
    
    let checksum = env.state_hash(&obs);
    assert_eq!(checksum, expected_final_obs_checksum, "Simulator Divergence Detected!");
}
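
The placeholder checksum above has to come from somewhere: record it once from a golden run on a known-good simulator build and commit it next to the actions. A minimal sketch of that recording step (the file layout is illustrative):

use serde::Serialize;

#[derive(Serialize)]
struct GoldenRun {
    seed: u64,
    actions: Vec<u64>,
    final_state_checksum: String,
}

fn record_golden_run(path: &str, actions: Vec<u64>) -> anyhow::Result<()> {
    let mut env = DeterministicEnv::new(42);
    let mut state = env.reset();
    for &action in &actions {
        state = env.step(action).0;
    }
    let golden = GoldenRun {
        seed: 42,
        actions,
        final_state_checksum: env.state_hash(&state),
    };
    // Commit this file; the regression test asserts against it.
    std::fs::write(path, serde_json::to_string_pretty(&golden)?)?;
    Ok(())
}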

Infrastructure: Headless Rendering with EGL

For Vision-based RL, you must render pixels on the server.

  • X11: Hard to manage on servers.
  • EGL: The way to go. Offscreen rendering without a display.

Setup Check:

# Verify EGL visible
nvidia-smi
ls /usr/share/glvnd/egl_vendor.d/
# Should see 10_nvidia.json

Troubleshooting EGL: If you see gladLoadGL error: 0, it often means:

  1. Variable MESA_GL_VERSION_OVERRIDE is missing.
  2. libgl1 is trying to load Software Rasterizer (llvmpipe) instead of Nvidia driver.
  3. Fix: Ensure LD_LIBRARY_PATH points to /usr/lib/nvidia.

Glossary

  • Sim2Real: The process of transferring a simulation-trained policy to the real world.
  • Domain Randomization (DR): Training on a distribution of environments to improve robustness.
  • Dynamics Randomization: Changing Mass, Friction, Damping.
  • Visual Randomization: Changing Textures, Lights, Camera Pose.
  • Curriculum Learning: Gradually increasing the difficulty of the environment (e.g., drift range) during training.
  • MuJoCo: Multi-Joint dynamics with Contact. A popular physics engine.
  • Gym: The standard API (reset/step).

Summary Checklist

  1. Dockerize: Never run RL training on bare metal. Containerize the simulator.
  2. Seed Everything: Simulators, Random Number Generators, and Python Hash seeds.
  3. Golden Tests: Run a regression test on your environment before every training job.
  4. Configurable DR: Move randomization ranges to YAML files.
  5. Headless EGL: Ensure your render pipeline works without a monitor (X11 forwarding is brittle).
  6. Log Versions: When logging to WandB/MLflow, log the docker_image_sha of the environment.
  7. XML Templating: Use Jinja2/Tera to procedurally generate robot morphologies.