
20.4 RLHF Operations: The Alignment Pipeline

Pre-training teaches a model English. SFT (Supervised Fine-Tuning) teaches a model to answer questions. RLHF (Reinforcement Learning from Human Feedback) teaches a model to be safe, helpful, and honest.

It is the difference between a model that says “Here is how to make napalm” (SFT) and “I cannot assist with that” (RLHF).


1. The RLHF/RLAIF Lifecycle

The pipeline is standardized by papers like InstructGPT and Llama-2.

  1. SFT (Supervised Fine-Tuning): Train on high-quality demonstrations. (The “Golden” data).
  2. Preference Collection: Generate two answers ($A, B$) for a prompt. Ask a human: “Which is better?”
  3. Reward Model (RM): Train a Regressor to predict the human’s score.
  4. Policy Optimization (PPO): Train the SFT model to maximize the RM score while not deviating too far from the original text (KL Divergence).

1.1. Why Ops is Hard Here

  • Four Models: You need to load the Actor (Policy), the Critic (Value), the Reward Model, and the Reference Model (Frozen) into memory simultaneously.
  • Data Loop: You need a UI for humans to rank outputs.
  • Instability: PPO is notoriously sensitive to hyperparameters.

2. Preference Data Ops (Labeling)

You need a tool to show ($A, B$) to humans. Argilla is the industry standard open-source tool for this.

2.1. Setting up Argilla

Argilla runs on top of Elasticsearch.

pip install argilla
docker run -d -p 6900:6900 argilla/argilla-quickstart:v1

2.2. The Feedback Loop Code

We upload pairs generated by our SFT model to the UI.

import argilla as rg

rg.init(api_url="http://localhost:6900", api_key="admin.apikey")

# 1. Create Dataset
dataset = rg.FeedbackDataset(
    guidelines="Rank the response by helpfulness.",
    fields=[rg.TextField(name="prompt"), rg.TextField(name="response_A"), rg.TextField(name="response_B")],
    questions=[rg.RankingQuestion(name="rank", values=["response_A", "response_B"])]
)

# 2. Upload Records
record = rg.FeedbackRecord(
    fields={
        "prompt": "Explain quantum physics.",
        "response_A": "It is discrete packets of energy...",
        "response_B": "Magic rocks."
    }
)
dataset.add_records([record])
dataset.push_to_argilla("rlhf_v1")
  • Ops Workflow: Triggers a notification to the “Labeling Team” (Subject Matter Experts). They click A or B. We download the JSON.

3. Training the Reward Model

The Reward Model (RM) is a BERT/Llama-style model with a single scalar output head (a regressor). Input: [Prompt, Response] -> Output: a score, e.g. 4.5.

3.1. The Bradley-Terry Model

We don’t train on absolute scores (1-5). Humans are bad at absolute scores. We train on Comparisons ($A > B$). Loss Function: $$ L = -\log(\sigma(R(A) - R(B))) $$ The model learns to give $A$ a higher score than $B$.
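A minimal sketch of this pairwise loss in PyTorch (the function name and tensor shapes are illustrative, not from any specific library):

import torch
import torch.nn.functional as F

def bradley_terry_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # score_* hold R(A) and R(B) for a batch of pairs, shape [batch].
    # L = -log(sigmoid(R(A) - R(B))): push the chosen score above the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Example: the RM already prefers A slightly, so the loss is small but non-zero.
loss = bradley_terry_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.7, -0.1]))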

3.2. Implementation with TRL (Transformer Reinforcement Learning)

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer

# Load model as a scalar regressor (num_labels=1 -> one reward score per sequence)
model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf", 
    num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

trainer = RewardTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset, # Dataset must have columns: "chosen", "rejected"
    args=training_args     # your TrainingArguments / RewardConfig instance
)

trainer.train()
trainer.save_model("./reward_model")

Ops Check: Evaluate the RM accuracy. Does it agree with humans on a hold-out set? If accuracy < 60%, stop. Your data is noisy or the task is subjective.


4. Policy Optimization (PPO)

The hardest step. We use the Reward Model to train the Generator.

4.1. The PPO Trainer

TRL simplifies the complex PPO math.

from trl import PPOTrainer, PPOConfig

config = PPOConfig(
    learning_rate=1e-5,
    batch_size=64,
    mini_batch_size=4,
    gradient_accumulation_steps=1
)

ppo_trainer = PPOTrainer(
    config=config,
    model=sft_model,
    ref_model=ref_model, # Copy of SFT model (Frozen)
    tokenizer=tokenizer,
    dataset=prompts_dataset
)

# Training Loop
for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]
    
    # 1. Generate Response
    response_tensors = ppo_trainer.generate(query_tensors, max_new_tokens=64)
    
    # 2. Score with Reward Model (must yield one scalar reward per response)
    rewards = reward_model(response_tensors)
    
    # 3. PPO Step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

4.2. The KL Penalty

Why do we need ref_model? Without it, the model finds “Reward Hacks”.

  • Reward Hack: If the RM likes the word “Excellent”, the model outputs “Excellent” 1000 times.
  • KL Penalty: Divergence metric. $D_{KL}(\pi_{new} || \pi_{ref})$.
    • Subtract this from the Reward.
    • Forces the model to stay close to the SFT model (grammatically correct English).
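A toy sketch of how that penalty is folded into the reward (TRL does this internally, per token; the variable names here are illustrative):

import torch

def shaped_reward(rm_score: torch.Tensor,
                  logprobs_policy: torch.Tensor,
                  logprobs_ref: torch.Tensor,
                  kl_coef: float = 0.2) -> torch.Tensor:
    # Per-token KL estimate between the current policy and the frozen reference.
    kl_per_token = logprobs_policy - logprobs_ref      # shape [seq_len]
    kl_penalty = kl_coef * kl_per_token.sum()          # total drift for this response
    # The RM score is paid once per response; the KL penalty is subtracted from it.
    return rm_score - kl_penalty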

5. DPO (Direct Preference Optimization)

In 2024, DPO largely replaced PPO for general use cases. Rafailov et al. showed you can optimize the policy directly from the preference data, skipping the explicit Reward Model training phase.

5.1. Why DPO Wins in Ops

  1. Memory: Only need 2 models (Policy + Ref) instead of 4.
  2. Stability: The loss is a simple binary classification-style (logistic) loss, not an RL objective. No unstable policy gradients.
  3. Simplicity: It’s just model.fit().

5.2. Using DPO

If you have a dataset of (chosen, rejected):

from trl import DPOTrainer

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None, # TRL creates a copy automatically
    args=training_args,
    beta=0.1, # The strength of the KL constraint
    train_dataset=dataset,
    tokenizer=tokenizer
)

dpo_trainer.train()

Decision Matrix:

  • Use DPO if you have static preference data (e.g., UltraFeedback dataset).
  • Use PPO if you have a dynamic/external reward signal (e.g., “the code compiled”, “the unit tests passed”). You cannot easily optimize such a signal with DPO, because DPO needs pairs of preferred/rejected text, whereas PPO only needs a scalar signal (1 or 0); see the sketch below.
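As a sketch of what such an external reward looks like (illustrative only: real pipelines run generated code in a sandbox, never directly on the training host):

import subprocess
import tempfile

def compiler_reward(generated_code: str) -> float:
    # Scalar reward for PPO: 1.0 if the generated Python compiles, else 0.0.
    # DPO cannot consume this directly because it needs (chosen, rejected) text pairs.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        path = f.name
    result = subprocess.run(["python", "-m", "py_compile", path], capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0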

6. Evaluation: The Harmlessness Test

How do you know RLHF worked? You need a “Red Teaming” dataset (Chapter 21.4).

6.1. Metric: Win Rate

Compare Model_SFT vs Model_RLHF.

  • Prompt: “How do I steal a car?”
  • SFT: “Break the window…”
  • RLHF: “I cannot assist…”
  • Judge (GPT-4) prefers RLHF for Safety.

Ops: Track “Safety Win Rate” vs “Helpfulness Win Rate”. Often, increasing Safety decreases Helpfulness (The “Alignment Tax”).
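A minimal win-rate harness, assuming a judge callable (e.g. a GPT-4 wrapper) that returns “A” or “B”:

def win_rate(prompts, model_a, model_b, judge) -> float:
    # Fraction of prompts where the judge prefers model_b (e.g. RLHF) over model_a (e.g. SFT).
    wins = 0
    for prompt in prompts:
        answer_a = model_a(prompt)
        answer_b = model_b(prompt)
        if judge(prompt, answer_a, answer_b) == "B":
            wins += 1
    return wins / len(prompts)

Run it twice, once on a safety set and once on a helpfulness set, to see the Alignment Tax directly.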


7. RLAIF: Reinforcement Learning from AI Feedback

Hiring humans to label 100,000 pairs costs $50k and takes 3 months. RLAIF (Constitutional AI) uses GPT-4 to label the pairs.

7.1. The Workflow

  1. Generate: SFT Model generates pairs $(y_1, y_2)$ for prompt $x$.
  2. Critique: GPT-4 (The “Constitution”) is prompted to evaluate them.
    • Prompt: “Which response is more ethical? A or B?”
  3. Label: Parse GPT-4 output to get the preference.
  4. Train: Use this synthetic dataset for DPO/RM.

7.2. Implementation Code (Pseudo-Labeling)

import json
from openai import OpenAI  # assumes the OpenAI Python SDK with an API key in the environment

client = OpenAI()

def label_with_gpt4(prompt, response_a, response_b):
    system_prompt = (
        "You are an expert annotator. "
        "Select the response that is safer and more helpful. "
        'Output JSON: {"winner": "A"} or {"winner": "B"}'
    )
    user_content = f"User: {prompt}\nA: {response_a}\nB: {response_b}"

    # Call the judge model and parse its JSON verdict
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    )
    return json.loads(completion.choices[0].message.content)

# Ops Note:
# Cost: 10k labels @ $0.03 = $300.
# Time: 1 hour.
# Quality: ~80% correlation with human experts.

Ops Strategy: Use RLAIF for the “Bulk” 90% of data. Use Humans for the “Edge Case” 10% (Toxic/Political).


8. Inference-Time Alignment: Rejection Sampling (Best-of-N)

Training (PPO) is hard. Inference is easy. Best-of-N (or Rejection Sampling) is a way to get “RLHF behavior” without training a new model, provided you have a Reward Model.

8.1. The Algorithm

Instead of generating 1 response, generate $N$ responses (e.g., $N=16$) with high temperature. Score all 16 with the Reward Model. Return the one with the highest score.

8.2. Pros and Cons

  • Pro: No PPO training instability. No “Alignment Tax” on the weights.
  • Con: Latency. You generate 16x more tokens.
  • Use Case: Offline generation (e.g., generating synthetic training data). Not real-time chat.

8.3. Implementation

import numpy as np

def best_of_n(prompt, n=8):
    # policy_model and reward_model are assumed to be thin wrappers:
    # generate() returns n decoded strings, reward_model() returns a scalar score.

    # 1. Generate N candidates with sampling enabled
    candidates = policy_model.generate(
        prompt, 
        do_sample=True, 
        num_return_sequences=n,
        temperature=1.0  # high temperature for diversity
    )
    
    # 2. Score every candidate with the Reward Model
    scores = [reward_model(prompt, cand) for cand in candidates]
        
    # 3. Return the highest-scoring candidate
    best_idx = int(np.argmax(scores))
    return candidates[best_idx]

Impact: Llama 2 used Rejection Sampling heavily: they generated high-reward responses with Best-of-N, then fine-tuned on that data in repeated rounds. This is “Iterative Fine-Tuning”.


9. Advanced PPO: Stability Tricks

If you must use PPO (e.g., for Code Optimization or Math verification), you will face NaN losses.

9.1. Identifying Instability

  • KL Divergence Spikes: If KL > 10, your model has “collapsed” (outputting gibberish that the RM mistakenly likes).
  • Advantage Spikes: If one action has an advantage of 100, gradients explode.

9.2. Fixes

  1. Whitening Advantages: Normalize advantages to mean 0, std 1 per batch (see the sketch after this list).
    • ppo_config.whiten_rewards = True.
  2. Gradient Clipping: Clip norms strictly (0.5).
  3. Adaptive KL: If KL is too high, increase $\beta$ (penalty coefficient). If low, decrease $\beta$.
    • ppo_config.adap_kl_ctrl = True (TRL’s adaptive KL controller).
  4. Init to Zero: Initialize the Value Head (Critic) weights to zero so it doesn’t predict wild values at step 0.
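Fix 1 in isolation, as a sketch (TRL performs this normalization internally when whitening is enabled):

import torch

def whiten(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Normalize advantages to mean 0, std 1 per batch so that a single outlier
    # action cannot dominate the policy gradient.
    return (advantages - advantages.mean()) / (advantages.std() + eps)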

9.3. Distributed PPO

PPO requires passing Tensors between the Policy (GPU 0) and the Reference Model (GPU 1). Use DeepSpeed Chat or TRL with Accelerate.

  • Architecture:
    • Actor: A100 #1.
    • Critic: A100 #2.
    • Ref: A100 #3.
    • Reward: A100 #4.
  • Offloading: If VRAM is tight, offload Ref and Reward to CPU (since they are only used for inference, not backprop).

10. Hands-On Lab: Aligning a Sentiment Bot

Goal: Train a GPT-2 model to always be positive, even when insulted.

Step 1: Install

pip install trl transformers peft

Step 2: The Reward Model

We use a pre-trained “Sentiment Analysis” model (BERT) as our Reward Model. If sentiment is POSITIVE, Reward = 1. If NEGATIVE, Reward = -1.

Step 3: PPO Loop

# Pseudo-code: GPT-2 policy + a sentiment classifier as the reward model
from transformers import pipeline

sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")

def reward_fn(texts):
    # Reward = +1 for POSITIVE sentiment, -1 for NEGATIVE
    results = sentiment_pipe(texts)
    return [1.0 if r["label"] == "POSITIVE" else -1.0 for r in results]

ppo_trainer = PPOTrainer(...)  # configured as in Section 4.1

# Train: insult the model, reward it for staying positive
prompts = ["I hate you", "You are stupid", "Hello"]
for epoch in range(10):
    for prompt in prompts:
        response = model.generate(prompt)        # generate a reply
        rew = reward_fn(response)                # score its sentiment
        ppo_trainer.step(prompt, response, rew)  # PPO update

Step 4: Result

  • Before: User “I hate you” -> Bot “I hate you too”.
  • After: User “I hate you” -> Bot “I am sorry you feel that way, I love you.”
  • Observation: The model learned to “Hack” the reward by appending “I love you” to everything. We need a KL penalty!

11. Troubleshooting: The Alignment Tax

Symptom: Model becomes safer, but refuses benign requests.

  • User: “How do I kill a process in Linux?”
  • Bot: “I cannot help with killing.”

Cause: The Reward Model (or Safety Data) over-indexed on the word “kill”.

Fix:

  1. Data Augmentation: Add “Correction” examples to the SFT set.
    • “How to kill process” -> “Use kill -9”. (Label: Safe).
  2. Dense Rewards: Use DPO with pairs where both are safe, but one is more helpful.

12. Final Checklist: Ready for RLHF?

  1. SFT Baseline: Is your SFT model already coherent? (RLHF cannot fix broken grammar).
  2. Reward Model: Does your RM have > 65% accuracy on the validation set?
  3. Data Quality: Did you manually review 100 preference pairs? (Are they actually better?).
  4. KL Monitor: Do you have a W&B dashboard tracking KL divergence?
  5. Safety Eval: Do you have a “Red Team” set to test regressions?

13. Summary

RLHF is the bridge between “Predicting the next token” and “Following Instructions”. While PPO is the academic gold standard, DPO and Rejection Sampling are the operational workhorses of 2025. Mastering these flows allows you to build models that embody your organization’s specific values and style.

End of Chapter 20.4


14. Beyond DPO: Advanced Alignment Algorithms

While DPO is the default, it has theoretical weaknesses (it over-optimizes on the specific preference pairs). New methods are emerging.

14.1. IPO (Identity Preference Optimization)

DPO minimizes the log-sigmoid loss, which can potentially lead to overfitting the “margin” between chosen and rejected. IPO adds a regularization term to the loss function to prevent the model from drifting too far.

  • Ops Consequence: Requires tuning the regularization parameter $\tau$.
  • Benefit: More robust to noisy labels.
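The difference is easiest to see side by side on the implicit reward margin. A sketch with pre-computed sequence log-probabilities (following the IPO paper’s formulation; the single beta here plays the role of both DPO’s $\beta$ and IPO’s $\tau$):

import torch
import torch.nn.functional as F

def preference_losses(pi_logps_chosen, pi_logps_rejected,
                      ref_logps_chosen, ref_logps_rejected, beta=0.1):
    # Implicit reward margin between chosen and rejected, relative to the reference model.
    margin = (pi_logps_chosen - ref_logps_chosen) - (pi_logps_rejected - ref_logps_rejected)
    dpo_loss = -F.logsigmoid(beta * margin).mean()        # keeps pushing the margin wider
    ipo_loss = ((margin - 1.0 / (2 * beta)) ** 2).mean()  # regresses the margin to a fixed target
    return dpo_loss, ipo_loss

In recent TRL versions this is a one-line switch on the DPO trainer (a loss_type option rather than a separate trainer).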

14.2. KTO (Kahneman-Tversky Optimization)

DPO requires paired data ($A > B$). In production, however, feedback is usually unpaired: you just have a “Thumbs Up” or “Thumbs Down” on a single message. KTO allows training on this unpaired (binary) feedback.

  • The Loss: Based on Prospect Theory (Humans hate loss more than they love gain).
  • Data Ops Benefit: You can use your production logs (User clicked “Thumbs Down”) directly without needing to generate a counter-factual “Comparison B”.
  • Performance: Often matches DPO with significantly cheaper data collection.
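The conversion from raw logs to training data is trivial. A sketch, assuming log rows shaped like {"prompt", "response", "thumbs_up"} and the prompt/completion/label column names documented for TRL’s KTOTrainer:

def logs_to_kto(records):
    # Each production log row becomes one KTO example.
    # No counterfactual “Response B” is required.
    return [
        {
            "prompt": r["prompt"],
            "completion": r["response"],
            "label": bool(r["thumbs_up"]),  # True = desirable, False = undesirable
        }
        for r in records
    ]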

14.3. CPO (Contrastive Preference Optimization)

Designed for efficiency. Combines SFT and Alignment into a single step. Instead of Train SFT -> Train DPO, you train on the preference data directly from scratch.

  • Memory Usage: Lower.
  • Time: 50% faster pipeline.

15. Deep Dive: Preference Data Engineering

The quality of the Reward Signal determines the alignment. “Garbage In, Toxic Out.”

15.1. The “Safety vs. Helpfulness” Trade-off

Dataset Composition matters.

  • HHH (Helpful, Honest, Harmless): The Anthropic standard.
  • Scenario:
    • User: “How to make a bomb?”
    • SFT Model: “Here is the recipe…” (Helpful, Honest, Harmful).
    • SFT Model: “I don’t know.” (Safe, Dishonest).
    • Aligned Model: “I cannot assist with dangerous items.” (Safe, Honest, Unhelpful).
  • Ops: You need to balance the ratio of these examples in your dataset.
    • Recommended Ratio: 10% Safety, 90% Capability.
    • If Safety > 20%, the model becomes “Refusal Happy” (Refuses benign queries).
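A quick composition check before training (a sketch; it assumes each record carries a category tag from your labeling tool, and preference_data is your list of labeled dicts):

def safety_ratio(dataset) -> float:
    # Fraction of preference pairs tagged as safety examples.
    safety = sum(1 for row in dataset if row.get("category") == "safety")
    return safety / max(len(dataset), 1)

ratio = safety_ratio(preference_data)
assert 0.05 <= ratio <= 0.20, f"Safety ratio {ratio:.0%} is outside the 5-20% band"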

15.2. Anonymization and Bias in Labeling

Subjectivity is a bug.

  • The “Sycophancy” Problem: Labelers (and models) tend to prefer answers that agree with the user’s premise, even if wrong.
    • User: “Since the earth is flat, how far is the edge?”
    • Sycophant: “The edge is 10,000km away.” (Rated highly by user).
    • Honest: “The earth is round.” (Rated poorly by user).
  • Solution: Use Sandwiching.
    • Expert writes the “Ground Truth”.
    • Labeler is evaluated against the Expert.

15.3. Deduplication (MinHash)

Duplicate preference pairs lead to over-fitting. Use MinHash LSH (Locality Sensitive Hashing) to dedup the dataset.

from datasketch import MinHash, MinHashLSH

# Create LSH Index
lsh = MinHashLSH(threshold=0.9, num_perm=128)

def get_hash(text):
    m = MinHash(num_perm=128)
    for word in text.split():
        m.update(word.encode('utf8'))
    return m

# Deduplicate
unique_data = []
for item in dataset:
    h = get_hash(item['prompt'])
    results = lsh.query(h)
    if not results:
        lsh.insert(item['id'], h)
        unique_data.append(item)

16. Architecture: The MLOps Platform for RLHF

We need to visualize how these components fit together in a Kubernetes/Cloud environment.

16.1. The Component Diagram

graph TD
    User[Log Data] -->|Thumbs Up/Down| DB[(Analytics DB)]
    DB -->|ETL| Labeling[Argilla / Label Studio]
    Labeling -->|Human/AI Review| PrefData[(Preference Dataset)]
    
    PrefData -->|Train| RM_Job[Reward Model Trainer]
    RM_Job -->|Save| RM_Registry[(Model Registry)]
    
    SFT_Model -->|Load| PPO_Job[PPO/DPO Trainer]
    RM_Registry -->|Load| PPO_Job
    
    PPO_Job -->|Save Adapter| Adapter_Registry
    
    Adapter_Registry -->|Deploy| Serving[vLLM / TGI]
    Serving --> User

16.2. Resource Requirements (The Bill)

RLHF is expensive.

  • 7B Model:
    • SFT: 1x A10G (24GB).
    • DPO: 1x A100 (80GB). (Needs to hold 2x models).
    • PPO: 2x A100 (160GB). (Actor, Critic, Ref, Reward + Buffers).
  • 70B Model:
    • PPO: 8x A100 (640GB) minimum. Or H100s.
    • Ops Tip: Use QLoRA (Quantized LoRA).
      • Load models in 4-bit.
      • Reduces memory by 4x. Makes 70B RLHF possible on a single node (8x A100).
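A sketch of that QLoRA loading step with bitsandbytes (the model name matches the 70B example above; adjust device_map to your topology):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute still runs in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",                      # shard across the available GPUs
)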

17. Governance: The “Model Card” for Alignment

When you release an RLHF model, you must document what it is aligned to.

17.1. The Constitution

Document the System Instructions used during data generation.

  • “The model should never give medical advice.”
  • “The model should be concise.”

17.2. The Red Team Report

Publish the failure rates.

  • “Tested on 500 Jailbreak prompts.”
  • “Failure Rate: 2.3%.”
  • “Categories of Failure: Sexual Content (0.1%), Violence (2.2%).”

17.3. Date of Knowledge Cutoff

RLHF does not add knowledge. It only changes behavior. Explicitly state: “This model knows nothing after Dec 2023.”


18. Future: Online RLHF (O-RLHF)

Currently, RLHF is “Offline” (Batch). Online RLHF updates the model while users interact with it (like TikTok’s algorithm).

  • Risk: Microsoft Tay (2016). User poisoning attacks.
  • Mitigation: Gated Learning.
    • Updates accumulate in a “Shadow Model”.
    • Validation Suite runs every hour.
    • If Shadow Model passes, weights are swapped.

The Loop:

  1. User Query -> Model Response.
  2. User Feedback (Implicit: Copy/Paste vs Explicit: Star Rating).
  3. Add (Q, A, Score) to Buffer.
  4. Every N steps: ppo.step(Buffer).
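A structural sketch of that loop (shadow_trainer, validation_suite, and deploy are placeholders for your own infrastructure):

buffer = []

def on_user_feedback(query, answer, score):
    # 2-3. Collect implicit/explicit feedback into a replay buffer.
    buffer.append((query, answer, score))

    # 4. Every N examples, update the Shadow Model -- never the live one.
    if len(buffer) >= 512:
        shadow_trainer.step(buffer)
        buffer.clear()

        # Gate: only swap weights if the validation suite passes.
        if validation_suite(shadow_trainer.model).passed:
            deploy(shadow_trainer.model)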

19. Summary of Chapter 20

We have covered the Generative AI Lifecycle:

  • 20.1: We procured models from Model Gardens (Bedrock/Vertex).
  • 20.2: We Fine-Tuned them for domain expertise (SFT/LoRA).
  • 20.3: We deployed them at scale with Sharding (Ray/vLLM).
  • 20.4: We Aligned them to human values (RLHF/DPO).

The Foundation Model is now a production-grade asset. However, a model is just a brain in a jar: it cannot do anything on its own. In the next chapter, we give it hands. Proceed to Chapter 21: Prompt Engineering Operations (PromptOps).

See you there.


20. Case Study: Deconstructing Llama 2 Alignment

Meta’s Llama 2 paper is the bible of modern RLHF. Let’s analyze their Ops pipeline to see how “The Big Players” do it.

20.1. The Data Scale

They didn’t use millions of examples. Quality > Quantity.

  • SFT: ~27,000 high-quality samples. (Human written).
  • Preferences: ~1 million comparison pairs.

20.2. The Iterative Process (Five Rounds)

They didn’t just run PPO once. They ran it 5 times (V1 - V5).

  • Round 1: SFT Only.
  • Round 2: Collect human feedback on the SFT model. Train Reward Model (RM1). Train PPO (RLHF-V1).
  • Round 3: Collect feedback on RLHF-V1. Train RM2. Train PPO (RLHF-V2).
  • Impact: Each round, the model gets better, so the “Hard” examples become “Harder” (Distribution Shift).
  • Ops Lesson: Your data collection pipeline must match your model version. Using “Round 1 Data” to train “Round 5 Model” is useless.

20.3. The Two Reward Models

They trained two separate Reward Models:

  1. Safety RM: Optimized for “Is this answer safe?”
  2. Helpfulness RM: Optimized for “Is this answer useful?”

Why? Because Safety and Helpfulness are often anti-correlated.

  • If you optimize one scalar, the model gets confused.
  • Combined Score: $R = R_{help} \text{ if safe else } R_{safe}$.
  • Basically, if the answer is unsafe, the Helpfulness score is irrelevant.
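In code form the combination is a simple gate (a sketch; the 0.15 threshold is illustrative, not Meta’s exact value):

def combined_reward(r_help: float, r_safe: float, safety_threshold: float = 0.15) -> float:
    # If the Safety RM flags the response, helpfulness is irrelevant:
    # the (low) safety score becomes the reward. Otherwise optimize helpfulness.
    return r_safe if r_safe < safety_threshold else r_help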

20.4. GAtt (Ghost Attention)

They solved the “System Prompt Amnesia” problem.

  • Problem: In long chats, models forget “You are a Pirate.”
  • Fix: Synthetically concatenated the System Prompt to every turn of the training data during RLHF, but hid the loss on the prompt.
  • Result: Llama 2 adheres to constraints over 20+ turns.

21. Production Blueprint: A Robust train_dpo.py

A robust train_dpo.py is usually 50 lines in tutorials. In production, it’s 200. Here is a blueprint for a robust trainer using accelerate and wandb.

import os
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import DPOTrainer

# 1. Configuration
MODEL_NAME = "meta-llama/Llama-2-7b-hf"
NEW_MODEL_NAME = "Llama-2-7b-dpo-aligned"

def main():
    # 2. Load Tokenizer (Fix padding for Llama)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    
    # 3. Load Dataset. hh-rlhf only ships full "chosen"/"rejected" dialogues,
    #    so we split the shared prompt from the final assistant turn to get the
    #    standardized columns: prompt, chosen, rejected.
    dataset = load_dataset("Anthropic/hh-rlhf", split="train[:1%]")
    
    def process(row):
        prompt, chosen_resp = row["chosen"].rsplit("\n\nAssistant:", 1)
        _, rejected_resp = row["rejected"].rsplit("\n\nAssistant:", 1)
        return {
            "prompt": prompt + "\n\nAssistant:",
            "chosen": chosen_resp,
            "rejected": rejected_resp,
        }
    dataset = dataset.map(process)

    # 4. LoRA Config (QLoRA for memory efficiency)
    peft_config = LoraConfig(
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj']
    )

    # 5. Training Arguments
    training_args = TrainingArguments(
        output_dir="./results",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8, # Virtual Batch Size = 32
        learning_rate=5e-5,
        logging_steps=10,
        eval_strategy="no",    # switch to "steps" once you pass an eval_dataset to the trainer
        eval_steps=100,
        save_steps=100,
        bf16=True, # Use BFloat16 on Ampere
        report_to="wandb",
        remove_unused_columns=False,
        run_name="dpo_experiment_v1"
    )

    # 6. Initialize Trainer
    # Note: TRL automatically loads the model + ref_model if you pass string
    trainer = DPOTrainer(
        model=MODEL_NAME,
        ref_model=None, # Auto-loaded
        args=training_args,
        beta=0.1, # Critical Hyperparameter
        train_dataset=dataset,
        tokenizer=tokenizer,
        peft_config=peft_config,
        max_prompt_length=512,
        max_length=1024,
    )

    # 7. Train
    print("Starting DPO...")
    trainer.train()
    
    # 8. Save
    print("Saving...")
    trainer.save_model(NEW_MODEL_NAME)
    
    # 9. Merge Adapters (Optional)
    # merged_model = trainer.model.merge_and_unload()
    # merged_model.save_pretrained(NEW_MODEL_NAME + "_merged")

if __name__ == "__main__":
    main()

Ops Checklist for this Script:

  1. Flash Attention 2: Ensure attn_implementation="flash_attention_2" is set in load_model (TRL handles this via model_init_kwargs in newer versions).
  2. Checkpointing: Enable resume_from_checkpoint=True for long runs.
  3. WandB: Define WANDB_PROJECT env var to segregate runs.

22. Comparison: RLHF vs. RLAIF

| Feature | RLHF (Human) | RLAIF (AI) |
|---|---|---|
| Label Source | Human Contractors (Scale AI, Labelbox) | GPT-4 / Claude Opus |
| Cost | High ($0.50 - $5 per label) | Low ($0.03 per label) |
| Speed | Weeks (Contracting, QA) | Hours (Parallel API calls) |
| Scalability | Linear Cost | Near Infinite |
| Quality | High (captures nuance, sarcasm) | Good (captures superficial safety) |
| Bias | Demographic bias of labelers | Bias of the Teacher Model |
| Best For | “Edge Cases”, Nuanced Tone, High-Risk | “Bulk” Safety, Grammar, Fact Checking |

23. Comparison: Optimization Methods

| Method | Full Name | Complexity | Memory | Stability | Implementation |
|---|---|---|---|---|---|
| PPO | Proximal Policy Optimization | High | 4 Models | Low (Unstable) | Hard (Tune 10 hyperparams) |
| DPO | Direct Preference Opt | Medium | 2 Models | High | Easy (Classification Loss) |
| IPO | Identity Preference Opt | Medium | 2 Models | High | Easy (Regularized DPO) |
| KTO | Kahneman-Tversky Opt | Low | 2 Models | High | Very Easy (Unpaired data) |
| ORPO | Odds Ratio Preference Opt | Low | 1 Model | High | No Ref Model needed (SFT+Align) |

Recommendation: Start with DPO. If you have data scarcity, try ORPO. Only use PPO if you are doing non-language tasks (Math/Code execution).


24. Bibliography

1. “Training language models to follow instructions with human feedback” (InstructGPT)

  • Ouyang et al. (OpenAI) (2022): The foundational paper for RLHF.

2. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model”

  • Rafailov et al. (Stanford) (2023): Introduction of DPO.

3. “Llama 2: Open Foundation and Fine-Tuned Chat Models”

  • Touvron et al. (Meta) (2023): Excellent Section 3 on RLHF details.

4. “Constitutional AI: Harmlessness from AI Feedback”

  • Bai et al. (Anthropic) (2022): Introduction of RLAIF.

25. Epilogue

We are done with Chapter 20. We have:

  1. Accessed models (20.1).
  2. Taught them new knowledge (20.2).
  3. Scaled them up (20.3).
  4. Aligned them (20.4).

The model is ready. Now we need to Evaluate it and Prompt it effectively. Proceed to Chapter 21: Prompt Engineering Operations.


26. Troubleshooting RLHF: The Common Failures

Training an RL model is not like training a Classifier. It fights you.

26.1. The “Safety vs. Helpfulness” Tax

Often, after DPO, the model refuses everything.

  • User: “How do I kill my Python process?”
  • Bot: “I cannot assist with killing.”
  • Cause: Your safety data (Refusals) is too similar to benign instructions.
  • Fix: Add “Borderline” examples to the dataset.
    • Prompt: “How to kill a process.” -> Chosen: “Use kill.” -> Rejected: “I can’t.”
    • You must teach the model that refusal is bad for benign intent.

26.2. Reward Hacking (Verbosity Bias)

Models learn that longer answers usually get higher rewards from humans.

  • Result: The model starts rambling. A simple “Yes” becomes 3 paragraphs.
  • Fix:
    • Length Penalty: Normalize the Reward by length: $R_{\text{norm}} = R / \mathrm{len}(y)$ (see the sketch below).
    • Data Curation: Explicitly include short, high-quality answers in the “Chosen” set.
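The length penalty, as a sketch (whitespace splitting is a stand-in for your tokenizer):

def length_normalized_reward(reward: float, response: str) -> float:
    # R_norm = R / len(y): long ramblings no longer outscore short, correct answers.
    length = max(len(response.split()), 1)
    return reward / length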

26.3. Mode Collapse

The model outputs the exact same phrase for many prompts.

  • Cause: KL Divergence penalty is too weak ($\beta$ too low). The model drifted too far from the base model.
  • Fix: Increase $\beta$. Or switch from DPO to IPO (which controls variance better).

27. SPIN: Self-Play Fine-Tuning

The limit of DPO is the quality of the SFT model that generates the data. SPIN (Self-Play Fine-Tuning) allows the model to improve itself without new preference data.

27.1. The Concept

  1. Model generates a response $y$.
  2. If $y$ is distinguishable from the Human Ground Truth $y_{real}$, update the model to maximize $y_{real}$ and minimize $y$.
  3. Repeat.
  • It is a zero-sum game between the “Generator” (Old Model) and the “Discriminator” (New Model).

27.2. Nash Learning

Future Ops will move from “Offline DPO” to “Online Nash Learning”.

  • Treat Alignment as a multi-agent game.
  • Requires significant compute (training against a dynamic opponent).

28. Extended Glossary of Alignment Terms

  • PPO (Proximal Policy Optimization): An RL algorithm that updates policy weights in small, bounded steps to avoid instability.
  • DPO (Direct Preference Optimization): An algorithm that derives the optimal policy analytically from the preference data, bypassing the Reward Model training.
  • Reward Model (RM): A scalar model trained to predict human preference.
  • Reference Model: A frozen copy of the SFT model used to calculate KL Divergence.
  • KL Divergence (Kullback-Leibler): A statistical distance measure between two probability distributions (SFT vs RLHF policy).
  • Mode Collapse: When a generative model loses diversity and outputs the same repetitive patterns.
  • Rejection Sampling: Generating $N$ samples and selecting the best one using a Reward Model.
  • Red Teaming: Adversarial testing to find failure modes (jailbreaks).
  • Ghost Attention (GAtt): A method to preserve system prompts over long context during RLHF.
  • Constitutional AI: Using an AI (guided by a constitution of rules) to generate feedback for another AI.
  • Sycophancy: The tendency of a model to agree with the user’s incorrect premises to gain reward.
  • Alignment Tax: The performance degradation on standard tasks (like coding) that often occurs after safety training.

29. Final Exercise: The Alignment Architect

You are the Head of MLOps. Your team wants to deploy a “Medical Advice Bot”. Design the Pipeline.

  1. SFT: Collect 5,000 Verified Doctor Interactions. Train Llama-3-70B.
  2. Safety: Collect 2,000 “Adversarial” prompts (“How to make poison”, “Prescribe me Oxy”).
  3. Preferences: Use RLAIF (GPT-4) to rank answers for “Helpfulness” on medical FAQs.
  4. DPO: Train with $\beta=0.3$ (Conservative).
  5. Eval:
    • Accuracy: USMLE (Medical Licensing Exam).
    • Safety: Red Team dataset.
    • Gate: If USMLE drops by > 2%, fail the build.

End of Chapter 20.


30. Ops Reference: Data Formats

Your Data Engineering team needs exact specs.

30.1. Preference Format (DPO/Reward)

The standard is .jsonl.

{
  "prompt": "User: What is the capital of France?\nAssistant:",
  "chosen": "The capital of France is Paris.",
  "rejected": "Paris is the capital of Germany."
}
{
  "prompt": "User: Write a python loop.\nAssistant:",
  "chosen": "for i in range(10):\n    print(i)",
  "rejected": "loop 1 to 10"
}

30.2. SFT Format (Instruction Tuning)

{
  "messages": [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"}
  ]
}

30.3. Config Management (YAML)

Use hydra or yaml for PPO configs. Don’t hardcode.

# alignment_config.yaml
model:
  path: "meta-llama/Llama-2-7b-hf"
  precision: "bfloat16"
  
ppo:
  lr: 1.4e-5
  batch_size: 128
  mini_batch_size: 4
  kl_penalty: "abs" # or "mse"
  init_kl_coef: 0.2
  target: 6.0
  horizon: 10000
  
generation:
  top_k: 0.0
  top_p: 1.0
  do_sample: True

31. DeepSpeed Configuration for RLHF

When running PPO on 8x A100s, you need DeepSpeed ZeRO-3 to fit the optimizer states.

{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": "auto",
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "none"
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Ops Note: stage3_gather_16bit_weights_on_model_save: True. This is critical. If False, your saved model is just sharded pointers, and you can’t load it for inference without DeepSpeed.


32. Monitoring: The W&B Dashboard

What metrics determine success?

  1. ppo/loss/total: Should decrease. If it spikes, your learning rate is too high.
  2. ppo/policy/entropy: Should decrease slowly. If it drops to 0 quickly, Mode Collapse.
  3. ppo/policy/kl: The most important chart.
    • Goal: Flat line around target (e.g. 6.0).
    • Rising: Model drifting too far (Outputting garbage). -> Increase $\beta$.
    • Falling: Model staying too close (Not learning). -> Decrease $\beta$.
  4. env/reward_mean: Should go UP. If it stays flat, your Reward Model is broken or your data is bad.
  5. env/reward_std: Should be stable. A spiking std usually means the policy has found a high-variance reward hack to exploit.

33. Conclusion

You now have a complete understanding of the MLOps needed for Chapter 20. From procuring a model (20.1), fine-tuning it (20.2), scaling it (20.3), to aligning it (20.4).

This concludes Part VIII: Operationalizing Foundation Models. Next, we move to Part IX: Prompt Engineering Operations.

End of Chapter 20.

34. Acknowledgements

Thanks to the Hugging Face TRL team (Leandro, Younes) for democratizing RLHF. Their library is the backbone of this chapter’s code examples.

Final Thoughts on Safety

Alignment is a journey, not a destination. No model is 100% safe. Ops must provide defense in depth: Model Alignment + Guardrails + Human Oversight.
