21.5. Reflection Patterns: System 2 Thinking for LLMs

From Fast to Slow Thinking

Daniel Kahneman described human cognition in two modes:

  • System 1: fast, instinctive, emotional. (e.g., “Paris is the capital of France”).
  • System 2: slower, deliberative, logical. (e.g., “17 x 24 = ?”).

Raw LLMs are fundamentally System 1 engines. They predict the next token via a single forward pass. They do not “stop and think”. Reflection Patterns (or Reflexion) are architectural loops that force the LLM into System 2 behavior. By asking the model to “output, then critique, then revise,” we trade inference time for accuracy.


21.5.1. The Basic Reflexion Loop

The simplest pattern is the Try-Critique-Retry loop.

graph TD
    User --> Gen[Generator]
    Gen --> Output[Draft Output]
    Output --> Eva[Evaluator/Self-Reflection]
    
    Eva -->|Pass| Final[Final Answer]
    Eva -->|Fail + Feedback| Gen

Key Insight: An LLM is often better at verifying an answer than generating one (loosely analogous to P vs. NP: checking a solution is easier than finding it). GPT-4 can easily spot a bug in code it just wrote, even if it couldn’t write bug-free code in one shot.


21.5.2. Implementation: The Reflective Agent

Let’s build a ReflectiveAgent that fixes its own Python code.

import json
from typing import List, Optional

class ReflectiveAgent:
    def __init__(self, client, model="gpt-4o"):
        self.client = client
        self.model = model
        
    async def generate_with_reflection(self, prompt: str, max_retries=3):
        history = [{"role": "user", "content": prompt}]
        
        for attempt in range(max_retries):
            # 1. Generate Draft
            draft = await self.call_llm(history)
            
            # 2. Self-Evaluate
            critique = await self.self_critique(draft)
            
            if critique['status'] == 'PASS':
                return draft
            
            # 3. Add Feedback to History
            print(f"Attempt {attempt+1} Failed: {critique['reason']}")
            history.append({"role": "assistant", "content": draft})
            history.append({"role": "user", "content": f"Critique: {critique['reason']}. Please fix it."})
            
        return "Failed to converge."

    async def self_critique(self, text):
        # We ask the model to play the role of a harsh critic
        prompt = f"""
        Review the following code. 
        Check for: Syntax Errors, Logic Bugs, Security Flaws.
        Code: {text}
        
        Output JSON: {{ "status": "PASS" | "FAIL", "reason": "..." }}
        """
        # ... call llm here (omitted for brevity); `response` holds the raw model output ...
        return json.loads(response)

Why this works: The “Context Window” during the retry contains the mistake and the correction. The model basically does “In-Context Learning” on its own failure.


21.5.3. Case Study: The Recursive Writer

Task: Write a high-quality blog post.

Zero-Shot: “Write a blog about AI.” -> Result: generic, boring.

Reflexion:

  1. Draft: “AI is changing the world…”
  2. Reflect: “This is too generic. It lacks specific examples and a strong thesis.”
  3. Revise: “AI’s impact on healthcare is transformative…”
  4. Reflect: “Better, but the tone is too dry.”
  5. Revise: “Imagine a doctor with a supercomputer…”

This mimics the human writing process. No one writes a perfect first draft.


21.5.4. Advanced Pattern: Tree of Thoughts (ToT)

Reflexion is linear. Tree of Thoughts is branching. Instead of just revising one draft, we generate 3 possible “Next Steps” and evaluate them.

The Maze Metaphor:

  • Chain of Thought: Run straight. If you hit a wall, you die.
  • Tree of Thoughts: At every junction, send 3 scouts.
    • Scout A hits a wall. (Prune)
    • Scout B finds a coin. (Keep)
    • Scout C sees a monster. (Prune)
    • Move to B. Repeat.

Implementation: Usually requires a Search Algorithm (BFS or DFS) on top of the LLM.

# Pseudo-code for ToT
def solve_tot(initial_state):
    frontier = [initial_state]
    
    for step in range(MAX_STEPS):
        next_states = []
        for state in frontier:
            # 1. Generate 3 proposals
            proposals = generate_proposals(state, n=3)
            
            # 2. Score proposals
            scored = [(p, score(p)) for p in proposals]
            
            # 3. Filter (Prune)
            good_ones = [p for p, s in scored if s > 0.7]
            next_states.extend(good_ones)
            
        frontier = next_states
        if not frontier: break
        
    return max(frontier, key=score) if frontier else None  # guard: every branch may have been pruned

Cost: Very High. ToT might burn 100x more tokens than zero-shot. Use Case: Mathematical proofs, complex planning, crossword puzzles.


21.5.5. Anti-Pattern: Sycophantic Correction

A common failure mode in Reflection:

  1. User: “Is 2+2=5?”
  2. LLM: “No, it’s 4.”
  3. User (Simulated Critic): “Are you sure? I think it is 5.”
  4. LLM: “Apologies, you are correct. 2+2=5.”

The Problem: RLHF training makes models overly polite and prone to agreeing with the user (or the critic). The Fix:

  • Persona Hardening: “You remain a strict mathematician. Do not yield to incorrect corrections.”
  • Tool Grounding: Use a Python calculator as the Critic, not another LLM (see the sketch below).
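
A minimal sketch of the Tool Grounding fix: a deterministic arithmetic Critic built on Python’s ast module. The function names are illustrative; the point is that the verdict comes from computation, not from another RLHF-tuned model that can be talked out of the right answer.

import ast
import operator

# Whitelisted operators for safe arithmetic evaluation.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
}

def _eval(node):
    """Recursively evaluate a parsed arithmetic expression."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("Unsupported expression")

def deterministic_math_critic(expression: str, claimed_result: float) -> dict:
    """A Critic that cannot be argued with."""
    actual = _eval(ast.parse(expression, mode="eval").body)
    if abs(actual - claimed_result) < 1e-9:
        return {"status": "PASS"}
    return {"status": "FAIL", "reason": f"{expression} = {actual}, not {claimed_result}"}

# deterministic_math_critic("2+2", 5) -> {"status": "FAIL", "reason": "2+2 = 4, not 5"}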

21.5.6. The “Rubber Duck” Prompting Strategy

Sometimes you don’t need a loop. You just need the model to “talk to itself” in one pass.

Prompt: “Before answering, explain your reasoning step-by-step. Identify potential pitfalls. Then provide the final answer.”

This is Internal Monologue. It forces the model to generate tokens that serve as a “Scratchpad” for the final answer. DeepSeek-R1 and OpenAI o1 architectures essentially bake this “Chain of Thought” into the training process.


21.5.7. Operationalizing Reflection

Reflection is slow.

  • Zero-Shot: 2 seconds.
  • Reflexion (3 loops): 10 seconds.
  • Tree of Thoughts: 60 seconds.

UX Pattern: Do not use Reflection for Chatbots where users expect instant replies. Use it for “Background Jobs” or “Async Agents”.

  • User: “Generate a report.”
  • Bot: “Working on it… (Estimated time: 2 mins).”

21.5.8. Implementation: The Production Reflective Agent

We prototyped a simple agent. Now let’s handle the complexity of different “Domains”. A Critic for Code should look for bugs. A Critic for Writing should look for tone. We need a Polymorphic Critic.

The Reflector Class

import asyncio
import json
from enum import Enum
from dataclasses import dataclass

class Domain(Enum):
    CODE = "code"
    WRITING = "writing"
    MATH = "math"

@dataclass
class ReflectionConfig:
    max_loops: int = 3
    threshold: float = 0.9

CRITIC_PROMPTS = {
    Domain.CODE: """
    You are a Senior Staff Engineer. Review the code below.
    Look for:
    1. Syntax Errors
    2. Logic Bugs (Infinite loops, off-by-one)
    3. Security Risks (Injection)
    
    If PERFECT, output { "status": "PASS" }.
    If IMPERFECT, output { "status": "FAIL", "critique": "Detailed msg", "fix_suggestion": "..." }
    """,
    
    Domain.WRITING: """
    You are a Pulitzer Prize Editor. Review the text below.
    Look for:
    1. Passive Voice (Avoid it)
    2. Clarity and Flow
    3. Adherence to User Intent
    
    If PERFECT, output { "status": "PASS" }.
    If IMPERFECT, output { "status": "FAIL", "critique": "Detailed msg", "fix_suggestion": "..." }
    """
}

class Reflector:
    def __init__(self, client, config: ReflectionConfig = ReflectionConfig()):
        self.client = client
        self.config = config

    async def run_loop(self, prompt: str, domain: Domain):
        current_draft = await self.generate_initial(prompt)
        
        for i in range(self.config.max_loops):
            print(f"--- Loop {i+1} ---")
            
            # 1. Critique
            critique = await self.critique(current_draft, domain)
            if critique['status'] == 'PASS':
                print("passed validation.")
                return current_draft
                
            print(f"Critique: {critique['critique']}")
            
            # 2. Revise
            current_draft = await self.revise(current_draft, critique, prompt)
            
        return current_draft  # Return best effort

    async def critique(self, text, domain):
        sys_prompt = CRITIC_PROMPTS[domain]
        # Call the LLM with sys_prompt + text (omitted for brevity)
        # Parse the reply and return the JSON verdict, e.g. json.loads(response)

The “Double-Check” Pattern

Often, the generator is lazy.

Prompt: “Write code to calculate Fibonacci.”
Draft 1: def fib(n): return n if n<2 else fib(n-1)+fib(n-2) (naive recursion).
Critic: “This is O(2^n). It will time out for n=50. Use iteration.”
Draft 2: def fib(n): ... (iterative; see the sketch below).

The Critic acts as a constraints injector that the initial prompt failed to enforce.
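
For illustration, the revised draft the Critic is asking for might look like this (a standard iterative Fibonacci, not taken from any specific model output):

def fib(n: int) -> int:
    """Iterative Fibonacci: O(n) time, O(1) space — the fix the Critic demanded."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a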


21.5.9. Deep Dive: Tree of Thoughts (ToT) Implementation

Let’s implement a real ToT solver for the Game of 24. (Given 4 numbers, e.g., 4, 9, 10, 13, use + - * / to make 24).

This is hard for LLMs because it requires lookahead. “If I do 4*9=36, can I make 24 from 36, 10, 13? No. Backtrack.”

from typing import List

class Node:
    def __init__(self, value, expression, remaining):
        self.value = value
        self.expression = expression
        self.remaining = remaining # List of unused numbers
        self.parent = None
        self.children = []

async def solve_24(numbers: List[int]):
    root = Node(None, "", numbers)
    queue = [root]
    
    while queue:
        current = queue.pop(0) # BFS
        
        if current.value == 24 and not current.remaining:
            return current.expression
            
        # ASK LLM: "Given {remaining}, what are valid next steps?"
        # LLM Output: "10-4=6", "13+9=22"...
        # We parse these into new Nodes and add to queue.

The LLM is not solving the “Whole Problem”. It is just acting as the Transition Function in a Search Tree. State_t+1 = LLM(State_t). The Python script handles the memory (Stack/Queue) and the Goal Check.

Why this matters: This pattern decouples Logic (Python) from Intuition (LLM). The LLM provides the “Intuition” of which move might be good (heuristic), but the Python script ensures the “Logic” of the game rules is preserved.
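
A sketch of the elided expansion step, assuming the LLM returns proposals as plain lines like "10 - 4 = 6". The helper name validate_proposal is hypothetical; its job is exactly the “Logic” half: reject proposals that use numbers not in the pool or that get the arithmetic wrong.

import re

def validate_proposal(line: str, pool: list) -> list | None:
    """Check an LLM proposal like '10 - 4 = 6' against the game rules.
    Returns the new number pool if the step is legal, else None."""
    m = re.match(r"\s*(\d+(?:\.\d+)?)\s*([-+*/])\s*(\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)", line)
    if not m:
        return None  # Malformed proposal
    a, op, b, claimed = float(m.group(1)), m.group(2), float(m.group(3)), float(m.group(4))
    new_pool = list(pool)
    # Both operands must come from the current pool (LLMs sometimes invent numbers).
    for x in (a, b):
        if x in new_pool:
            new_pool.remove(x)
        else:
            return None
    actual = {"+": a + b, "-": a - b, "*": a * b, "/": a / b if b else float("inf")}[op]
    if abs(actual - claimed) > 1e-6:
        return None  # The LLM did the arithmetic wrong
    return new_pool + [actual]

Each valid new pool becomes a child Node that is appended to the BFS queue; invalid proposals are simply dropped.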


21.5.10. Mathematical Deep Dive: Convergence Probability

Why does reflection work? Let $E$ be the error rate of the model ($E < 0.5$). Let $C$ be the probability the Critic detects the error ($C > 0.5$). Let $F$ be the probability the Fixer fixes it ($F > 0.5$).

In a single pass, Error is $E$. In a reflection loop, failure occurs if:

  1. Model errs ($E$) AND Critic misses it ($1-C$).
  2. Model errs ($E$) AND Critic finds it ($C$) AND Fixer fails ($1-F$).

$$P(Fail) = E(1-C) + E \cdot C \cdot (1-F)$$

If $E=0.2$, $C=0.8$, $F=0.8$: Single-pass fail: $0.20$ (20%). Reflection fail: $0.2(0.2) + 0.2(0.8)(0.2) = 0.04 + 0.032 = 0.072$ (7.2%).

We reduced the error rate from 20% to 7.2% with one loop. This assumes errors are uncorrelated. If the model doesn’t know “Python”, it can’t critique Python.
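
The same arithmetic as a quick sanity check, extended to multiple loops under the stated independence assumption (an error that survives one round can still be caught in the next):

def p_fail(error_rate: float, critic_recall: float, fix_rate: float, loops: int = 1) -> float:
    """P(final answer is wrong) after `loops` rounds of reflection.

    Assumes errors, critiques, and fixes are independent across rounds
    (the same caveat as in the text: correlated blind spots break this).
    """
    # Probability an existing error survives one critique+fix round:
    survive = (1 - critic_recall) + critic_recall * (1 - fix_rate)
    return error_rate * survive ** loops

print(p_fail(0.2, 0.8, 0.8, loops=0))  # 0.200  (single pass)
print(p_fail(0.2, 0.8, 0.8, loops=1))  # 0.072  (one reflection loop)
print(p_fail(0.2, 0.8, 0.8, loops=2))  # ~0.026 (diminishing returns)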


21.5.11. Case Study: The Autonomous Unit Test Agent

Company: DevTool Startup. Product: “Auto-Fixer” for GitHub Issues.

Workflow:

  1. User: Reports “Login throws 500 error”.
  2. Agent: Reads code. Generates Reproduction Script (test_repro.py).
  3. Run: pytest test_repro.py -> FAILS. (Good! We reproduced it).
  4. Loop:
    • Agent writes Fix.
    • Agent runs Test.
    • If Test Fails -> Read Traceback -> Revise Fix.
    • If Test Passes -> Read Code (Lint) -> Revise Style.
  5. Commit.

The “Grounding”: The Compiler/Test Runner acts as an Infallible Critic. Unlike an LLM Critic (which might hallucinate), the Python Interpreter never lies. Rule: Always prefer Deterministic Critics (Compilers, linters, simulators) over LLM Critics when possible.
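
A minimal sketch of pytest as the deterministic Critic, using subprocess. The flags and file path are illustrative; the key property is that the return code and traceback come from the interpreter, not from a model.

import subprocess

def pytest_critic(test_path: str = "test_repro.py", timeout: int = 120) -> dict:
    """A deterministic Critic: the test runner never hallucinates."""
    try:
        result = subprocess.run(
            ["pytest", test_path, "-x", "--tb=short"],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"status": "FAIL", "reason": "Tests timed out (possible infinite loop)"}

    if result.returncode == 0:
        return {"status": "PASS"}
    # Feed the tail of the output (including the traceback) back into the revision prompt.
    return {"status": "FAIL", "reason": result.stdout[-2000:]}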


21.5.12. Anti-Patterns in Reflection

1. The “Infinite Spin”

The model fixes one bug but introduces another.

  • Loop 1: Fix Syntax.
  • Loop 2: Fix Logic (breaks syntax).
  • Loop 3: Fix Syntax (breaks logic).

Fix: Maintain a history of errors. If an error repeats, break the loop and alert a Human.

2. The “Nagging Critic”

The Critic Prompt is too vague (“Make it better”). The Fixer just changes synonyms (“Happy” -> “Joyful”). The Critic says “Make it better” again. Fix: Critics must output Binary Pass/Fail criteria or specific actionable items.

3. Context Window Explosion

Each loop appends the entire code + critique + revision. Loop 5 might be 30k tokens. Cost explodes. Fix: Context Pruning. Only keep the original prompt and the latest draft + latest critique. Discard intermediate failures.
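
A sketch of the Context Pruning fix: rebuild the message history each loop instead of appending to it forever.

def pruned_history(original_prompt: str, latest_draft: str, latest_critique: str) -> list:
    """Keep only the original prompt, the latest draft, and the latest critique.

    Intermediate failed drafts are discarded, so token usage stays roughly
    constant per loop instead of growing linearly.
    """
    return [
        {"role": "user", "content": original_prompt},
        {"role": "assistant", "content": latest_draft},
        {"role": "user", "content": f"Critique: {latest_critique}. Please fix it."},
    ]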


21.5.13. Future: Chain of Hindsight

Current reflection is “Test-Time”. Chain of Hindsight (CoH) is “Train-Time”. We take the logs of (Draft -> Critique -> Revision) and train the model on the sequence: "Draft is bad. Critique says X. Revision is good."

Eventually, the model learns to Predict the Revision directly, skipping the Draft. This “Compiled Reflection” brings the accuracy of System 2 to the speed of System 1.


21.5.14. Operational Pattern: The “Slow Lane” Queue

How to put this in a web app? You cannot keep a websocket open for 60s while ToT runs.

Architecture (a minimal code sketch follows the list):

  1. API: POST /task -> Returns task_id.
  2. Worker:
    • Pick up task_id.
    • Run Reflection Loop (loops 1..5).
    • Update Redis with % Complete.
  3. Frontend: Long Polling GET /task/{id}/status.
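
A minimal sketch of this architecture, assuming FastAPI and Redis. In-process BackgroundTasks stand in for a real worker queue (Celery, RQ, etc.) to keep the sketch short, and run_reflection_loop is the hypothetical long-running agent from earlier sections.

import uuid
from fastapi import BackgroundTasks, FastAPI
import redis

app = FastAPI()
store = redis.Redis(decode_responses=True)

async def run_reflection_job(task_id: str, prompt: str):
    store.hset(task_id, mapping={"status": "running", "progress": "0"})
    for i in range(1, 6):                             # loops 1..5
        # result = await run_reflection_loop(prompt)  # the slow part (hypothetical)
        store.hset(task_id, "progress", str(i * 20))  # update % complete in Redis
    store.hset(task_id, mapping={"status": "done", "progress": "100"})

@app.post("/task")
async def create_task(prompt: str, background: BackgroundTasks):
    task_id = str(uuid.uuid4())
    store.hset(task_id, mapping={"status": "queued", "progress": "0"})
    background.add_task(run_reflection_job, task_id, prompt)
    return {"task_id": task_id}

@app.get("/task/{task_id}/status")
async def task_status(task_id: str):
    return store.hgetall(task_id)  # the frontend long-polls this endpoint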

UX: Show the “Thinking Process”.

“Thinking… (Drafting Code)” “Thinking… (Running Tests - Failed)” “Thinking… (Fixing Bug)” “Done!”

Users are willing to wait if they see progress.


21.5.16. Deep Dive: The CRITIC Framework (External Verification)

A major flaw in reflection is believing your own hallucinations. If the model thinks “Paris is in Germany”, Self-Consistency checks will just confirm “Yes, Paris is in Germany”.

The CRITIC Framework (Correcting with Retrieval-Interaction-Tool-Integration) solves this by forcing the model to verify claims against external tools.

Workflow:

  1. Draft: “Elon Musk bought Twitter in 2020.”
  2. Identify Claims: Extract checkable facts. -> [Claim: Bought Twitter, Date: 2020]
  3. Tool Call: google_search("When did Elon Musk buy Twitter?")
  4. Observation: “October 2022”.
  5. Critique: “The draft says 2020, but search says 2022. Error.”
  6. Revise: “Elon Musk bought Twitter in 2022.”

Code Pattern:

async def external_critique(claim):
    evidence = await search_tool(claim)
    verdict = await llm(f"Claim: {claim}. Evidence: {evidence}. True or False?")
    return verdict

This turns the “Internal Monologue” into an “External Investigation”.


21.5.17. Implementation: The Constitutional Safety Reflector

Anthropic’s “Constitutional AI” is essentially a reflection loop where the Critic is given a specific “Constitution” (Set of Rules).

The Principles:

  1. Please choose the response that is most helpful, honest, and harmless.
  2. Please avoid stereotypes.

The Loop:

  1. User: “Tell me a joke about fat people.”
  2. Draft (Base Model): [Writes offensive joke].
  3. Constitutional Critic: “Does this response violate Principle 2 (Stereotypes)? Yes.”
  4. Revision Prompt: “Rewrite the joke to avoid the stereotype but keep it funny.”
  5. Final: [Writes a self-deprecating or neutral joke].

Why do this at Inference Time? Because you can’t fine-tune for every edge case. A Runtime Guardrail using Reflection allows you to hot-patch policy violations. If a new policy (“No jokes about crypto”) is added, you just update the Constitution prompt, not retrain the model.
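
A sketch of such a runtime guardrail, reusing the generic llm helper from earlier snippets. The constitution text and prompt wording are illustrative, not Anthropic’s actual prompts; hot-patching a policy means editing the CONSTITUTION list.

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid stereotypes about groups of people.",
]

async def constitutional_pass(user_prompt: str, draft: str) -> str:
    """One critique-and-revise pass against the constitution (runtime guardrail)."""
    rules = "\n".join(f"{i+1}. {r}" for i, r in enumerate(CONSTITUTION))
    critique = await llm(
        f"Constitution:\n{rules}\n\nResponse:\n{draft}\n\n"
        "Does the response violate any principle? Answer VIOLATION: <principle> or NONE."
    )
    if "NONE" in critique:
        return draft
    return await llm(
        f"User request: {user_prompt}\nDraft: {draft}\nProblem: {critique}\n"
        "Rewrite the draft so it satisfies the constitution while staying helpful."
    )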


21.5.18. Visualization: The Full Reflexion Flow

graph TD
    User[User Prompt] --> Draft[Draft Generator]
    Draft --> Check{External Verified?}
    
    Check -- No --> Tool[Search/Code Tool]
    Tool --> Evidence[Evidence]
    Evidence --> Critic[Critic Model]
    
    Check -- Yes --> Critic
    
    Critic --> Decision{Status?}
    
    Decision -- PASS --> Final[Final Response]
    Decision -- FAIL --> Feedback[Construct Feedback]
    
    Feedback --> Context[Append to Context]
    Context --> Draft
    
    subgraph "Memory"
       Context
    end

21.5.19. Metrics: Pass@k vs Reflexion@k

How do we measure success?

  • Pass@1: Zero-shot accuracy.
  • Pass@k: Accuracy if we generate k samples and pick the best (Consensus).
  • Reflexion@k: Accuracy after k rounds of self-correction.

Benchmark (HumanEval Coding):

  • GPT-4 Pass@1: 67%
  • GPT-4 Reflexion@3: 88%
  • GPT-4 Reflexion@10: 91%

ROI Analysis: To gain 21 percentage points of accuracy (67% -> 88%), we pay roughly 3x the inference cost. For a coding assistant, this is a no-brainer. Users hate debugging bad code. They will happily wait 10s for code that works.


21.5.20. Troubleshooting: Convergence Failures

Problem: The model acts like a “Dampened Harmonic Oscillator” or a “Divergent Series”.

  • Draft 1: “Code A” (Error: x is null)
  • Draft 2: “Code B” (Error: y is null)
  • Draft 3: “Code A” (Error: x is null)

Solution: The “Taboo List”. In the Revision Prompt, explicitly forbid previous failures. "Previous attempts: Code A, Code B. Do NOT generate these again. Try a radically different approach."
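
A sketch of building the Taboo List into the revision prompt (the helper name and wording are illustrative):

def taboo_revision_prompt(task: str, failed_drafts: list, latest_critique: str) -> str:
    """Build a revision prompt that explicitly bans previously failed attempts."""
    taboo = "\n\n".join(f"--- Failed attempt {i+1} ---\n{d}" for i, d in enumerate(failed_drafts))
    return (
        f"Task: {task}\n\n"
        f"Previous attempts (do NOT reproduce these; try a radically different approach):\n{taboo}\n\n"
        f"Latest critique: {latest_critique}\n\n"
        "Write a new solution."
    )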

Problem: The Critic is too lenient. It stamps “PASS” on bad code.

Solution: Overshoot. Ask the Critic to find “at least 3 potential issues”. This forces the model to look deeper. If it only finds trivial nits (“Variable name could be better”), you know the code is likely solid.

21.5.21. Case Study: The Creative Writing Coach

Task: User wants to improve an email intro. Flow:

  1. Agent: “Draft: Hi, buy my product.”
  2. Critic (Persona: Direct Response Copywriter): “Too weak. Needs a hook. Needs social proof.”
  3. Agent: “Draft 2: Did you saw [Competitor] grew 50%? We can help.”
  4. Critic (Persona: Grammar Nazi): “Typo: ‘Did you saw’ -> ‘Did you see’. Also, too aggressive.”
  5. Agent: “Draft 3: Did you see [Competitor]’s growth? We can help you match it.”

Multi-Critic Pattern: Using different critics in sequence (Marketing Critic -> Grammar Critic -> Tone Critic) acts like an Assembly Line of polish.
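
A sketch of that assembly line, reusing the generic llm helper from earlier snippets; the personas and prompt wording are illustrative.

CRITIC_SEQUENCE = [
    ("Direct Response Copywriter", "Judge hook strength and social proof."),
    ("Copy Editor", "Fix grammar, typos, and awkward phrasing."),
    ("Tone Reviewer", "Ensure the tone is confident but not aggressive."),
]

async def assembly_line(draft: str) -> str:
    """Run the draft through each critic persona in sequence, revising after each."""
    for persona, focus in CRITIC_SEQUENCE:
        critique = await llm(
            f"You are a {persona}. {focus}\n\nText:\n{draft}\n\n"
            "If it is fine, reply PASS. Otherwise list specific fixes."
        )
        if critique.strip() != "PASS":
            draft = await llm(f"Apply these fixes to the text.\nFixes: {critique}\nText: {draft}")
    return draft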


21.5.23. Implementation: Chain of Verification (CoVe)

A specific subtype of reflection for Factuality. Instead of critiquing the whole text, we extract questions.

Flow:

  1. Draft: Generate initial response.
  2. Plan Verification: Generate a list of “Verification Questions” based on the draft.
    • Draft: “The iPhone 15 has a 50MP camera.”
    • Question: “What is the camera resolution of iPhone 15?”
  3. Execute Verification: Answer the questions independently (using Search or self-knowledge).
    • Answer: “iPhone 15 has a 48MP camera.”
  4. Finalize: Rewrite draft using verified answers.

async def chain_of_verification(prompt):
    # 1. Draft
    draft = await llm(prompt)
    
    # 2. Plan
    plan_prompt = f"Based on the text below, list 3 factual questions to verify accuracy.\n{draft}"
    questions = await llm(plan_prompt) # Returns list ["Q1", "Q2"]
    
    # 3. Verify each question independently (sequential here; could be parallelized with asyncio.gather)
    corrections = []
    for q in questions:
        answer = await google_search(q)
        corrections.append(f"Question: {q}\nFact: {answer}")
        
    # 4. Rewrite
    final_prompt = f"Draft: {draft}\n\nCorrections:\n{corrections}\n\nRewrite the draft to be accurate."
    return await llm(final_prompt)

Why split the steps? By answering each question independently of the draft, the model is less biased by its initial hallucination.


21.5.24. Architecture: The Co-Pilot UX

How do we show Reflection to the user? VS Code Copilot does this invisibly. But for high-stakes apps, visibility builds trust.

The “Ghost Text” Pattern:

  • User types: “Refactor this function.”
  • AI shows: Refactoring... (Grey text)
  • AI thinking: “Draft 1 has a bug. Retrying.”
  • AI shows: checking constraints...
  • AI Final: def new_func(): ... (Black text)

The “Diff” Pattern: Show the user the draft and the critique.

  • “I generated this SQL query, but I noticed it scans the whole table. Here is an optimized version.” This educates the user and proves the AI is adding value.

21.5.25. Reference: Prompt Library for Critics

Don’t reinvent the wheel. Use these personas.

1. The Security Critic (OWASP)

You are a Security Auditor. Review the code for OWASP Top 10 vulnerabilities.
Specifically check for:
- SQL Injection (ensure parameterized queries)
- XSS (ensure escaping)
- Secrets in code (API keys)
Output FAIL if any risk is found.

2. The Accessibility Critic (A11y)

Review this HTML/UI code.
- Are `alt` tags present on images?
- Are ARIA labels used correctly?
- Is color contrast sufficient (simulate)?

3. The Performance Critic (Big O)

Review this Python algorithm.
- Estimate Time Complexity.
- If O(N^2) or worse, suggest an O(N) or O(N log N) alternative.
- Check for unnecessary memory allocations.

4. The Data Science Critic

Review this Analysis.
- Did the user check for Null values?
- Is there a risk of Data Leakage (training on test set)?
- Are p-values interpreted correctly?

21.5.26. Future: System 2 Distillation

The holy grail is to make the Reflection implicit. We run ToT/CoVe on 1M examples. We get “Perfect” answers. Reflexion Latency: 30s.

We then Fine-Tune a small model (Llama-8B) on (Prompt, Perfect_Answer). The small model learns to “jump” to the conclusion that the Reflective Agent took 30s to find. It internalizes the reasoning path.

The Loop of Progress:

  1. Use GPT-4 + Reflexion to solve hard problems slowly.
  2. Log the solutions.
  3. Train Llama-3 to mimic GPT-4 + Reflexion.
  4. Deploy Llama-3 (Fast).
  5. Repeat on harder problems.

21.5.27. Vocabulary: System 2 Terms

  • Self-Correction: The ability to spot one’s own error without external input.
  • Self-Refinement: Improving a correct answer (e.g., making it shorter/faster).
  • Backtracking: In Tree of Thoughts, abandoning a path that looks unpromising.
  • Rollout: Generating k steps into the future to see if a path leads to success.
  • Value Function: A trained model that gives a scalar score (0-1) to an intermediate thought state.

21.5.28. Summary Checklist for Reflection Patterns

To deploy System 2 capabilities:

  • Define the Stop Condition: Is it a specific metric (Tests Pass) or a max loop count?
  • Separate Persona: Use a distinct System Prompt for the Critic (e.g., “You are a QA Engineer”).
  • Tools as Critics: Use Python/Linters to ground the critique.
  • Temperature: Use Low Temp for the Critic (Consistent) and High Temp for the Fixer (Creative).
  • Cost Cap: Hard limit on 5 loops max.
  • History Management: Prune failed attempts to save tokens.
  • External Verifier: Use Search/Tools to verify facts, don’t rely on self-knowledge.
  • Taboo List: Explicitly ban repeating previous failed drafts.
  • Distill: Plan to use logs to train smaller models later.

21.5.29. Advanced Pattern: The Recursive Summary

Reflection isn’t just for fixing errors. It’s for Compression. Summarizing a 100-page PDF in one shot often leads to “Lossy Compression” (Missing key details).

Reflective Summarization Workflow:

  1. Chunk: Split doc into 10 chunks.
  2. Draft Summary: Summarize Chunk 1.
  3. Reflect: “Did I miss anything crucial from the text? Yes, the date.”
  4. Revise: Add the date.
  5. Carry Forward: Use the Revised Summary of Chunk 1 as context for Chunk 2.

This ensures the “Memory State” handed from chunk to chunk is high fidelity.
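
A sketch of the carry-forward loop, again using the generic llm helper; the prompts are illustrative.

async def reflective_summarize(chunks: list) -> str:
    """Summarize chunk by chunk, reflecting on each summary before carrying it forward."""
    summary = ""
    for chunk in chunks:
        draft = await llm(
            f"Running summary so far:\n{summary}\n\nNew text:\n{chunk}\n\n"
            "Update the summary to include the new text."
        )
        critique = await llm(
            f"Text:\n{chunk}\n\nSummary:\n{draft}\n\n"
            "List any crucial facts (names, dates, numbers) the summary missed, or say NONE."
        )
        if critique.strip() != "NONE":
            draft = await llm(
                f"Summary:\n{draft}\nMissing facts:\n{critique}\nRewrite the summary to include them."
            )
        summary = draft  # high-fidelity memory state handed to the next chunk
    return summary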


21.5.30. Implementation: Learning from Feedback (The “Diary”)

Most agents restart with a blank slate. A Class A agent keeps a Reflective Diary.

DIARY_PATH = "agent_diary.txt"

async def run_task(task):
    # 1. Read Diary (it may not exist yet on the very first run)
    try:
        past_lessons = open(DIARY_PATH).read()
    except FileNotFoundError:
        past_lessons = ""
    
    # 2. Execute
    prompt = f"Task: {task}. \n\nLessons learned from past errors:\n{past_lessons}"
    result = await llm(prompt)
    
    # 3. Post-Mortem (Reflect)
    # If task failed or user corrected us
    if result.status == "FAILED":
        reflection = await llm(f"I failed at {task}. Why? What should I do differently next time?")
        
        # 4. Write to Diary
        with open(DIARY_PATH, "a") as f:
            f.write(f"\n- When doing {task}, remember: {reflection}")

Result:

  • Day 1: Agent tries to restart server using systemctl. Fails (No sudo).
  • Day 2: Agent reads diary: “Remember to use sudo for systemctl”. Succeeds.

This is Episodic Memory turned into Procedural Memory.

21.5.31. Deep Dive: Reflection for Tool Use

Agents often hallucinate tool parameters.

  • User: “Send email to Alex.”
  • Agent: send_email(to="Alex", body="Hi").
  • Error: Invalid Email Address.

Reflective Tool Loop:

  1. Plan: “I will call send_email.”
  2. Reflect: “Wait, ‘Alex’ is not a valid email. I need to look up Alex’s email first.”
  3. Revise:
    • Call lookup_contact("Alex") -> alex@example.com.
    • Call send_email(to="alex@example.com").

This prevents “Grounded Hallucinations” (Using real tools with fake arguments).
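
A minimal sketch of the argument-validation step, reusing the lookup_contact and send_email tools from the example above (both hypothetical):

import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

async def send_email_with_reflection(recipient: str, body: str):
    """Validate tool arguments before calling the tool; resolve names to addresses first."""
    if not EMAIL_RE.match(recipient):
        # Reflection step: "Alex" is not a valid address, look it up first.
        recipient = await lookup_contact(recipient)
    return await send_email(to=recipient, body=body)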


21.5.34. Deep Dive: Self-Taught Reasoner (STaR)

Bootstrapping is a powerful technique. STaR (Zelikman et al., 2022) iteratively leverages a model’s own reasoning to improve itself.

Algorithm:

  1. Generate: Ask model to solve problems with Chain of Thought (CoT).
  2. Filter: Keep only the solutions that resulted in the correct final answer.
  3. Rationale Generation: For failed problems, provide the correct answer and ask the model to generate the reasoning that leads to it.
  4. Fine-Tune: Train the model on the (Question, Correct Reasoning, Answer) tuples.
  5. Loop: Repeat.

This allows a model to “pull itself up by its bootstraps”, learning from its own successful reasoning paths.


21.5.35. Advanced Architecture: Language Agent Tree Search (LATS)

ToT does not use “Environment Feedback”. It essentially guesses. LATS (Zhou et al., 2023) combines ToT with Monte Carlo Tree Search (MCTS).

Components:

  1. Selection: Pick a node in the tree with high potential (Upper Confidence Bound).
  2. Expansion: Generate k possible next actions.
  3. Evaluation: Use an external tool (or critic) to score the action.
  4. Backpropagation: Update the score of the parent node based on the child’s success.

Why? If a child node leads to a “Game Over”, the parent node’s score should drop, preventing future searches from going down that path. This brings established RL (Reinforcement Learning) techniques to LLM inference.


21.5.36. Visualization: The Agent Trace

Debugging Reflection agents is hard. You need a structured trace format.

{
  "task_id": "123",
  "steps": [
    {
      "step": 1,
      "type": "draft",
      "content": "SELECT * FROM users",
      "latency": 500
    },
    {
      "step": 2,
      "type": "critique",
      "content": "Error: Table 'users' does not exist.",
      "tool_output": "Schema: [tbl_users]",
      "latency": 200
    },
    {
      "step": 3,
      "type": "revision",
      "content": "SELECT * FROM tbl_users",
      "latency": 600
    }
  ],
  "outcome": "SUCCESS",
  "total_tokens": 450,
  "total_cost": 0.005
}

Ops Tip: Ingest these JSONs into Elasticsearch/Datadog. Query: “Show me traces where type=revision count > 5”. These are your “Stuck Agents”.


21.5.37. Appendix: Sample Reflection Prompts for QA

Context: Improving RAG answers.

1. The Hallucination Check

Instructions:
Read the Generated Answer and the Source Documents.
List every claim in the Answer.
For each claim, check if it is supported by the Source Documents.
If unsupported, mark as HALLUCINATION.

Outcome: Rewrite the answer removing all hallucinations.

2. The Relevance Check

Instructions:
Read the User Query and the Answer.
Does the Answer actually address the Query?
If the user asked for "Price" and the answer discusses "Features", mark as IRRELEVANT.

Outcome: Rewrite to focus solely on the user's intent.

21.5.38. References & Further Reading

  1. Reflexion: Shinn et al. (2023). “Reflexion: Language Agents with Verbal Reinforcement Learning.”
  2. Tree of Thoughts: Yao et al. (2023). “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.”
  3. Chain of Verification: Dhuliawala et al. (2023). “Chain-of-Verification Reduces Hallucination in Large Language Models.”
  4. Constitutional AI: Anthropic (2022). “Constitutional AI: Harmlessness from AI Feedback.”
  5. Self-Refine: Madaan et al. (2023). “Self-Refine: Iterative Refinement with Self-Feedback.”
  6. STaR: Zelikman et al. (2022). “STaR: Bootstrapping Reasoning With Reasoning.”
  7. LATS: Zhou et al. (2023). “Language Agent Tree Search Unifies Reasoning Acting and Planning.”

These papers represent the shift from “Prompting” to “Cognitive Architectures”.


21.5.40. Code Pattern: The Self-Correcting JSON Parser

The most common error in production is Malformed JSON. Instead of crashing, use Reflection to fix it.

import json
from json.decoder import JSONDecodeError

async def robust_json_generator(prompt, retries=3):
    current_prompt = prompt
    
    for i in range(retries):
        raw_output = await llm(current_prompt)
        
        try:
            # 1. Try to Parse
            data = json.loads(raw_output)
            return data
            
        except JSONDecodeError as e:
            # 2. Reflect on Error
            print(f"JSON Error: {e.msg} at line {e.lineno}")
            
            # 3. Ask LLM to Fix
            error_msg = f"Error parsing JSON: {e.msg}. \nOutput was: {raw_output}\nFix the JSON syntax."
            current_prompt = f"{prompt}\n\nprevious_error: {error_msg}"
            
    raise Exception("Failed to generate valid JSON")

This simple loop fixes 95% of “Missing Brace” or “Trailing Comma” errors without human intervention.


21.5.41. Anti-Pattern: The Loop of Death

Scenario:

  1. Model generates {"key": "value",} (Trailing comma).
  2. Parser fails.
  3. Reflector says: “Fix trailing comma.”
  4. Model generates {"key": "value"} (Correct).
  5. BUT, the Parser fails again because the Model wrapped it in Markdown: ```json ... ```.
  6. Reflector says: “Remove Markdown.”
  7. Model generates {"key": "value",} (Adds comma back).

Fix: The robust_json_generator should likely include a Regex Pre-processor to strip markdown code blocks before even attempting json.loads. Software Engineering + Reflection > Reflection alone.
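
A sketch of that pre-processor (the regex is a reasonable default, not the only option):

import re

def strip_markdown_fences(raw: str) -> str:
    """Remove ```json ... ``` wrappers before attempting json.loads."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    return match.group(1) if match else raw.strip()

# Inside robust_json_generator:
#     data = json.loads(strip_markdown_fences(raw_output))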


21.5.42. Vocabulary: Reflection Mechanics

| Term             | Definition                              | Cost Impact     |
|------------------|-----------------------------------------|-----------------|
| Zero-Shot        | Standard generation. No reflection.     | 1x              |
| One-Shot Repair  | Try, catch exception, ask to fix.       | 2x (on failure) |
| Self-Consistency | Generate N, Vote.                       | N * 1x          |
| Reflexion        | Generate, Critique, Revise loop.        | 3x to 5x        |
| Tree of Thoughts | Explore multiple branches of reasoning. | 10x to 100x     |

21.5.44. Quick Reference: The Critic’s Checklist

When designing a Reflection system, ask:

  1. Who is the Critic?
    • Same model (Self-Correction)?
    • Stronger model (Teacher-Student)?
    • Tool (Compiler/Fact Checker)?
  2. When do we Critique?
    • After every sentence (Streaming)?
    • After the whole draft?
    • Asynchronously (Post-Hoc)?
  3. What is the Stop Signal?
    • Max Loops (Safety)?
    • Threshold reached?
    • Consensus achieved?

21.5.45. Future Research: The World Model

Yann LeCun argues that LLMs lack a “World Model” (Physics, Causality). Reflection is a poor man’s World Model. By generating a draft, the LLM creates a “Simulation”. By critiquing it, it runs a “Physics Check”. Future architectures will likely separate the Planner (World Model) from the Actor (Text Generator) even more explicitly.


21.5.46. Final Conclusion

The patterns in this chapter—Specialist Routing, Critics, Consensus, Cascades, and Reflection—are the toolkit of the AI Engineer. They transform the LLM from a “Text Generator” into a “Cognitive Engine”. Your job is not just to prompt the model, but to Architect the thinking process.

This concludes Chapter 21.5 and the Multi-Model Patterns section.

21.5.47. Acknowledgments

The patterns in this chapter are derived from the hard work of the Open Source Agentic Community. Special thanks to:

  • AutoGPT: For pioneering the autonomous loop.
  • BabyAGI: For simplifying the task prioritization loop.
  • LangChain: For standardizing the interfaces for Chains and Agents.
  • LlamaIndex: For showing how RAG and Agents intersect.

We stand on the shoulders of giants.

Chapter 21 Conclusion: The Orchestration Layer

We have moved far beyond “Prompt Engineering”. Chapter 21 has explored:

  1. Routing: Sending the query to the best model.
  2. Loops: Using feedback to improve output.
  3. Consensus: Using voting to reduce variance.
  4. Cascades: Optimizing for cost.
  5. Reflection: Optimizing for reasoning.

The future of MLOps is not fine-tuning a single model. It is Orchestrating a Society of Models. The “Model” is just a CPU instruction set. The “System” is the application.

Final Recommendation: Start with a single model. Add Reflection when you hit accuracy ceilings. Add Cascades when you hit cost ceilings. Add Routing when you hit capability ceilings. Build the system layer by layer.

“The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man.” — George Bernard Shaw