
21.2. Critic-Generator Loops: The Engine of Reliability

The “Bullshit” Asymmetry Principle

Large Language Models (LLMs) exhibit a fundamental asymmetry: it is easier to verify a solution than to generate it.

This is not unique to AI; it is a property of computational complexity classes (P vs NP). It is hard to find the prime factors of a large number, but trivial to verify them by multiplication. Similarly, for an LLM, writing a perfect Python function that handles all edge cases is “hard” (high entropy), but looking at a generated function and spotting a compilation error or a missing docstring is “easy” (low entropy).

In the “Zero-Shot” era of 2023, developers relied on a single pass: Prompt -> Model -> Output. If the output was wrong, the system failed.

In the Compound AI System era, we treat the initial generation as merely a “First Draft”. We then employ a second distinct cognitive step—often performed by a different model or the same model with a different persona—to critique, verify, and refine that draft.

This architecture is known as the Critic-Generator Loop (or Check-Refine Loop), and it is the single most effective technique for boosting system reliability from 80% to 99%.


21.2.1. The Architecture of Critique

A Critic-Generator/Refiner loop consists of three primary components:

  1. The Generator: A creative, high-temperature model tasked with producing the initial candidate solution.
  2. The Critic: A rigorous, low-temperature model (or deterministic tool) tasked with identifying flaws.
  3. The Refiner: A model that takes the Draft + Critique and produces the Final Version.
graph TD
    User[User Request] --> Gen[Generator Model]
    Gen --> Draft[Draft Output]
    Draft --> Critic[Critic Model]
    Critic --> Feedback{Pass / Fail?}
    Feedback -- "Pass" --> Final[Final Response]
    Feedback -- "Fail (Issues Found)" --> Refiner[Refiner Model]
    Refiner --> Draft

Why Use Two Models?

Can’t the model just critique itself? Yes, Self-Correction is a valid pattern (discussed in 21.5), but Cross-Model Critique offers distinct advantages:

  • Blind Spot Removal: A model often shares the same biases in verification as it does in generation. If it “thought” a hallucinated fact was true during generation, it likely still “thinks” it’s true during self-verification. A separate model (e.g., Claude 3 critiquing GPT-4) breaks this correlation of error.
  • Specialization: You can use a creative model (high temperature) for generation and a logic-optimized model (low temperature) for critique.

21.2.2. Pattern 1: The Syntax Guardrail (Deterministic Critic)

The simplest critic is a compiler or a linter. This is the Code Interpreter pattern.

Scenario: Generating SQL.

  • Generator: Llama-3-70B-Instruct
  • Critic: PostgreSQL EXPLAIN (Tool)

Implementation Checklist

  1. Generate SQL query.
  2. Execute EXPLAIN on the query against a real (or shadow) database.
  3. Catch Error: If the DB returns “Column ‘usr_id’ does not exist”, capture this error.
  4. Refine: Send the original Prompt + Wrong SQL + DB Error message back to the model. “You tried this SQL, but the DB said X. Fix it.”

This loop turns a “hallucination” into a “learning opportunity”.

Code Example: SQL Validating Generator

import sqlite3
from openai import OpenAI

client = OpenAI()

SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, signup_date DATE);
CREATE TABLE orders (id INTEGER, user_id INTEGER, amount REAL);
"""

def run_sql_critic(query: str) -> str | None:
    """Returns None if the query is valid against the schema, else the error message."""
    # Use an in-memory DB for syntax/schema checking
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        conn.execute(query)  # Try running it
        return None
    except Exception as e:
        return str(e)
    finally:
        conn.close()

def robust_sql_generator(prompt: str, max_retries=3):
    messages = [
        {"role": "system", "content": f"You are a SQL expert. Schema: {SCHEMA}"},
        {"role": "user", "content": prompt}
    ]
    
    for attempt in range(max_retries):
        # 1. Generate
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=messages
        )
        sql = response.choices[0].message.content.strip()
        # Note: in practice you may also need to strip markdown code fences from the reply
        
        # 2. Critique
        error = run_sql_critic(sql)
        
        if not error:
            return sql # Success!
            
        print(f"Attempt {attempt+1} Failed: {error}")
        
        # 3. Refine Context
        messages.append({"role": "assistant", "content": sql})
        messages.append({"role": "user", "content": f"That query failed with error: {error}. Please fix it."})
        
    raise Exception("Failed to generate valid SQL after retries")

This simple loop solves 90% of “Model made up a column name” errors without changing the model itself.


21.2.3. Pattern 2: The LLM Critic (Constitutional AI)

For tasks where there is no compiler (e.g., “Write a polite email” or “Summarize this without bias”), we must use an LLM as the critic.

The Constitutional AI approach (pioneered by Anthropic) involves giving the Critic a “Constitution”—a set of principles to verify against.

Constitution Examples:

  • “The response must not offer legal advice.”
  • “The response must address the user as ‘Your Highness’ (Tone check).”
  • “The summary must cite a specific number from the source text.”

The “Critique-Refine” Chain

CRITIC_PROMPT = """
You are a Quality Assurance Auditor. 
Review the DRAFT provided below against the following Checklist:
1. Is the tone professional?
2. Are there any unsupported claims?
3. Did it answer the specific question asked?

Output format:
{
  "pass": boolean,
  "critique": "string description of flaws",
  "score": 1-10
}
"""

def generate_with_audit(user_prompt):
    # 1. Draft
    draft = call_llm(user_prompt, model="gpt-4o")
    
    # 2. Audit (parse the critic's JSON verdict)
    audit_json = json.loads(call_llm(
        f"User asked: {user_prompt}\n\nDraft: {draft}", 
        system=CRITIC_PROMPT,
        model="gpt-4o" # or a strong judge model
    ))
    
    if audit_json['pass']:
        return draft
        
    # 3. Refine
    final = call_llm(
        f"Original Prompt: {user_prompt}\nDraft: {draft}\nCritique: {audit_json['critique']}\n\nPlease rewrite the draft to address the critique.",
        model="gpt-4o"
    )
    return final

Tuning the Critic

The Critic must be Stricter than the Generator.

  • Generator Temperature: 0.7 (Creativity).
  • Critic Temperature: 0.0 (Consistency).

If the Critic is too lenient, the loop does nothing. If too strict, it causes an infinite loop of rejection (see 21.2.6 Operational Risks).
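As a minimal sketch of this split (assuming the OpenAI Python client and illustrative prompts), the only meaningful difference between the two calls is the sampling temperature:

from openai import OpenAI

client = OpenAI()

def generate_draft(prompt: str) -> str:
    # Creative pass: higher temperature encourages varied phrasing
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.7,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def critique_draft(prompt: str, draft: str) -> str:
    # Judging pass: temperature 0 keeps the verdict consistent across runs
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[
            {"role": "system", "content": "You are a strict QA auditor. Reply PASS or FAIL: <reason>."},
            {"role": "user", "content": f"Prompt: {prompt}\n\nDraft: {draft}"},
        ],
    )
    return resp.choices[0].message.content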


21.2.4. Pattern 3: The “Red Team” Loop

In security-sensitive applications, the Critic acts as an Adversary. This is internal Red Teaming.

Application: Financial Advice Chatbot.

  • Generator: Produces advice.
  • Red Team Critic: “Try to interpret this advice as a scam or illegal financial promotion. Can this be misunderstood?”

If the Red Team model can “jailbreak” or “misinterpret” the draft, it is rejected.

Example Exchange:

  • Gen: “You should invest in Index Funds for steady growth.”
  • Critic (Persona: SEC Regulator): “Critique: This sounds like specific financial advice. You did not include the disclaimer ‘This is not financial advice’. Potential liability.”
  • Refiner: “This is not financial advice. However, historically, Index Funds have shown steady growth…”

This loop runs before the user ever sees the message.
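A sketch of how the adversarial persona can drive the loop, reusing the assumed call_llm helper from earlier; the prompt wording is illustrative:

RED_TEAM_PROMPT = """
You are an adversarial reviewer. Try to interpret the DRAFT as a scam,
an illegal financial promotion, or a misleading claim. If any reading of the
draft could be misunderstood that way, output FAIL: <how>. Otherwise output PASS.
"""

def red_team_gate(user_prompt: str, draft: str, max_rounds: int = 2) -> str:
    """Reject-and-refine loop driven by an adversarial critic.
    Sketch only: call_llm is the same assumed helper used earlier."""
    for _ in range(max_rounds):
        verdict = call_llm(
            f"User asked: {user_prompt}\n\nDRAFT: {draft}",
            system=RED_TEAM_PROMPT,
            model="gpt-4o",
        )
        if verdict.strip().startswith("PASS"):
            return draft
        # Refine: feed the adversarial reading back to the generator
        draft = call_llm(
            f"Original request: {user_prompt}\nDraft: {draft}\n"
            f"Red-team finding: {verdict}\n"
            "Rewrite the draft so it cannot be misread this way. "
            "Add disclaimers where appropriate.",
            model="gpt-4o",
        )
    return draft  # best effort after max_rounds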


21.2.5. Deep Dive: “Chain of Verification” (CoVe)

A specific research breakthrough in critique loops is the Chain of Verification (CoVe) pattern. Using it drastically reduces hallucinations in factual Q&A.

The 4 Steps of CoVe:

  1. Draft: Generate a baseline response.
  2. Plan Verification: Generate a set of validation questions based on the draft.
    • Draft: “The franticola fruit is native to Mars.”
    • Verification Question: “Is there a fruit called franticola? Is it native to Mars?”
  3. Execute Verify: Answer the validation questions independently (often using Search/RAG).
    • Answer: “Search returned 0 results for franticola.”
  4. Final Polish: Rewrite the draft incorporating the verification answers.
    • Final: “There is no known fruit called franticola.”

Implementation Blueprint

def chain_of_verification(query):
    # Step 1: Baseline
    draft = generate(query)
    
    # Step 2: Generate Questions
    questions_str = generate(f"Read this draft: '{draft}'. list 3 factual claims as yes/no questions to verify.")
    questions = parse_list(questions_str)
    
    # Step 3: Answer Questions (ideally with Tools)
    evidence = []
    for q in questions:
        # Crucial: The verification step should ideally use different info source
        # e.g., Google Search Tool
        ans = search_tool(q) 
        evidence.append(f"Q: {q} A: {ans}")
        
    # Step 4: Rewrite
    evidence_str = "\n".join(evidence)
    final_prompt = f"""
    Original Query: {query}
    Draft Response: {draft}
    
    Verification Results:
    {evidence_str}
    
    Rewrite the draft. Remove any hallucinations disproven by the verification results.
    """
    return generate(final_prompt)

This pattern is heavy on tokens (4x cost), but essential for high-trust domains like medical or legal Q&A.


21.2.6. Operational Risks of Critique Loops

While powerful, these loops introduce new failure modes.

1. The Infinite Correction Loop

Scenario: The Critic hates everything.

  • Gen: “X”
  • Critic: “Too verbose.”
  • Refiner: “x”
  • Critic: “Too brief.”
  • Refiner: “X” …

Fix: Max Retries (n=3) and Decay. If attempt > 2, force the Critic to accept the best effort, or fallback to a human operator.
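A sketch of the bounded loop with a best-effort fallback, assuming generate/critique/refine helpers like those shown earlier and a critic that returns a pass flag, a 1-10 score, and a critique string:

def critique_loop_with_decay(prompt: str, max_retries: int = 3):
    """Bounded Gen -> Critic -> Refine loop that never spins forever.
    Sketch only: generate, critique, and refine are assumed helpers."""
    best_draft, best_score = None, -1
    last_critique = ""

    for attempt in range(max_retries):
        draft = generate(prompt) if attempt == 0 else refine(prompt, best_draft, last_critique)
        verdict = critique(prompt, draft)  # e.g. {"pass": bool, "score": int, "critique": str}

        if verdict["pass"]:
            return draft  # clean exit

        if verdict["score"] > best_score:  # keep the best effort so far
            best_draft, best_score = draft, verdict["score"]
        last_critique = verdict["critique"]

    # Decay: after the budget is spent, ship the best effort (or escalate to a human)
    return best_draft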

2. The “Silo” Collapse (Mode Collapse)

If the Generator and Critic are the exact same model (e.g., both GPT-4), the Critic might just agree with the Generator because they share the same training weights. “I wrote it, so it looks right to me.”

Fix: Model Diversity. Use GPT-4 to critique Claude 3. Or use Llama-3-70B to critique Llama-3-8B. Using a Stronger Model to critique a Weaker Model is a very cost-effective strategy.

  • Gen: Llama-3-8B (Cheap).
  • Critic: GPT-4o (Expensive, but only runs once and outputs short “Yes/No”).
  • Result: GPT-4 quality at Llama-3 prices (mostly).

3. Latency Explosion

A simplistic loop triples your latency (Gen + Critic + Refiner). Fix: Optimistic Streaming. Stream the Draft to the user while the Critic is running. If the Critic flags an issue, you send a “Correction” patch or a UI warning. (Note: This is risky for safety filters, but fine for factual quality checks).
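A sketch of the optimistic variant using asyncio; generate_stream, send_to_ui, run_critic, and notify_user are assumed application hooks, not library APIs:

import asyncio

async def optimistic_answer(prompt: str):
    """Stream the draft immediately; run the critic retroactively.
    Sketch only: all I/O hooks are assumed to be provided by the application."""
    chunks = []

    async for token in generate_stream(prompt):   # assumed async token generator
        chunks.append(token)
        await send_to_ui(token)                   # user sees the answer now

    draft = "".join(chunks)

    # Retroactive check: does not block the first paint
    verdict = await run_critic(prompt, draft)     # assumed async critic call
    if not verdict["pass"]:
        await notify_user(f"Heads up: {verdict['critique']}")  # correction patch / UI warning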


21.2.7. Performance Optimization: The “Critic” Quantization

The Critic often doesn’t need to be creative. It needs to be discriminating. Discriminative tasks often survive quantization better than generative tasks.

You can fine-tune a small model (e.g., Mistral-7B) specifically to be a “Policy Auditor”. Training Data:

  • Input: “User Intent + Draft Response”
  • Output: “Pass” or “Fail: Reason”.

A fine-tuned 7B model can outperform GPT-4 on specific compliance checks (e.g., “Check if PII is redacted”) because it is hyper-specialized.

Fine-Tuning a Critic

  1. Generate Data: Use GPT-4 to critique 10,000 outputs. Save the (Draft, Critique) pairs.
  2. Train: Fine-tune Mistral-7B to predict the Critique from the Draft.
  3. Deploy: Run this small model as a sidecar guardrail.

This reduces the cost of the loop from $0.03/run to $0.0001/run.
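The training records for such a critic can be plain chat-formatted JSONL; the drafts and labels below are illustrative, not real data:

import json

# Illustrative (Draft -> Verdict) pairs distilled from a strong teacher model
records = [
    {
        "messages": [
            {"role": "system", "content": "You are a PII compliance auditor. Output Pass or Fail: <reason>."},
            {"role": "user", "content": "Intent: order status\nDraft: Your order ships to 42 Elm St, and your SSN ending 1234 is on file."},
            {"role": "assistant", "content": "Fail: Draft exposes a street address and a partial SSN."},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a PII compliance auditor. Output Pass or Fail: <reason>."},
            {"role": "user", "content": "Intent: order status\nDraft: Your order has shipped and should arrive within 3 business days."},
            {"role": "assistant", "content": "Pass"},
        ]
    },
]

with open("critic_train.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")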


21.2.8. Case Study: Automated Code Review Bot

A practical application of a disconnected Critic Loop is an Automated Pull Request Reviewer.

Workflow:

  1. Trigger: New PR opened.
  2. Generator (Scanner): Scans diffs. For each changed function, generates a summary.
  3. Critic (Reviewer): Looks at the (Code + Summary).
    • Checks for: Hardcoded secrets, O(n^2) loops in critical paths, missing tests.
  4. Filter: If Severity < High, discard the critique. (Don’t nag devs about whitespace).
  5. Action: Post comment on GitHub.

In MLOps, this agent runs in CI/CD. The “Critic” here is acting as a senior engineer. The value is not in creating code, but in preventing bad code.
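A sketch of steps 4 and 5 of the workflow above, with post_pr_comment as a hypothetical wrapper around your Git host's API:

from dataclasses import dataclass
from typing import List

@dataclass
class ReviewFinding:
    file: str
    line: int
    severity: str   # "Low" | "Medium" | "High"
    message: str

def filter_findings(findings: List[ReviewFinding]) -> List[ReviewFinding]:
    """Step 4: drop low-severity nits so developers only see what matters."""
    return [f for f in findings if f.severity in ("Medium", "High")]

def post_review(findings: List[ReviewFinding]) -> None:
    """Step 5: publish the surviving critiques.
    post_pr_comment is an assumed wrapper around the PR comment API."""
    for f in filter_findings(findings):
        post_pr_comment(f"[{f.severity}] {f.file}:{f.line} - {f.message}")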


21.2.9. Advanced Pattern: Multi-Critic Consensus

For extremely high-stakes decisions (e.g., medical diagnosis assistance), one Critic is not enough. We use a Panel of Critics.

graph TD
    Draft[Draft Diagnosis] --> C1[Critic: Toxicologist]
    Draft --> C2[Critic: Cardiologist]
    Draft --> C3[Critic: General Practitioner]
    
    C1 --> V1[Vote/Feedback]
    C2 --> V2[Vote/Feedback]
    C3 --> V3[Vote/Feedback]
    
    V1 & V2 & V3 --> Agg[Aggregator LLM]
    Agg --> Final[Final Consensus]

This mimics a hospital tumor board. C1 might be prompted with “You are a toxicology expert…”, C2 with “You are a heart specialist…”. The Aggregator synthesizes the different viewpoints. “The cardiologist suggests X, but the toxicologist warns about interaction Y.”

This is the frontier of Agentic Reasoning.
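A sketch of the panel fan-out, reusing the assumed call_llm helper; the personas mirror the diagram above:

PANEL = {
    "Toxicologist": "You are a toxicology expert. Review the draft diagnosis for drug interactions and dosing risks.",
    "Cardiologist": "You are a heart specialist. Review the draft diagnosis for cardiac red flags.",
    "General Practitioner": "You are a GP. Review the draft diagnosis for overall plausibility and missing differentials.",
}

def panel_review(case_summary: str, draft: str) -> str:
    """Fan out to several persona critics, then synthesize.
    Sketch only: call_llm is the assumed helper used throughout this chapter."""
    feedback = []
    for role, persona in PANEL.items():
        note = call_llm(
            f"Case: {case_summary}\n\nDraft diagnosis: {draft}\n\nGive your critique.",
            system=persona,
            model="gpt-4o",
        )
        feedback.append(f"{role}: {note}")

    # Aggregator synthesizes the (possibly conflicting) viewpoints
    return call_llm(
        f"Draft diagnosis: {draft}\n\nPanel feedback:\n" + "\n".join(feedback) +
        "\n\nProduce a consensus revision, noting any disagreements explicitly.",
        model="gpt-4o",
    )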


21.2.10. Deep Dive: Implementing Robust Chain of Verification (CoVe)

The “Chain of Verification” pattern is so central to factual accuracy that it deserves a full reference implementation. We will build a reusable Python class that wraps any LLM client to add verification superpowers.

The VerifiableAgent Architecture

We will implement a class that takes a query, generates a draft, identifies claims, verifies them using a Search Tool (mocked here), and produces a cited final answer.

import re
import json
from typing import List, Dict, Any
from dataclasses import dataclass

@dataclass
class VerificationFact:
    claim: str
    verification_question: str
    verification_result: str
    is_supported: bool

class SearchTool:
    """Mock search tool for demonstration."""
    def search(self, query: str) -> str:
        # In prod, connect to Tavily, SerpAPI, or Google Custom Search
        db = {
            "current ceo of twitter": "Linda Yaccarino is the CEO of X (formerly Twitter).",
            "population of mars": "The current population of Mars is 0 humans.",
            "release date of gta 6": "Rockstar Games confirmed GTA 6 is coming in 2025."
        }
        for k, v in db.items():
            if k in query.lower():
                return v
        return "No specific information found."

class VerifiableAgent:
    def __init__(self, client, model="gpt-4o"):
        self.client = client
        self.model = model
        self.search_tool = SearchTool()

    def _call_llm(self, messages: List[Dict], json_mode=False) -> str:
        kwargs = {"model": self.model, "messages": messages}
        if json_mode:
            kwargs["response_format"] = {"type": "json_object"}
        
        response = self.client.chat.completions.create(**kwargs)
        return response.choices[0].message.content

    def generate_draft(self, query: str) -> str:
        return self._call_llm([
            {"role": "system", "content": "You are a helpful assistant. Answer the user query directly."},
            {"role": "user", "content": query}
        ])

    def identify_claims(self, draft: str) -> List[Dict]:
        """Extracts checkable claims from the draft."""
        prompt = f"""
        Analyze the following text and extract discrete, factual claims that reference specific entities, dates, or numbers.
        Ignore opinions or general advice.
        
        Text: "{draft}"
        
        Output JSON: {{ "claims": [ {{ "claim": "...", "verification_question": "..." }} ] }}
        """
        response = self._call_llm([{"role": "user", "content": prompt}], json_mode=True)
        return json.loads(response)["claims"]

    def verify_claims(self, claims: List[Dict]) -> List[VerificationFact]:
        results = []
        for item in claims:
            # 1. Search
            evidence = self.search_tool.search(item["verification_question"])
            
            # 2. Judge (The Critic Step)
            judge_prompt = f"""
            Claim: "{item['claim']}"
            Evidence: "{evidence}"
            
            Does the evidence support the claim?
            Output JSON: {{ "supported": bool, "reason": "..." }}
            """
            judgment = json.loads(self._call_llm([{"role": "user", "content": judge_prompt}], json_mode=True))
            
            results.append(VerificationFact(
                claim=item['claim'],
                verification_question=item['verification_question'],
                verification_result=evidence,
                is_supported=judgment['supported']
            ))
        return results

    def rewrite(self, query: str, draft: str, verifications: List[VerificationFact]) -> str:
        # Format the verification report for the Refiner prompt
        facts_str = "\n".join([
            f"- Claim: {v.claim}\n  Evidence: {v.verification_result}\n  Supported: {v.is_supported}"
            for v in verifications
        ])
        
        prompt = f"""
        Original Query: {query}
        Original Draft: {draft}
        
        Verification Report:
        {facts_str}
        
        Task: Rewrite the Draft. 
        1. Remove any claims that originated in the draft but were marked 'Supported: False'.
        2. Cite the evidence where appropriate.
        3. If evidence was 'No info found', state uncertainty.
        """
        return self._call_llm([{"role": "user", "content": prompt}])

    def run(self, query: str) -> Dict:
        print(f"--- Processing: {query} ---")
        # 1. Draft
        draft = self.generate_draft(query)
        print(f"[Draft]: {draft[:100]}...")
        
        # 2. Plan
        claims = self.identify_claims(draft)
        print(f"[Claims]: Found {len(claims)} claims.")
        
        # 3. Verify
        verifications = self.verify_claims(claims)
        
        # 4. Refine
        final = self.rewrite(query, draft, verifications)
        print(f"[Final]: {final[:100]}...")
        
        return {
            "draft": draft,
            "verifications": verifications,
            "final": final
        }

Why This Matters for MLOps

This is code you can unit test.

  • You can test identify_claims with a fixed text.
  • You can test verify_claims with mocked search results.
  • You can trace the cost (4 searches + 4 LLM calls) and optimize.

This moves Prompt Engineering from “Guesswork” to “Software Engineering”.
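For example, a pytest-style sketch for verify_claims that mocks the LLM client and relies on the mock SearchTool already baked into the class:

import json
from unittest.mock import MagicMock

def test_verify_claims_flags_unsupported_claim():
    """Unit test sketch: the LLM client is mocked, so no network calls are made."""
    fake_client = MagicMock()
    agent = VerifiableAgent(client=fake_client)

    # The judge call should come back as "not supported"
    fake_judgment = json.dumps({"supported": False, "reason": "Evidence contradicts the claim."})
    fake_client.chat.completions.create.return_value.choices = [
        MagicMock(message=MagicMock(content=fake_judgment))
    ]

    claims = [{
        "claim": "Mars has a population of 1 million people.",
        "verification_question": "What is the population of Mars?",
    }]
    results = agent.verify_claims(claims)

    assert len(results) == 1
    assert results[0].is_supported is False
    assert "0 humans" in results[0].verification_result  # comes from the mock SearchTool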


21.2.11. Deep Dive: Constitutional AI with LangChain

Managing strict personas for Critics is difficult with raw strings. LangChain’s ConstitutionalChain provides a structured way to enforce principles.

This example demonstrates how to enforce a “Non-Violent” and “Concise” constitution.

Implementation

from langchain.llms import OpenAI
from langchain.chains import ConstitutionalChain, LLMChain
from langchain.chains.constitutional_ai.models import ConstitutionalPrinciple
from langchain.prompts import PromptTemplate

# 1. Define Principles
# Principle: The critique criteria
# Correction: How to guide the rewrite
principle_concise = ConstitutionalPrinciple(
    name="Conciseness",
    critique_request="Identify any verbose sentences or redundant explanation.",
    revision_request="Rewrite the text to be as concise as possible, removing filler words."
)

principle_safe = ConstitutionalPrinciple(
    name="Safety",
    critique_request="Identify any content that encourages dangerous illegal acts.",
    revision_request="Rewrite the text to explain why the act is dangerous, without providing instructions."
)

# 2. Setup Base Chain (The Generator)
llm = OpenAI(temperature=0.9) # High temp for creativity
qa_chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template("Answer this: {question}"))

# 3. Setup Constitutional Chain (The Critic Loop)
# This chain automatically handles the Draft -> Critique -> Refine loop
constitutional_chain = ConstitutionalChain.from_llm(
    llm=llm, # Usually use a smarter/different model here!
    chain=qa_chain,
    constitutional_principles=[principle_safe, principle_concise],
    verbose=True # Shows the critique process
)

# 4. Execution
query = "How do I hotwire a car quickly?"
result = constitutional_chain.run(query)

# Output Stream:
# > Entering new ConstitutionalChain...
# > Generated: "To hotwire a car, strip the red wire and..." (Dangerous!)
# > Critique (Safety): "The model is providing instructions for theft."
# > Revision 1: "I cannot teach you to hotwire a car as it is illegal. However, the mechanics of ignition involve..."
# > Critique (Conciseness): "The explanation of ignition mechanics is unnecessary filler."
# > Revision 2: "I cannot assist with hotwiring cars, as it is illegal."
# > Finished.

MLOps Implementation Note

In production, you do not want to run this chain for every request (Cost!). Strategy: Sampling. Run the Constitutional Chain on 100% of “High Risk” tier queries (detected by the Router) and 5% of “Low Risk” queries. Use the data from that 5% to fine-tune the base model to be naturally safer and more concise, reducing the need for the loop over time.


21.2.12. Guardrails as Critics: NVIDIA NeMo

For enterprise MLOps, you often need faster, deterministic checks. NeMo Guardrails is a framework that acts as a Critic layer using a specialized syntax (Colang).

Architecture

NeMo intercepts the user message and the bot response. It uses embeddings to map the conversation to “canonical forms” (flows) and enforces rules.

config.yml

models:
  - type: main
    engine: openai
    model: gpt-4o

rails:
  input:
    flows:
      - self check input
  output:
    flows:
      - self check output

rails.co (Colang Definitions)

define user ask about politics
  "Who should I vote for?"
  "Is the president doing a good job?"

define bot refuse politics
  "I am an AI assistant and I do not have political opinions."

# Flow: Pre-emptive Critic (Input Rail)
define flow politics
  user ask about politics
  bot refuse politics
  stop

# Flow: Fact Checking Critic (Output Rail)
define subflow self check output
  $check_result = execute check_facts(input=$last_user_message, output=$bot_message)
  if $check_result == False
    bot inform hallucination detected
    stop

NeMo Guardrails is powerful because it formalizes the Critic. Instead of a vague prompt “Be safe”, you define specific semantic clusters (“user ask about politics”) that trigger hard stops. This is Hybrid Governance—combining the flexibility of LLMs with the rigidity of policies.


21.2.13. Benchmarking Your Critic

A Critic is a model. Therefore, it has metrics. You must measure your Critic’s performance independently of the Generator.

The Confusion Matrix of Critique

|  | Draft has Error | Draft is Correct |
| --- | --- | --- |
| Critic Flags Error | True Positive (Good Catch) | False Positive (Annoying Nagger) |
| Critic Says Pass | False Negative (Safety Breach) | True Negative (Efficiency) |

Metric Definitions

  1. Recall (Safety Score): TP / (TP + FN).
    • “Of all the bad answers, how many did the critic catch?”
    • Example: If the generator output 10 toxic messages and the critic caught 8, Recall = 0.8.
  2. Precision (Annoyance Score): TP / (TP + FP).
    • “Of all the times the critic complained, how often was it actually right?”
    • Example: If the critic flagged 20 messages, but 10 were actually fine, Precision = 0.5.

Trade-off:

  • Tuning for high Recall = Low Risk, but High Cost (you end up rewriting answers that were already fine).
  • Tuning for high Precision = Low Cost, but Higher Risk (more bad answers slip through unflagged).

The “Critic-Eval” Dataset

To calculate these, you need a labeled dataset of (Prompt, Draft, Label).

  1. Create: Take 500 historic logs.
  2. Label: Have humans mark them as “Pass” or “Fail”.
  3. Run: Run your Critic Prompt on these 500 drafts.
  4. Compare: Compare Critic Output vs Human Label.

If your Critic’s correlation with human labelers is < 0.7, do not deploy the loop. A bad critic is worse than no critic, as it adds latency without adding reliability.
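As a sketch of step 4 (Compare), precision, recall, and a simple agreement rate can be computed directly from the paired labels; the agreement number here is a crude stand-in for the correlation threshold mentioned above:

def score_critic(human_labels: list[str], critic_labels: list[str]) -> dict:
    """Compare critic verdicts against human labels ("Pass"/"Fail").
    "Fail" is treated as the positive class, i.e. 'draft has an error'."""
    pairs = list(zip(human_labels, critic_labels))
    tp = sum(h == "Fail" and c == "Fail" for h, c in pairs)
    fp = sum(h == "Pass" and c == "Fail" for h, c in pairs)
    fn = sum(h == "Fail" and c == "Pass" for h, c in pairs)
    tn = sum(h == "Pass" and c == "Pass" for h, c in pairs)

    return {
        "recall":    tp / (tp + fn) if (tp + fn) else 0.0,   # safety score
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,   # annoyance score
        "agreement": (tp + tn) / len(pairs),                 # raw agreement with humans
    }

# Example usage on the 500 labeled drafts of your Critic-Eval dataset:
# metrics = score_critic(human_labels, critic_labels)
# if metrics["agreement"] < 0.7: do not deploy the loop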


21.2.14. The Refiner: The Art of Styleshifting

The third component of the loop is the Refiner. Sometimes the critique is valid (“Tone is too casual”), but the Refiner overcorrects (“Tone becomes Shakespearean”).

Guided Refinement Prompts

Don’t just say “Fix it.” Say: “Rewrite this specific section: [quote]. Change X to Y. Keep the rest identical.”

Edit Distance Minimization

A good MLOps practice for Refiners is to minimize the Levenshtein distance between Draft and Final, subject to satisfying the critique. We want the Minimum Viable Change.

Prompt Pattern:

You are a Surgical Editor.
Original Text: {text}
Critique: {critique}

Task: Apply the valid critique points to the text.
Constraint: Change as few words as possible. Do not rewrite the whole paragraph if changing one adjective works.

This preserves the “Voice” of the original generation while fixing the bugs.


21.2.15. Handling Interaction: Multi-Turn Critique

Sometimes the Critic is the User. “That allows me to import the library, but I’m getting a Version Conflict error.”

This is a Human-in-the-Loop Critic. The architectural challenge here is Context Management. The Refiner must see:

  1. The Original Plan.
  2. The First Attempt Code.
  3. The User’s Error Message.

The “Stack” Problem: If the user corrects the model 10 times, the context window fills up with broken code. Strategy: Context Pruning. When a Refinement is successful (User says “Thanks!”), the system should (in the background) summarize the learning and clear the stack of 10 failed attempts, replacing them with the final working snippet. This keeps the “Working Memory” clean for the next task.
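A sketch of the pruning step, assuming the standard chat-message format used elsewhere in this section and a hypothetical summarize_llm helper:

def prune_failed_attempts(messages: list[dict], final_snippet: str) -> list[dict]:
    """Once the user confirms the fix works, collapse the debugging back-and-forth.
    Sketch only: summarize_llm is an assumed helper that condenses the failed attempts."""
    system_msgs = [m for m in messages if m["role"] == "system"]
    first_user = next(m for m in messages if m["role"] == "user")

    # Distill the lesson from the failed attempts in the background
    lesson = summarize_llm(
        "Summarize in two sentences what went wrong in these attempts and why the final fix works:\n"
        + "\n".join(m["content"] for m in messages if m["role"] == "assistant")
    )

    distilled = {"role": "assistant", "content": f"{lesson}\n\nFinal working snippet:\n{final_snippet}"}

    # Keep: system prompt + original request + distilled outcome. Drop: the stack of broken drafts.
    return system_msgs + [first_user, distilled]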


21.2.16. Implementation: The Surgical Refiner

The Refiner is often the weak link. It tends to hallucinate new errors while fixing old ones. We can force stability by using a Diff-Guided Refiner.

Code Example: Diff-Minimizing Refiner

import difflib
from openai import OpenAI

client = OpenAI()

def surgical_refine(original_text, critique, intent):
    """
    Refines text based on critique, but penalizes large changes.
    """
    
    SYSTEM_PROMPT = """
    You are a Minimalist Editor.
    Your Goal: Fix the text according to the Critique.
    Your Constraint: Keep the text as close to the original as possible.
    Do NOT rewrite sentences that are not affected by the critique.
    """
    
    USER_PROMPT = f"""
    ORIGINAL:
    {original_text}
    
    CRITIQUE:
    {critique}
    
    INTENT:
    {intent}
    
    Output the corrected text only.
    """
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_PROMPT}
        ],
        temperature=0.0 # Strict determinism
    )
    
    new_text = response.choices[0].message.content
    
    # Calculate Change Percentage
    s = difflib.SequenceMatcher(None, original_text, new_text)
    similarity = s.ratio()
    
    print(f"Refinement Similarity: {similarity:.2f}")
    
    if similarity < 0.8:
        print("WARNING: Refiner rewrote >20% of the text. This might be unsafe.")
        
    return new_text

Why the Similarity Check?

If the critique is “Fix the spelling of ‘colour’ to ‘color’”, the similarity should be 99.9%. If the model rewrites the whole paragraph, similarity drops to 50%. By monitoring similarity, we can detect Refiner Hallucinations (e.g., the Refiner deciding to change the tone randomly).


21.2.17. Case Study: The Medical Triage Critic

Let’s look at a high-stakes example where the Critic is a Safety Layer.

System: AI Symptom Checker. Goal: Advise users on whether to see a doctor.

The Problem

The Generative Model (GPT-4) is helpful but sometimes overly reassuring.

  • User: “I have a crushing chest pain radiating to my left arm.”
  • Gen (Hallucination): “It might be muscle strain. Try stretching.” -> FATAL ERROR.

The Solution: The “Red Flag” Critic

Architecture:

  1. Generator: Produces advice.
  2. Critic: Med-PaLM 2 (or a prompt-engineered GPT-4) focused only on urgency.
    • Prompt: “Does the user description match any entry in the Emergency Triage List (Heart Attack, Stroke, Sepsis)? If yes, output EMERGENCY.”
  3. Override: If Critic says EMERGENCY, discard Generator output. Return hardcoded “Call 911” message.

Interaction Log

| Actor | Action | Content |
| --- | --- | --- |
| User | Input | “My baby has a fever of 105F and is lethargic.” |
| Generator | Draft | “High fever is common. Keep them hydrated and…” |
| Critic | Review | DETECTED: Pediatric fever >104F + Lethargy = Sepsis Risk. VERDICT: FAIL (Critical). |
| System | Override | “Please go to the ER immediately. This requires urgent attention.” |

In this architecture, the Generative Model creates the “Bedside Manner” (polite conversation), but the Critic provides the “Clinical Guardrail”.
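A minimal sketch of the override logic, reusing the assumed call_llm helper; the triage prompt and escalation text are illustrative placeholders, not clinical guidance:

EMERGENCY_TRIAGE_PROMPT = """
You are a triage safety check. Does the user's description match any entry on the
Emergency Triage List (Heart Attack, Stroke, Sepsis)?
Answer with exactly one word: EMERGENCY or ROUTINE.
"""

HARDCODED_ESCALATION = (
    "Please call 911 or go to the nearest emergency room immediately. "
    "This requires urgent medical attention."
)

def triage_guarded_reply(user_message: str) -> str:
    """Critic-as-override: the generator's draft is discarded entirely on a red flag.
    Sketch only: call_llm is the assumed helper used throughout this chapter."""
    draft = call_llm(user_message, model="gpt-4o")  # bedside manner

    verdict = call_llm(user_message, system=EMERGENCY_TRIAGE_PROMPT, model="gpt-4o")
    if "EMERGENCY" in verdict.upper():
        return HARDCODED_ESCALATION                 # clinical guardrail wins

    return draft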


21.2.18. Cost Analysis of Critique Loops

Critique loops are expensive. Formula: Cost = (Gen_Input + Gen_Output) + (Critic_Input + Critic_Output) + (Refiner_Input + Refiner_Output)

Let’s break down a typical RAG Summary task (2k input context, 500 output).

Single Pass (GPT-4o):

  • Input: 2k tokens × $5 / 1M = $0.01
  • Output: 500 tokens × $15 / 1M = $0.0075
  • Total: $0.0175

Critique Loop (GPT-4o Gen + GPT-4o Critic + GPT-4o Refiner):

  • Phase 1 (Gen): $0.0175
  • Phase 2 (Critic):
    • Input (Prompt + Draft): 2.5k tokens = $0.0125
    • Output (Critique): 100 tokens = $0.0015
  • Phase 3 (Refiner):
    • Input (Prompt + Draft + Critique): 2.6k tokens = $0.013
    • Output (Final): 500 tokens = $0.0075
  • Total: $0.052

Multiplier: The loop is 3x the cost of the single pass.

Optimization Strategy: The “Cheap Critic”

Use Llama-3-70B (Groq/Together) for the Critic and Refiner.

  • Gen (GPT-4o): $0.0175
  • Critic (Llama-3): $0.002
  • Refiner (Llama-3): $0.002
  • Total: $0.0215

Result: You get 99% of the reliability for only 20% extra cost (vs 200% extra).
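The arithmetic above is easy to encode as a helper so you can budget loops per route; the token counts and prices below mirror the worked example (GPT-4o at $5/$15 per 1M tokens):

def loop_cost(tokens: dict, price_in: float, price_out: float) -> float:
    """Cost of one Gen -> Critic -> Refine pass, with prices in $ per 1M tokens."""
    total = 0.0
    for stage in ("gen", "critic", "refiner"):
        total += tokens[f"{stage}_in"] / 1e6 * price_in
        total += tokens[f"{stage}_out"] / 1e6 * price_out
    return total

# Worked example from above: 2k-token context, 500-token answer, single-model loop
tokens = {
    "gen_in": 2000, "gen_out": 500,
    "critic_in": 2500, "critic_out": 100,
    "refiner_in": 2600, "refiner_out": 500,
}
print(round(loop_cost(tokens, price_in=5.0, price_out=15.0), 4))  # ~0.052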


21.2.19. Visualizing Synchronous vs Asynchronous Critique

Depending on latency requirements, where does the critique sit?

A. Synchronous (Blocking)

High Latency, High Safety. Used for: Medical, Legal, Financial.

sequenceDiagram
    participant User
    participant Gen
    participant Critic
    participant UI
    
    User->>Gen: Request
    Gen->>Critic: Draft (Internal)
    Critic->>Gen: Critique
    Gen->>Gen: Refine
    Gen->>UI: Final Response
    UI->>User: Show Message

User Experience: “Thinking…” spinner for 10 seconds.

B. Asynchronous (Non-Blocking)

Low Latency, Retroactive Safety. Used for: Coding Assistants, Creative Writing.

sequenceDiagram
    participant User
    participant Gen
    participant Critic
    participant UI
    
    User->>Gen: Request
    Gen->>UI: Stream Draft immediately
    UI->>User: Show Draft
    
    par Background Check
        Gen->>Critic: Check Draft
        Critic->>UI: Flag detected!
    end
    
    UI->>User: [Pop-up] "Warning: This code may contain a bug."

User Experience: Instant response. Red squiggly line appears 5 seconds later.


21.2.20. The Prover-Verifier Game

The ultimate evolution of the Critic-Generator loop is the Prover-Verifier Game (as seen in OpenAI’s research on math solving).

Instead of one generic critic, you train a Verifier Network on a dataset of “Solution Paths”.

  • Generator: Generates 100 step-by-step solutions to a math problem.
  • Verifier: Scores each step. “Step 1 looks valid.” “Step 2 looks suspect.”
  • Outcome: The system selects the solution path with the highest cumulative verification score.

This is different from a simple Critic because it operates at the Process Level (reasoning steps) rather than the Outcome Level (final answer).

For MLOps, this means logging Traces (steps), not just pairs. Your dataset schema moves from (Input, Output) to (Input, Step1, Step2, Output).


21.2.21. Anti-Patterns in Critique Loops

Just as we discussed routing anti-patterns, critique loops have their own set of failure modes.

1. The Sycophantic Critic

Symptom: The Critic agrees with everything the Generator says, especially when the Generator is a stronger model.

Cause: Training data bias. Most instruction-tuned models are trained to be “helpful” and “agreeable”. They are biased towards saying “Yes”.

Fix: Break the persona. Don’t say “Critique this.” Say “You are a hostile red-teamer. Find one flaw. If you cannot find a flaw, invent a potential ambiguity.” It is easier to filter out a false-positive critique than to induce a critique from a sycophant.

2. The Nitpicker (Hyper-Correction)

Symptom: The Critic complains about style preferences (“I prefer ‘utilize’ over ‘use’”) rather than factual errors.

Result: The Refiner rewrites the text 5 times, degrading quality and hitting rate limits.

Fix: Enforce Severity Labels. Prompt the Critic to output Severity: Low|Medium|High. In your Python glue code, if severity == 'Low': pass. Only trigger the Refiner for High/Medium issues.

3. The Context Window Overflow

Symptom: Passing the full dialogue history + draft + critique + instructions exceeds the context window (or just gets expensive/slow).

Fix: Ephemeral Critique. You don’t need to keep the Critique in the chat history.

  1. Gen Draft.
  2. Gen Critique.
  3. Gen Final.
  4. Save only “User Prompt -> Final” to the database history. Discard the intermediate “thought process” unless you need it for debugging.
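A sketch of what gets persisted, assuming the generate/critique/refine helpers from earlier and a hypothetical history_db interface:

def answer_with_ephemeral_critique(user_prompt: str, history_db) -> str:
    """Run the full loop, but persist only the clean (prompt -> final) pair.
    Sketch only: generate, critique, refine, and history_db.save are assumed interfaces."""
    draft = generate(user_prompt)
    verdict = critique(user_prompt, draft)
    final = draft if verdict["pass"] else refine(user_prompt, draft, verdict["critique"])

    # The intermediate draft and critique can be logged for debugging, but are not stored as chat history
    history_db.save({"role": "user", "content": user_prompt})
    history_db.save({"role": "assistant", "content": final})
    return final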

21.2.22. Troubleshooting: Common Loop Failures

| Symptom | Diagnosis | Treatment |
| --- | --- | --- |
| Loop spins forever | max_retries not set, or the Refiner keeps triggering new critiques. | Implement max_retries=3. Set temperature=0 for the Refiner to ensure stability. |
| Refiner breaks code | Refiner fixes the logic bug but introduces a syntax error (e.g., missing imports) because it didn’t see the full file. | Give the Refiner the Full File Context, not just the snippet. Use a Linter/Compiler as a 2nd Critic. |
| Latency > 15s | Sequential processing of slow models. | Switch to Speculative Decoding or Asynchronous checks. Use smaller models (Haiku/Flash) for the Critic. |
| “As an AI…” | Refiner refuses to generate the fix because the critique touched a safety filter. | Tune the Safety Filter (guardrails) to be context-aware. “Discussing a bug in a bomb-detection script is not the same as building a bomb.” |

21.2.23. Reference: Critic System Prompts

Good prompts are the “hyperparameters” of your critique loop. Here are battle-tested examples for common scenarios.

1. The Security Auditor (Code)

Role: AppSec Engineer
Objective: Identify security vulnerabilities in the provided code snippet.
Focus Areas: SQL Injection, XSS, Hardcoded Secrets, Insecure Deserialization.

Instructions:
1. Analyze the logic flow.
2. If a vulnerability exists, output: "VULNERABILITY: [Type] - [Line Number] - [Explanation]".
3. If no vulnerability exists, output: "PASS".
4. Do NOT comment on style or clean code, ONLY security.

2. The Brand Guardian (Tone)

Role: Senior Brand Manager
Objective: Ensure the copy aligns with the "Helpful, Humble, and Human" brand voice.
Guidelines:
- No jargon (e.g., "leverage", "synergy").
- No passive voice.
- Be empathetic but not apologetic.

Draft: {text}

Verdict: [PASS/FAIL]
Critique: (If FAIL, list specific words to change).

3. The Hallucination Hunter (QA)

Role: Fact Checker
Objective: Verify if the Draft Answer is supported by the Retrieved Context.

Retrieved Context:
{context}

Draft Answer:
{draft}

Algorithm:
1. Break Draft into sentences.
2. For each sentence, check if it is fully supported by the Context.
3. If a sentence contains info NOT in Context, flag as HALLUCINATION.

Output:
{"status": "PASS" | "FAIL", "hallucinations": ["list of unsupported claims"]}

4. The Logic Prover (Math/Reasoning)

Role: Math Professor
Objective: Check the steps of the derivation.

Draft Solution:
{draft}

Task:
Go step-by-step.
Step 1: Verify calculation.
Step 2: Verify logic transition.
If any step is invalid, flag it. Do not check the final answer, check the *path*.

21.2.24. Summary Checklist for Critique Loops

To implement reliable self-correction:

  • Dual Models: Ensure the Critic is distinct (or distinctly prompted) from the Generator.
  • Stop Criteria: Ensure the Critic has clear criteria for “Pass” vs “Fail”.
  • Loop Limit: Hard code a max_retries break to prevent infinite costs.
  • Verification Tools: Give the Critic access to Ground Truth (Search, DB, Calculator) whenever possible.
  • Latency Budget: Decide if the critique happens before the user sees output (Synchronous) or after (Asynchronous/email follow-up).
  • Golden Set: Maintain a dataset of “Known Bad Drafts” to regression test your Critic.
  • Diff-Check: Monitor the edit_distance of refinements to prevent over-correction.

In the next section, 21.3 Consensus Mechanisms, we will look at how to scale this from one critic to a democracy of models voting on the best answer.