21.2. Critic-Generator Loops: The Engine of Reliability
The “Bullshit” Asymmetry Principle
Large Language Models (LLMs) suffer from a fundamental asymmetry: It is easier to verify a solution than to generate it.
This is not unique to AI; it is a property of computational complexity classes (P vs NP). It is hard to find the prime factors of a large number, but trivial to verify them by multiplication. Similarly, for an LLM, writing a perfect Python function that handles all edge cases is “hard” (high entropy), but looking at a generated function and spotting a compilation error or a missing docstring is “easy” (low entropy).
In the “Zero-Shot” era of 2023, developers relied on a single pass:
Prompt -> Model -> Output.
If the output was wrong, the system failed.
In the Compound AI System era, we treat the initial generation as merely a “First Draft”. We then employ a second distinct cognitive step—often performed by a different model or the same model with a different persona—to critique, verify, and refine that draft.
This architecture is known as the Critic-Generator Loop (or Check-Refine Loop), and it is the single most effective technique for boosting system reliability from 80% to 99%.
21.2.1. The Architecture of Critique
A Critic-Generator/Refiner loop consists of three primary components:
- The Generator: A creative, high-temperature model tasked with producing the initial candidate solution.
- The Critic: A rigorous, low-temperature model (or deterministic tool) tasked with identifying flaws.
- The Refiner: A model that takes the Draft + Critique and produces the Final Version.
graph TD
User[User Request] --> Gen[Generator Model]
Gen --> Draft[Draft Output]
Draft --> Critic[Critic Model]
Critic --> Feedback{Pass / Fail?}
Feedback -- "Pass" --> Final[Final Response]
Feedback -- "Fail (Issues Found)" --> Refiner[Refiner Model]
Refiner --> Draft
Why Use Two Models?
Can’t the model just critique itself? Yes, Self-Correction is a valid pattern (discussed in 21.5), but Cross-Model Critique offers distinct advantages:
- Blind Spot Removal: A model often shares the same biases in verification as it does in generation. If it “thought” a hallucinated fact was true during generation, it likely still “thinks” it’s true during self-verification. A separate model (e.g., Claude 3 critiquing GPT-4) breaks this correlation of error.
- Specialization: You can use a creative model (high temperature) for generation and a logic-optimized model (low temperature) for critique.
21.2.2. Pattern 1: The Syntax Guardrail (Deterministic Critic)
The simplest critic is a compiler or a linter. This is the Code Interpreter pattern.
Scenario: Generating SQL.
Generator: Llama-3-70B-Instruct
Critic: PostgreSQL Explain (Tool)
Implementation Checklist
- Generate SQL query.
- Execute EXPLAIN on the query against a real (or shadow) database.
- Catch Error: If the DB returns “Column ‘usr_id’ does not exist”, capture this error.
- Refine: Send the original Prompt + Wrong SQL + DB Error message back to the model. “You tried this SQL, but the DB said X. Fix it.”
This loop turns a “hallucination” into a “learning opportunity”.
Code Example: SQL Validating Generator
import sqlite3
from openai import OpenAI
client = OpenAI()
SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, signup_date DATE);
CREATE TABLE orders (id INTEGER, user_id INTEGER, amount REAL);
"""
def run_sql_critic(query: str) -> str | None:
"""Returns None if valid, else error message."""
try:
# Use an in-memory DB for syntax checking
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(query) # Try running it
return None
except Exception as e:
return str(e)
def robust_sql_generator(prompt: str, max_retries=3):
messages = [
{"role": "system", "content": f"You are a SQL expert. Schema: {SCHEMA}"},
{"role": "user", "content": prompt}
]
for attempt in range(max_retries):
# 1. Generate
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=messages
)
sql = response.choices[0].message.content
# 2. Critique
error = run_sql_critic(sql)
if not error:
return sql # Success!
print(f"Attempt {attempt+1} Failed: {error}")
# 3. Refine Context
messages.append({"role": "assistant", "content": sql})
messages.append({"role": "user", "content": f"That query failed with error: {error}. Please fix it."})
raise Exception("Failed to generate valid SQL after retries")
This simple loop solves 90% of “Model made up a column name” errors without changing the model itself.
21.2.3. Pattern 2: The LLM Critic (Constitutional AI)
For tasks where there is no compiler (e.g., “Write a polite email” or “Summarize this without bias”), we must use an LLM as the critic.
The Constitutional AI approach (pioneered by Anthropic) involves giving the Critic a “Constitution”—a set of principles to verify against.
Constitution Examples:
- “The response must not offer legal advice.”
- “The response must address the user as ‘Your Highness’ (Tone check).”
- “The summary must cite a specific number from the source text.”
The “Critique-Refine” Chain
CRITIC_PROMPT = """
You are a Quality Assurance Auditor.
Review the DRAFT provided below against the following Checklist:
1. Is the tone professional?
2. Are there any unsupported claims?
3. Did it answer the specific question asked?
Output format:
{
"pass": boolean,
"critique": "string description of flaws",
"score": 1-10
}
"""
def generate_with_audit(user_prompt):
# 1. Draft
draft = call_llm(user_prompt, model="gpt-4o")
# 2. Audit
    # call_llm is assumed to return the parsed JSON dict here, since CRITIC_PROMPT demands JSON output
    audit_json = call_llm(
        f"User asked: {user_prompt}\n\nDraft: {draft}",
        system=CRITIC_PROMPT,
        model="gpt-4o"  # or a strong judge model
    )
    if audit_json['pass']:
return draft
# 3. Refine
final = call_llm(
f"Original Prompt: {user_prompt}\nDraft: {draft}\nCritique: {audit_json['critique']}\n\nPlease rewrite the draft to address the critique.",
model="gpt-4o"
)
return final
Tuning the Critic
The Critic must be Stricter than the Generator.
- Generator Temperature: 0.7 (Creativity).
- Critic Temperature: 0.0 (Consistency).
If the Critic is too lenient, the loop does nothing. If too strict, it causes an infinite loop of rejection (see 21.2.6 Operational Risks).
21.2.4. Pattern 3: The “Red Team” Loop
In security-sensitive applications, the Critic acts as an Adversary. This is internal Red Teaming.
Application: Financial Advice Chatbot. Generator: Produces advice. Red Team Critic: “Try to interpret this advice as a scam or illegal financial promotion. Can this be misunderstood?”
If the Red Team model can “jailbreak” or “misinterpret” the draft, it is rejected.
Example Exchange:
- Gen: “You should invest in Index Funds for steady growth.”
- Critic (Persona: SEC Regulator): “Critique: This sounds like specific financial advice. You did not include the disclaimer ‘This is not financial advice’. Potential liability.”
- Refiner: “This is not financial advice. However, historically, Index Funds have shown steady growth…”
This loop runs before the user ever sees the message.
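A minimal sketch of this pre-send gate, assuming the same hypothetical call_llm(prompt, system=..., temperature=...) helper used in the Critique-Refine example above:
import json
RED_TEAM_PROMPT = """
You are an adversarial compliance reviewer for a financial chatbot.
Try to interpret the DRAFT as specific financial advice, a scam, or an illegal promotion.
Output JSON: {"exploitable": boolean, "attack": "how a reader could misinterpret it"}
"""
def red_team_gate(draft: str, max_rounds: int = 2) -> str:
    for _ in range(max_rounds):
        # Hypothetical call_llm(prompt, system=..., temperature=...) -> str
        verdict = json.loads(call_llm(f"DRAFT: {draft}", system=RED_TEAM_PROMPT, temperature=0.0))
        if not verdict["exploitable"]:
            return draft  # the adversary found nothing; ship it
        # Close the specific hole the adversary found
        draft = call_llm(
            f"Rewrite this draft so it cannot be misread as: {verdict['attack']}\n\nDRAFT: {draft}",
            system="You are a compliance-aware editor. Add disclaimers where needed.",
            temperature=0.0,
        )
    return draft  # best effort after max_rounds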
21.2.5. Deep Dive: “Chain of Verification” (CoVe)
A specific research breakthrough in critique loops is the Chain of Verification (CoVe) pattern. Using it drastically reduces hallucinations in factual Q&A.
The 4 Steps of CoVe:
- Draft: Generate a baseline response.
- Plan Verification: Generate a set of validation questions based on the draft.
  - Draft: “The franticola fruit is native to Mars.”
  - Verification Question: “Is there a fruit called franticola? Is it native to Mars?”
- Execute Verify: Answer the validation questions independently (often using Search/RAG).
  - Answer: “Search returned 0 results for franticola.”
- Final Polish: Rewrite the draft incorporating the verification answers.
  - Final: “There is no known fruit called franticola.”
Implementation Blueprint
def chain_of_verification(query):
# Step 1: Baseline
draft = generate(query)
# Step 2: Generate Questions
questions_str = generate(f"Read this draft: '{draft}'. list 3 factual claims as yes/no questions to verify.")
questions = parse_list(questions_str)
# Step 3: Answer Questions (ideally with Tools)
evidence = []
for q in questions:
# Crucial: The verification step should ideally use different info source
# e.g., Google Search Tool
ans = search_tool(q)
evidence.append(f"Q: {q} A: {ans}")
# Step 4: Rewrite
final_prompt = f"""
Original Query: {query}
Draft Response: {draft}
Verification Results:
{evidence}
Rewrite the draft. Remove any hallucinations disproven by the verification results.
"""
return generate(final_prompt)
This pattern is heavy on tokens (4x cost), but essential for high-trust domains like medical or legal Q&A.
21.2.6. Operational Risks of Critique Loops
While powerful, these loops introduce new failure modes.
1. The Infinite Correction Loop
Scenario: The Critic hates everything.
- Gen: “X”
- Critic: “Too verbose.”
- Refiner: “x”
- Critic: “Too brief.”
- Refiner: “X” …
Fix: Max Retries (n=3) and Decay.
If attempt > 2, force the Critic to accept the best effort, or fallback to a human operator.
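A sketch of that escape hatch, with hypothetical generate, critique, and escalate_to_human helpers; the key detail is remembering the best-scoring draft so the final fallback is never worse than the first attempt:
def bounded_refine(prompt: str, max_retries: int = 3):
    best_draft, best_score = None, -1
    draft = generate(prompt)  # hypothetical generator call
    for attempt in range(max_retries):
        # Hypothetical critic: returns {"pass": bool, "score": 1-10, "critique": str}
        verdict = critique(draft)
        if verdict["score"] > best_score:
            best_draft, best_score = draft, verdict["score"]
        if verdict["pass"]:
            return draft
        draft = generate(f"{prompt}\n\nPrevious draft: {draft}\nCritique: {verdict['critique']}\nFix it.")
    # Decay: after max_retries, accept the best effort or hand off to a human
    if best_score >= 6:
        return best_draft
    return escalate_to_human(prompt, best_draft)  # hypothetical fallback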
2. The “Silo” Collapse (Mode Collapse)
If the Generator and Critic are the exact same model (e.g., both GPT-4), the Critic might just agree with the Generator because they share the same training weights. “I wrote it, so it looks right to me.”
Fix: Model Diversity.
Use GPT-4 to critique Claude 3. Or use Llama-3-70B to critique Llama-3-8B.
Using a Stronger Model to critique a Weaker Model is a very cost-effective strategy.
- Gen: Llama-3-8B (Cheap).
- Critic: GPT-4o (Expensive, but only runs once and outputs short “Yes/No”).
- Result: GPT-4 quality at Llama-3 prices (mostly).
3. Latency Explosion
A simplistic loop triples your latency (Gen + Critic + Refiner). Fix: Optimistic Streaming. Stream the Draft to the user while the Critic is running. If the Critic flags an issue, you send a “Correction” patch or a UI warning. (Note: This is risky for safety filters, but fine for factual quality checks).
21.2.7. Performance Optimization: The “Critic” Quantization
The Critic often doesn’t need to be creative. It needs to be discriminating. Discriminative tasks often survive quantization better than generative tasks.
You can fine-tune a small model (e.g., Mistral-7B) specifically to be a “Policy Auditor”.
Training Data:
- Input: “User Intent + Draft Response”
- Output: “Pass” or “Fail: Reason”.
A fine-tuned 7B model can outperform GPT-4 on specific compliance checks (e.g., “Check if PII is redacted”) because it is hyper-specialized.
Fine-Tuning a Critic
- Generate Data: Use GPT-4 to critique 10,000 outputs. Save the (Draft, Critique) pairs.
- Train: Fine-tune Mistral-7B to predict the Critique from the Draft.
- Deploy: Run this small model as a sidecar guardrail.
This reduces the cost of the loop from $0.03/run to $0.0001/run.
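A sketch of the data-generation step (step 1 above); the auditor prompt, the critique_fn labeler, and the chat-style JSONL layout are illustrative, not a fixed schema:
import json
AUDITOR_SYSTEM = "You are a Policy Auditor. Output 'Pass' or 'Fail: <reason>'."
def build_critic_dataset(logs, critique_fn, out_path="critic_train.jsonl"):
    """logs: iterable of (user_intent, draft) pairs; critique_fn: GPT-4-backed labeler."""
    with open(out_path, "w") as f:
        for intent, draft in logs:
            label = critique_fn(intent, draft)  # e.g. "Fail: email address not redacted"
            record = {
                "messages": [
                    {"role": "system", "content": AUDITOR_SYSTEM},
                    {"role": "user", "content": f"Intent: {intent}\nDraft: {draft}"},
                    {"role": "assistant", "content": label},
                ]
            }
            f.write(json.dumps(record) + "\n")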
21.2.8. Case Study: Automated Code Review Bot
A practical application of a disconnected Critic Loop is an Automated Pull Request Reviewer.
Workflow:
- Trigger: New PR opened.
- Generator (Scanner): Scans diffs. For each changed function, generates a summary.
- Critic (Reviewer): Looks at the (Code + Summary).
- Checks for: Hardcoded secrets, O(n^2) loops in critical paths, missing tests.
- Filter: If Severity < High, discard the critique. (Don’t nag devs about whitespace).
- Action: Post comment on GitHub.
In MLOps, this agent runs in CI/CD. The “Critic” here is acting as a senior engineer. The value is not in creating code, but in preventing bad code.
21.2.9. Advanced Pattern: Multi-Critic Consensus
For extremely high-stakes decisions (e.g., medical diagnosis assistance), one Critic is not enough. We use a Panel of Critics.
graph TD
Draft[Draft Diagnosis] --> C1[Critic: Toxicologist]
Draft --> C2[Critic: Cardiologist]
Draft --> C3[Critic: General Practitioner]
C1 --> V1[Vote/Feedback]
C2 --> V2[Vote/Feedback]
C3 --> V3[Vote/Feedback]
V1 & V2 & V3 --> Agg[Aggregator LLM]
Agg --> Final[Final Consensus]
This mimics a hospital tumor board. C1 might be prompted with “You are a toxicology expert…”, C2 with “You are a heart specialist…”.
The Aggregator synthesizes the different viewpoints. “The cardiologist suggests X, but the toxicologist warns about interaction Y.”
This is the frontier of Agentic Reasoning.
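A sketch of such a panel, again assuming a hypothetical call_llm(prompt, system=..., temperature=...) helper:
PANEL = {
    "Toxicologist": "You are a toxicology expert. Review the draft for drug interactions and dosing risks.",
    "Cardiologist": "You are a heart specialist. Review the draft for cardiac contraindications.",
    "General Practitioner": "You are a GP. Review the draft for overall plausibility and missing basics.",
}
def panel_review(draft: str) -> str:
    votes = []
    for role, persona in PANEL.items():
        feedback = call_llm(
            f"DRAFT:\n{draft}\n\nGive a verdict (AGREE/OBJECT) and a one-sentence reason.",
            system=persona, temperature=0.0)
        votes.append(f"{role}: {feedback}")
    # The Aggregator synthesizes the (possibly conflicting) viewpoints
    return call_llm(
        f"DRAFT:\n{draft}\n\nPanel feedback:\n" + "\n".join(votes)
        + "\n\nWrite a final version that addresses every objection, or flag the case for human review.",
        system="You are the chair of a clinical review board.", temperature=0.0)
Because the persona reviews are independent, they can be issued in parallel, so the panel adds roughly one critic’s latency rather than three.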
21.2.10. Deep Dive: Implementing Robust Chain of Verification (CoVe)
The “Chain of Verification” pattern is so central to factual accuracy that it deserves a full reference implementation. We will build a reusable Python class that wraps any LLM client to add verification superpowers.
The VerifiableAgent Architecture
We will implement a class that takes a query, generates a draft, identifies claims, verifies them using a Search Tool (mocked here), and produces a cited final answer.
import re
import json
from typing import List, Dict, Any
from dataclasses import dataclass
@dataclass
class VerificationFact:
claim: str
verification_question: str
verification_result: str
is_supported: bool
class SearchTool:
"""Mock search tool for demonstration."""
def search(self, query: str) -> str:
# In prod, connect to Tavily, SerpAPI, or Google Custom Search
db = {
"current ceo of twitter": "Linda Yaccarino is the CEO of X (formerly Twitter).",
"population of mars": "The current population of Mars is 0 humans.",
"release date of gta 6": "Rockstar Games confirmed GTA 6 is coming in 2025."
}
for k, v in db.items():
if k in query.lower():
return v
return "No specific information found."
class VerifiableAgent:
def __init__(self, client, model="gpt-4o"):
self.client = client
self.model = model
self.search_tool = SearchTool()
def _call_llm(self, messages: List[Dict], json_mode=False) -> str:
kwargs = {"model": self.model, "messages": messages}
if json_mode:
kwargs["response_format"] = {"type": "json_object"}
response = self.client.chat.completions.create(**kwargs)
return response.choices[0].message.content
def generate_draft(self, query: str) -> str:
return self._call_llm([
{"role": "system", "content": "You are a helpful assistant. Answer the user query directly."},
{"role": "user", "content": query}
])
def identify_claims(self, draft: str) -> List[Dict]:
"""Extracts checkable claims from the draft."""
prompt = f"""
Analyze the following text and extract discrete, factual claims that verify specific entities, dates, or numbers.
Ignore opinions or general advice.
Text: "{draft}"
Output JSON: {{ "claims": [ {{ "claim": "...", "verification_question": "..." }} ] }}
"""
response = self._call_llm([{"role": "user", "content": prompt}], json_mode=True)
return json.loads(response)["claims"]
def verify_claims(self, claims: List[Dict]) -> List[VerificationFact]:
results = []
for item in claims:
# 1. Search
evidence = self.search_tool.search(item["verification_question"])
# 2. Judge (The Critic Step)
judge_prompt = f"""
Claim: "{item['claim']}"
Evidence: "{evidence}"
Does the evidence support the claim?
Output JSON: {{ "supported": bool, "reason": "..." }}
"""
judgment = json.loads(self._call_llm([{"role": "user", "content": judge_prompt}], json_mode=True))
results.append(VerificationFact(
claim=item['claim'],
verification_question=item['verification_question'],
verification_result=evidence,
is_supported=judgment['supported']
))
return results
def rewrite(self, query: str, draft: str, verifications: List[VerificationFact]) -> str:
# Filter to only keep relevant facts
facts_str = "\n".join([
f"- Claim: {v.claim}\n Evidence: {v.verification_result}\n Supported: {v.is_supported}"
for v in verifications
])
prompt = f"""
Original Query: {query}
Original Draft: {draft}
Verification Report:
{facts_str}
Task: Rewrite the Draft.
1. Remove any claims that originated in the draft but were marked 'Supported: False'.
2. Cite the evidence where appropriate.
3. If evidence was 'No info found', state uncertainty.
"""
return self._call_llm([{"role": "user", "content": prompt}])
def run(self, query: str) -> Dict:
print(f"--- Processing: {query} ---")
# 1. Draft
draft = self.generate_draft(query)
print(f"[Draft]: {draft[:100]}...")
# 2. Plan
claims = self.identify_claims(draft)
print(f"[Claims]: Found {len(claims)} claims.")
# 3. Verify
verifications = self.verify_claims(claims)
# 4. Refine
final = self.rewrite(query, draft, verifications)
print(f"[Final]: {final[:100]}...")
return {
"draft": draft,
"verifications": verifications,
"final": final
}
Why This Matters for MLOps
This is code you can unit test.
- You can test identify_claims with a fixed text.
- You can test verify_claims with mocked search results.
- You can trace the cost (4 searches + 4 LLM calls) and optimize.
This moves Prompt Engineering from “Guesswork” to “Software Engineering”.
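As an illustration, a pytest-style sketch for verify_claims that stubs both the search tool and the LLM judge, so the test runs offline and deterministically:
from unittest.mock import MagicMock
def test_verify_claims_flags_unsupported_claim():
    agent = VerifiableAgent(client=MagicMock())
    # Stub the search tool and the LLM judge so the test is deterministic and offline
    agent.search_tool.search = lambda q: "The current population of Mars is 0 humans."
    agent._call_llm = lambda messages, json_mode=False: '{"supported": false, "reason": "Contradicted by evidence"}'
    claims = [{"claim": "One million people live on Mars.",
               "verification_question": "What is the population of Mars?"}]
    results = agent.verify_claims(claims)
    assert len(results) == 1
    assert results[0].is_supported is False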
21.2.11. Deep Dive: Constitutional AI with LangChain
Managing strict personas for Critics is difficult with raw strings. LangChain’s ConstitutionalChain provides a structured way to enforce principles.
This example demonstrates how to enforce a “Non-Violent” and “Concise” constitution.
Implementation
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import ConstitutionalChain, LLMChain
from langchain.chains.constitutional_ai.models import ConstitutionalPrinciple
# 1. Define Principles
# Principle: The critique criteria
# Correction: How to guide the rewrite
principle_concise = ConstitutionalPrinciple(
name="Conciseness",
critique_request="Identify any verbose sentences or redundant explanation.",
revision_request="Rewrite the text to be as concise as possible, removing filler words."
)
principle_safe = ConstitutionalPrinciple(
name="Safety",
critique_request="Identify any content that encourages dangerous illegal acts.",
revision_request="Rewrite the text to explain why the act is dangerous, without providing instructions."
)
# 2. Setup Base Chain (The Generator)
llm = OpenAI(temperature=0.9) # High temp for creativity
qa_chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template("Answer this: {question}"))
# 3. Setup Constitutional Chain (The Critic Loop)
# This chain automatically handles the Draft -> Critique -> Refine loop
constitutional_chain = ConstitutionalChain.from_llm(
llm=llm, # Usually use a smarter/different model here!
chain=qa_chain,
constitutional_principles=[principle_safe, principle_concise],
verbose=True # Shows the critique process
)
# 4. execution
query = "How do I hotwire a car quickly?"
result = constitutional_chain.run(query)
# Output Stream:
# > Entering new ConstitutionalChain...
# > Generated: "To hotwire a car, strip the red wire and..." (Dangerous!)
# > Critique (Safety): "The model is providing instructions for theft."
# > Revision 1: "I cannot teach you to hotwire a car as it is illegal. However, the mechanics of ignition involve..."
# > Critique (Conciseness): "The explanation of ignition mechanics is unnecessary filler."
# > Revision 2: "I cannot assist with hotwiring cars, as it is illegal."
# > Finished.
MLOps Implementation Note
In production, you do not want to run this chain for every request (cost!). Strategy: Sampling. Run the Constitutional Chain on 100% of “High Risk” tier queries (detected by the Router) and 5% of “Low Risk” queries. Use the data from the 5% to fine-tune the base model to be naturally safer and more concise, reducing the need for the loop over time.
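A sketch of that sampling policy, reusing qa_chain and constitutional_chain from above, with a hypothetical log_training_pair helper:
import random
SAMPLE_RATES = {"high_risk": 1.0, "low_risk": 0.05}  # illustrative tiers
def answer(query: str, risk_tier: str) -> str:
    draft = qa_chain.run(question=query)  # fast single-pass generation
    if random.random() < SAMPLE_RATES.get(risk_tier, 1.0):
        # Full Draft -> Critique -> Refine loop, plus a training pair for later fine-tuning
        revised = constitutional_chain.run(query)
        log_training_pair(query, draft, revised)  # hypothetical logging helper
        return revised
    return draft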
21.2.12. Guardrails as Critics: NVIDIA NeMo
For enterprise MLOps, you often need faster, deterministic checks. NeMo Guardrails is a framework that acts as a Critic layer using a specialized syntax (Colang).
Architecture
NeMo intercepts the user message and the bot response. It uses embeddings to map the conversation to “canonical forms” (flows) and enforces rules.
config.yml
models:
- type: main
engine: openai
model: gpt-4o
rails:
input:
flows:
- self check input
output:
flows:
- self check output
rails.co (Colang Definitions)
define user ask about politics
"Who should I vote for?"
"Is the president doing a good job?"
define bot refuse politics
"I am an AI assistant and I do not have political opinions."
# Flow: Pre-emptive Critic (Input Rail)
define flow politics
user ask about politics
bot refuse politics
stop
# Flow: Fact Checking Critic (Output Rail)
define subflow self check output
$check_result = execute check_facts(input=$last_user_message, output=$bot_message)
if $check_result == False
bot inform hallucination detected
stop
NeMo Guardrails is powerful because it formalizes the Critic. Instead of a vague prompt “Be safe”, you define specific semantic clusters (“user ask about politics”) that trigger hard stops. This is Hybrid Governance—combining the flexibility of LLMs with the rigidity of policies.
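Wiring the rails into an application takes only a few lines with the nemoguardrails package; a sketch, assuming the config.yml and rails.co above live in a ./guardrails_config directory:
from nemoguardrails import LLMRails, RailsConfig
# Loads config.yml and the .co files from the directory
config = RailsConfig.from_path("./guardrails_config")
rails = LLMRails(config)
# The input/output rails run around the main model call
response = rails.generate(messages=[
    {"role": "user", "content": "Who should I vote for?"}
])
print(response["content"])
# Expected (given the rails above): "I am an AI assistant and I do not have political opinions."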
21.2.13. Benchmarking Your Critic
A Critic is a model. Therefore, it has metrics. You must measure your Critic’s performance independently of the Generator.
The Confusion Matrix of Critique
| | Draft has Error | Draft is Correct |
|---|---|---|
| Critic Flags Error | True Positive (Good Catch) | False Positive (Annoying Nagger) |
| Critic Says Pass | False Negative (Safety Breach) | True Negative (Efficiency) |
Metric Definitions
- Recall (Safety Score): TP / (TP + FN).
  - “Of all the bad answers, how many did the critic catch?”
  - Example: If the generator output 10 toxic messages and the critic caught 8, Recall = 0.8.
- Precision (Annoyance Score): TP / (TP + FP).
  - “Of all the times the critic complained, how often was it actually right?”
  - Example: If the critic flagged 20 messages, but 10 were actually fine, Precision = 0.5.
Trade-off:
- High Recall = Low Risk, High Cost (Rewriting good answers).
- High Precision = High Risk, Low Cost.
The “Critic-Eval” Dataset
To calculate these, you need a labeled dataset of (Prompt, Draft, Label).
- Create: Take 500 historic logs.
- Label: Have humans mark them as “Pass” or “Fail”.
- Run: Run your Critic Prompt on these 500 drafts.
- Compare: Compare Critic Output vs Human Label.
If your Critic’s correlation with human labelers is < 0.7, do not deploy the loop. A bad critic is worse than no critic, as it adds latency without adding reliability.
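A sketch of steps 3 and 4, where critic is whatever callable wraps your Critic prompt and returns “Pass” or “Fail”:
def benchmark_critic(dataset, critic):
    """dataset: list of dicts with 'prompt', 'draft', 'human_label' in {'Pass', 'Fail'}."""
    tp = fp = fn = tn = 0
    for row in dataset:
        flagged = critic(row["prompt"], row["draft"]) == "Fail"
        actually_bad = row["human_label"] == "Fail"
        if flagged and actually_bad: tp += 1
        elif flagged and not actually_bad: fp += 1
        elif not flagged and actually_bad: fn += 1
        else: tn += 1
    return {
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,      # Safety Score
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,   # Annoyance Score
        "agreement": (tp + tn) / len(dataset),               # raw agreement with human labels
    }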
21.2.14. The Refiner: The Art of Styleshifting
The third component of the loop is the Refiner. Sometimes the critique is valid (“Tone is too casual”), but the Refiner overcorrects (“Tone becomes Shakespearean”).
Guided Refinement Prompts
Don’t just say “Fix it.” Say: “Rewrite this specific section: [quote]. Change X to Y. Keep the rest identical.”
Edit Distance Minimization
A good MLOps practice for Refiners is to minimize the Levenshtein distance between Draft and Final, subject to satisfying the critique. We want the Minimum Viable Change.
Prompt Pattern:
You are a Surgical Editor.
Origin Text: {text}
Critique: {critique}
Task: Apply the valid critique points to the text.
Constraint: Change as few words as possible. Do not rewrite the whole paragraph if changing one adjective works.
This preserves the “Voice” of the original generation while fixing the bugs.
21.2.15. Handling Interaction: Multi-Turn Critique
Sometimes the Critic is the User. “That allows me to import the library, but I’m getting a Version Conflict error.”
This is a Human-in-the-Loop Critic. The architectural challenge here is Context Management. The Refiner must see:
- The Original Plan.
- The First Attempt Code.
- The User’s Error Message.
The “Stack” Problem: If the user corrects the model 10 times, the context window fills up with broken code. Strategy: Context Pruning. When a Refinement is successful (User says “Thanks!”), the system should (in the background) summarize the learning and clear the stack of 10 failed attempts, replacing them with the final working snippet. This keeps the “Working Memory” clean for the next task.
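A sketch of that pruning step, assuming a plain list-of-messages history and a hypothetical summarize helper:
def prune_after_success(messages, summarize):
    """Collapse a resolved debugging exchange into a compact memory entry.

    messages: the full chat history; summarize: hypothetical LLM call returning a one-line lesson.
    Call this when the user confirms the fix worked (e.g. "Thanks, that works!").
    """
    original_task = messages[0]   # the original request
    final_snippet = messages[-2]  # the assistant message the user just accepted
    lesson = summarize(messages)  # e.g. "pandas 2.x removed DataFrame.append; use pd.concat"
    return [
        original_task,
        {"role": "assistant", "content": final_snippet["content"]},
        {"role": "system", "content": f"Note from previous debugging: {lesson}"},
    ]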
21.2.16. Implementation: The Surgical Refiner
The Refiner is often the weak link. It tends to hallucinate new errors while fixing old ones. We can force stability by using a Diff-Guided Refiner.
Code Example: Diff-Minimizing Refiner
import difflib
from openai import OpenAI
client = OpenAI()
def surgical_refine(original_text, critique, intent):
"""
Refines text based on critique, but penalizes large changes.
"""
SYSTEM_PROMPT = """
You are a Minimalist Editor.
Your Goal: Fix the text according to the Critique.
Your Constraint: Keep the text as close to the original as possible.
Do NOT rewrite sentences that are not affected by the critique.
"""
USER_PROMPT = f"""
ORIGINAL:
{original_text}
CRITIQUE:
{critique}
INTENT:
{intent}
Output the corrected text only.
"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": USER_PROMPT}
],
temperature=0.0 # Strict determinism
)
new_text = response.choices[0].message.content
# Calculate Change Percentage
s = difflib.SequenceMatcher(None, original_text, new_text)
similarity = s.ratio()
print(f"Refinement Similarity: {similarity:.2f}")
if similarity < 0.8:
print("WARNING: Refiner rewrote >20% of the text. This might be unsafe.")
return new_text
Why the Similarity Check?
If the critique is “Fix the spelling of ‘colour’ to ‘color’”, the similarity should be 99.9%. If the model rewrites the whole paragraph, similarity drops to 50%.
By monitoring similarity, we can detect Refiner Hallucinations (e.g., the Refiner deciding to change the tone randomly).
21.2.17. Case Study: The Medical Triage Critic
Let’s look at a high-stakes example where the Critic is a Safety Layer.
System: AI Symptom Checker. Goal: Advise users on whether to see a doctor.
The Problem
The Generative Model (GPT-4) is helpful but sometimes overly reassuring. User: “I have a crushing chest pain radiating to my left arm.” Gen (Hallucination): “It might be muscle strain. Try stretching.” -> FATAL ERROR.
The Solution: The “Red Flag” Critic
Architecture:
- Generator: Produces advice.
- Critic: Med-PaLM 2 (or a prompt-engineered GPT-4) focused only on urgency.
  - Prompt: “Does the user description match any entry in the Emergency Triage List (Heart Attack, Stroke, Sepsis)? If yes, output EMERGENCY.”
- Override: If Critic says EMERGENCY, discard Generator output. Return hardcoded “Call 911” message.
Interaction Log
| Actor | Action | Content |
|---|---|---|
| User | Input | “My baby has a fever of 105F and is lethargic.” |
| Generator | Draft | “High fever is common. Keep them hydrated and…” |
| Critic | Review | DETECTED: Pediatric fever >104F + Lethargy = Sepsis Risk. VERDICT: FAIL (Critical). |
| System | Override | “Please go to the ER immediately. This requires urgent attention.” |
In this architecture, the Generative Model creates the “Bedside Manner” (polite conversation), but the Critic provides the “Clinical Guardrail”.
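A sketch of the override wiring, again with a hypothetical call_llm helper; note that the system, not the Generator, decides what the user sees:
EMERGENCY_MESSAGE = (
    "Based on what you've described, please call 911 or go to the nearest ER immediately. "
    "This may require urgent medical attention."
)
TRIAGE_CRITIC_PROMPT = (
    "Does the user description match any entry in the Emergency Triage List "
    "(heart attack, stroke, sepsis, pediatric fever >104F with lethargy)? "
    "Answer with exactly one word: EMERGENCY or OK."
)
def triage_reply(user_message: str) -> str:
    draft = call_llm(user_message, system="You are a friendly symptom-checker assistant.")
    verdict = call_llm(user_message, system=TRIAGE_CRITIC_PROMPT, temperature=0.0).strip().upper()
    if "EMERGENCY" in verdict:
        return EMERGENCY_MESSAGE  # hard override: the Generator's draft is discarded
    return draft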
21.2.18. Cost Analysis of Critique Loops
Critique loops are expensive.
Formula: Cost = (Gen_Input + Gen_Output) + (Critic_Input + Critic_Output) + (Refiner_Input + Refiner_Output)
Let’s break down a typical RAG Summary task (2k input context, 500 output).
Single Pass (GPT-4o):
- Input: 2k tokens * $5/1M = $0.01
- Output: 500 tokens * $15/1M = $0.0075
- Total: $0.0175
Critique Loop (GPT-4o Gen + GPT-4o Critic + GPT-4o Refiner):
- Phase 1 (Gen): $0.0175
- Phase 2 (Critic):
- Input (Prompt + Draft): 2.5k tokens = $0.0125
- Output (Critique): 100 tokens = $0.0015
- Phase 3 (Refiner):
- Input (Prompt + Draft + Critique): 2.6k tokens = $0.013
- Output (Final): 500 tokens = $0.0075
- Total: $0.052
Multiplier: The loop is 3x the cost of the single pass.
Optimization Strategy: The “Cheap Critic”
Use Llama-3-70B (Groq/Together) for the Critic and Refiner.
- Gen (GPT-4o): $0.0175
- Critic (Llama-3): $0.002
- Refiner (Llama-3): $0.002
- Total: $0.0215
Result: You get 99% of the reliability for only 20% extra cost (vs 200% extra).
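The arithmetic is worth encoding so you can re-run it whenever prices or prompt sizes change; a sketch using the per-million-token prices assumed in this section (the Llama-3 figures are illustrative):
# Illustrative prices in $ per 1M tokens; check your provider's current price sheet
PRICES = {"gpt-4o": (5.00, 15.00), "llama-3-70b": (0.60, 0.80)}
def call_cost(model, input_tokens, output_tokens):
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
single_pass = call_cost("gpt-4o", 2_000, 500)
cheap_loop = (call_cost("gpt-4o", 2_000, 500)          # Generator
              + call_cost("llama-3-70b", 2_500, 100)   # Critic reads prompt + draft
              + call_cost("llama-3-70b", 2_600, 500))  # Refiner reads prompt + draft + critique
print(f"Single pass: ${single_pass:.4f} | Cheap-critic loop: ${cheap_loop:.4f}")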
21.2.19. Visualizing Synchronous vs Asynchronous Critique
Depending on latency requirements, where does the critique sit?
A. Synchronous (Blocking)
High Latency, High Safety. Used for: Medical, Legal, Financial.
sequenceDiagram
participant User
participant Gen
participant Critic
participant UI
User->>Gen: Request
Gen->>Critic: Draft (Internal)
Critic->>Gen: Critique
Gen->>Gen: Refine
Gen->>UI: Final Response
UI->>User: Show Message
User Experience: “Thinking…” spinner for 10 seconds.
B. Asynchronous (Non-Blocking)
Low Latency, Retroactive Safety. Used for: Coding Assistants, Creative Writing.
sequenceDiagram
participant User
participant Gen
participant Critic
participant UI
User->>Gen: Request
Gen->>UI: Stream Draft immediately
UI->>User: Show Draft
par Background Check
Gen->>Critic: Check Draft
Critic->>UI: Flag detected!
end
UI->>User: [Pop-up] "Warning: This code may contain a bug."
User Experience: Instant response. Red squiggly line appears 5 seconds later.
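A sketch of the non-blocking variant with asyncio, assuming hypothetical stream_to_user, run_critic, and push_ui_warning helpers:
import asyncio
async def answer_with_background_check(request: str):
    # Stream the draft to the user immediately (hypothetical helper returns the full text)
    draft = await stream_to_user(request)
    async def background_check():
        warning = await run_critic(draft)   # hypothetical: returns None or a warning string
        if warning:
            await push_ui_warning(warning)  # e.g. "Warning: This code may contain a bug."
    asyncio.create_task(background_check())  # fire-and-forget; the user is already reading
    return draft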
21.2.20. Future Trends: Prover-Verifier Games
The ultimate evolution of the Critic-Generator loop is the Prover-Verifier Game (as seen in OpenAI’s research on math solving).
Instead of one generic critic, you train a Verifier Network on a dataset of “Solution Paths”.
- Generator: Generates 100 step-by-step solutions to a math problem.
- Verifier: Scores each step. “Step 1 looks valid.” “Step 2 looks suspect.”
- Outcome: The system selects the solution path with the highest cumulative verification score.
This is different from a simple Critic because it operates at the Process Level (reasoning steps) rather than the Outcome Level (final answer).
For MLOps, this means logging Traces (steps), not just pairs. Your dataset schema moves from (Input, Output) to (Input, Step1, Step2, Output).
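A sketch of what that schema shift looks like, with a step-level trace instead of a flat (Input, Output) record:
from dataclasses import dataclass, field
from typing import List, Optional
@dataclass
class ReasoningStep:
    index: int
    content: str
    verifier_score: Optional[float] = None  # filled in by the Verifier, step by step
@dataclass
class SolutionTrace:
    query: str
    steps: List[ReasoningStep] = field(default_factory=list)
    final_answer: str = ""
    @property
    def cumulative_score(self) -> float:
        return sum(s.verifier_score or 0.0 for s in self.steps)
def pick_best(traces: List[SolutionTrace]) -> SolutionTrace:
    # Select the solution path the Verifier trusts most
    return max(traces, key=lambda t: t.cumulative_score)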
21.2.21. Anti-Patterns in Critique Loops
Just as we discussed routing anti-patterns, critique loops have their own set of failure modes.
1. The Sycophantic Critic
Symptom: The Critic agrees with everything the Generator says, especially when the Generator is a stronger model.
Cause: Training data bias. Most instruction-tuned models are trained to be “helpful” and “agreeable”. They are biased towards saying “Yes”.
Fix: Break the persona. Don’t say “Critique this.” Say “You are a hostile red-teamer. Find one flaw. If you cannot find a flaw, invent a potential ambiguity.” It is easier to filter out a false-positive critique than to induce a critique from a sycophant.
2. The Nitpicker (Hyper-Correction)
Symptom: The Critic complains about style preferences (“I prefer ‘utilize’ over ‘use’”) rather than factual errors.
Result: The Refiner rewrites the text 5 times, degrading quality and hitting rate limits.
Fix: Enforce Severity Labels.
Prompt the Critic to output Severity: Low|Medium|High.
In your Python glue code, if severity == 'Low': pass. Only trigger the Refiner for High/Medium issues.
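A sketch of that glue code, assuming the Critic returns JSON with severity and critique fields, and a hypothetical refine helper:
import json
SEVERITY_RANK = {"Low": 0, "Medium": 1, "High": 2}
def maybe_refine(draft: str, critic_output: str) -> str:
    verdict = json.loads(critic_output)  # e.g. {"severity": "Low", "critique": "Prefer 'use' over 'utilize'."}
    if SEVERITY_RANK[verdict["severity"]] < SEVERITY_RANK["Medium"]:
        return draft  # Low-severity nitpicks never reach the Refiner
    return refine(draft, verdict["critique"])  # hypothetical Refiner call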
3. The Context Window Overflow
Symptom: Passing the full dialogue history + draft + critique + instructions exceeds the context window (or just gets expensive/slow).
Fix: Ephemeral Critique. You don’t need to keep the Critique in the chat history:
- Gen Draft.
- Gen Critique.
- Gen Final.
- Save only “User Prompt -> Final” to the database history. Discard the intermediate “thought process” unless you need it for debugging.
21.2.22. Troubleshooting: Common Loop Failures
| Symptom | Diagnosis | Treatment |
|---|---|---|
| Loop spins forever | max_retries not set or Refiner keeps triggering new critiques. | Implement max_retries=3. Implement temperature=0 for Refiner to ensure stability. |
| Refiner breaks code | Refiner fixes the logic bug but introduces a syntax error (e.g., missing imports) because it didn’t see the full file. | Give Refiner the Full File Context, not just the snippet. Use a Linter/Compiler as a 2nd Critic. |
| Latency > 15s | Sequential processing of slow models. | Switch to Speculative Decoding or Asynchronous checks. Use smaller models (Haiku/Flash) for the Critic. |
| “As an AI…” | Refiner refuses to generate the fix because the critique touched a safety filter. | Tune the Safety Filter (guard-rails) to be context-aware. “Discussing a bug in a bomb-detection script is not the same as building a bomb.” |
21.2.23. Reference: Critic System Prompts
Good prompts are the “hyperparameters” of your critique loop. Here are battle-tested examples for common scenarios.
1. The Security Auditor (Code)
Role: AppSec Engineer
Objective: Identify security vulnerabilities in the provided code snippet.
Focus Areas: SQL Injection, XSS, Hardcoded Secrets, Insecure Deserialization.
Instructions:
1. Analyze the logic flow.
2. If a vulnerability exists, output: "VULNERABILITY: [Type] - [Line Number] - [Explanation]".
3. If no vulnerability exists, output: "PASS".
4. Do NOT comment on style or clean code, ONLY security.
2. The Brand Guardian (Tone)
Role: Senior Brand Manager
Objective: Ensure the copy aligns with the "Helpful, Humble, and Human" brand voice.
Guidelines:
- No jargon (e.g., "leverage", "synergy").
- No passive voice.
- Be empathetic but not apologetic.
Draft: {text}
Verdict: [PASS/FAIL]
Critique: (If FAIL, list specific words to change).
3. The Hallucination Hunter (QA)
Role: Fact Checker
Objective: Verify if the Draft Answer is supported by the Retrieved Context.
Retrieved Context:
{context}
Draft Answer:
{draft}
Algorithm:
1. Break Draft into sentences.
2. For each sentence, check if it is fully supported by Context.
3. If a sentence contains info NOT in Context, flag as HALLUCINATION.
Output:
{"status": "PASS" | "FAIL", "hallucinations": ["list of unsupported claims"]}
4. The Logic Prover (Math/Reasoning)
Role: Math Professor
Objective: Check the steps of the derivation.
Draft Solution:
{draft}
Task:
Go step-by-step.
Step 1: Verify calculation.
Step 2: Verify logic transition.
If any step is invalid, flag it. Do not check the final answer, check the *path*.
21.2.24. Summary Checklist for Critique Loops
To implement reliable self-correction:
- Dual Models: Ensure the Critic is distinct (or distinctly prompted) from the Generator.
- Stop Words: Ensure the Critic has clear criteria for “Pass” vs “Fail”.
- Loop Limit: Hard-code a max_retries break to prevent infinite costs.
- Verification Tools: Give the Critic access to Ground Truth (Search, DB, Calculator) whenever possible.
- Latency Budget: Decide if the critique happens before the user sees output (Synchronous) or after (Asynchronous/email follow-up).
- Golden Set: Maintain a dataset of “Known Bad Drafts” to regression test your Critic.
- Diff-Check: Monitor the edit_distance of refinements to prevent over-correction.
In the next section, 21.3 Consensus Mechanisms, we will look at how to scale this from one critic to a democracy of models voting on the best answer.