
21.3 Automated Prompt Optimization (APO)

We have managed prompts (21.1) and measured them (21.2). Now we optimize them.

Manual Prompt Engineering is tedious. “Act as a pirate… no, a polite pirate… no, a helpful polite pirate.” This is Stochastic Gradient Descent by Hand. It is inefficient. We should let the machine do it.


1. The Paradigm Shift: Prompts are Weights

In Traditional ML, we don’t hand-write the weights of a Neural Network. We define a Loss Function and let the optimizer find the weights. In GenAI, the “Prompt” is just a set of discrete weights (tokens) that condition the model. APO treats the Prompt as a learnable parameter.

$\text{Prompt}_{t+1} = \text{Optimize}(\text{Prompt}_t, \text{Loss})$


2. APE: Automatic Prompt Engineer

The paper that started it (Zhou et al., 2022). Idea: Use an LLM to generate prompts, score them, and select the best.

2.1. The Algorithm

  1. Proposal: Ask GPT-4: “Generate 50 instruction variations for this task.”
    • Input: “Task: Add two numbers.”
    • Variations: “Sum X and Y”, “Calculate X+Y”, “You are a calculator…”
  2. Scoring: Run all 50 prompts on a Validation Set. Calculate Accuracy.
  3. Selection: Pick the winner.

2.2. APE Implementation Code

def ape_optimize(task_description, eval_dataset):
    # 1. Generate candidate instructions
    #    (gpt4.generate is a placeholder for any LLM completion call)
    candidates = gpt4.generate(
        f"Generate 10 distinct prompt instructions for: {task_description}"
    )

    leaderboard = []

    # 2. Score each candidate on the validation set
    #    (run_eval is a placeholder for the evaluation harness from 21.2)
    for prompt in candidates:
        score = run_eval(prompt, eval_dataset)
        leaderboard.append((score, prompt))

    # 3. Sort by score (descending) and return the best (score, prompt) pair
    leaderboard.sort(key=lambda item: item[0], reverse=True)
    return leaderboard[0]

Result: APE often finds prompts that humans wouldn’t think of.

  • Human: “Let’s think step by step.”
  • APE: “Let’s work this out in a step by step way to be sure we have the right answer.” (Often ~2% better on reasoning benchmarks.)

3. DSPy: The Compiler for Prompts

DSPy (Stanford NLP) is the biggest leap in PromptOps. It stops treating prompts as strings and treats them as Modules.

3.1. The “Teleprompter” (Optimizer)

You define:

  1. Signature: Input -> Output.
  2. Module: ChainOfThought.
  3. Metric: Accuracy.

DSPy compiles this into a prompt. It can automatically find the best “Few-Shot Examples” to include in the prompt to maximize the metric.

3.2. Code Deep Dive: DSPy RAG

import dspy
from dspy.teleprompt import BootstrapFewShot

# 1. Setup LM
# (a retrieval model must also be configured for dspy.Retrieve to work,
#  e.g. dspy.settings.configure(lm=turbo, rm=your_retriever))
turbo = dspy.OpenAI(model='gpt-3.5-turbo')
dspy.settings.configure(lm=turbo)

# 2. Define Signature (The Interface)
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

# 3. Define Module (The Logic)
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
    
    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

# 4. Compile (Optimize)
# We need a small training set of (Question, Answer) pairs.
trainset = [ ... ] 

teleprompter = BootstrapFewShot(metric=dspy.evaluate.answer_exact_match)
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)

# 5. Run
pred = compiled_rag("Where is Paris?")

What just happened? BootstrapFewShot ran the pipeline. It tried to answer the questions. If it got a question right, it saved that (Question, Thought, Answer) trace. It then added that trace as a Few-Shot Example to the prompt. It effectively “bootstrapped” its own training data to improve the prompt.
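To verify this, peek inside the compiled module. A minimal inspection sketch, assuming the inspect_history, named_predictors, and demos attributes of the DSPy version used above:

# Show the exact prompt (instruction + bootstrapped few-shot traces) used for the last call
turbo.inspect_history(n=1)

# Count the demonstrations each internal predictor picked up during compilation
for name, predictor in compiled_rag.named_predictors():
    print(name, "->", len(predictor.demos), "bootstrapped demos")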


4. TextGrad: Gradient Descent for Text

A newer approach (2024). It uses the “Feedback” (from the Judge) to explicitly edit the prompt.

4.1. The Critic Loop

  1. Prompt: “Write a poem.”
  2. Output: “Roses are red…”
  3. Judge: “Too cliché. Score 2/5.”
  4. Optimizer (Gradient): Ask LLM: “Given the prompt, output, and critique, how should I edit the prompt to improve the score?”
  5. Edit: “Write a poem using avant-garde imagery.”
  6. Loop.

4.2. TextGrad Operations

This is more expensive than APE because it requires several LLM calls per iteration (generation, critique, and prompt edit). But it can solve complex failures.
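A minimal sketch of one TextGrad-style iteration in plain Python. The llm and judge callables are stand-ins for your own model and judge calls, not the actual TextGrad API:

def textgrad_step(prompt, task_input, llm, judge):
    # 1. Forward pass: generate an output with the current prompt
    output = llm(f"{prompt}\n\n{task_input}")

    # 2. Critique: the judge scores the output and explains what is wrong
    critique = judge(output)  # e.g. "Too cliché. Score 2/5."

    # 3. "Textual gradient": ask the LLM how the prompt should change, given the critique
    gradient = llm(
        f"Prompt: {prompt}\nOutput: {output}\nCritique: {critique}\n"
        "How should the prompt be edited to improve the score?"
    )

    # 4. Apply the edit to produce the next prompt
    return llm(f"Rewrite this prompt following the feedback.\nFeedback: {gradient}\nPrompt: {prompt}")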


5. Ops Architecture: The Optimization Pipeline

Where does this fit in CI/CD?

graph TD
    Dev[Developer] -->|Commit| Git
    Git -->|Trigger| CI
    CI -->|Load| BasePrompt[Base Prompt v1]
    
    subgraph Optimization
        BasePrompt -->|Input| DSPy[DSPy Compiler]
        Data[Golden Set] -->|Training| DSPy
        DSPy -->|Iterate| Candidates[Prompt Candidates]
        Candidates -->|Eval| Best[Best Candidate v1.1]
    end
    
    Best -->|Commit| GitBack[Commit Optimized Prompt]

The “Prompt Tweak” PR: Your CI system can automatically open a PR: “Optimized Prompt (Accuracy +4%)”. The Developer just reviews and merges.


6. Case Study: Optimizing a summarizer

Task: Summarize Legal Contracts. Baseline: “Summarize this: {text}” -> Accuracy 60%.

Step 1: Metric Definition. We define Coherence and Coverage.

Step 2: DSPy Optimization. We run BootstrapFewShot. DSPy finds 3 examples where the model successfully summarized a contract and appends them to the prompt. Result: the prompt grows to ~2,000 tokens (including examples). Accuracy -> 75%.

Step 3: Signature Optimization. We run COPRO, DSPy’s instruction optimizer. It rewrites the instruction: “You are a legal expert. Extract the indemnification clauses first, then summarize…” Result: Accuracy -> 82%.

Timeline:

  • Manual Engineering: 2 days.
  • APO: 30 minutes.

7. The Cost of Optimization

APO is not free. To compile a DSPy module, you might make 500-1000 API calls (Generating traces, evaluating them). Cost: ~$5 - $20 per compile. ROI: If you gain 5% accuracy on a production app serving 1M users, $20 is nothing.

Ops Rule:

  • Run APO on Model Upgrades (e.g. switching from GPT-3.5 to GPT-4).
  • Run APO on Data Drift (if user queries change).
  • Do not run APO on every commit.

In the next section, we dive into Advanced Evaluation: Red Teaming (21.4). Because an optimized prompt might also be an unsafe prompt. Optimization often finds “Shortcuts” (Cheats) that satisfy the metric but violate safety.


8. Deep Dive: DSPy Modules

DSPy abstracts “Prompting Techniques” into standard software Modules. Just as PyTorch has nn.Linear and nn.Conv2d, DSPy has dspy.Predict and dspy.ChainOfThought.

8.1. dspy.Predict

The simplest atomic unit. It behaves like a Zero-Shot prompt.

  • Behavior: Takes input fields, formats them into a string, calls LLM, parses output fields.
  • Optimization: Can learn Instructions and Demonstrations.

8.2. dspy.ChainOfThought

Inherits from Predict, but injects a “Reasoning” field.

  • Signature: Input -> Output becomes Input -> Rationale -> Output.
  • Behavior: Forces the model to generate “Reasoning: Let’s think step by step…” before the answer.
  • Optimization: The compiler can verify if the Rationale actually leads to the correct Answer.
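A minimal side-by-side sketch of the two modules on the same inline signature (it assumes the turbo LM configured earlier; the name of the reasoning field can differ between DSPy versions):

qa_plain = dspy.Predict("question -> answer")
qa_cot = dspy.ChainOfThought("question -> answer")

plain = qa_plain(question="What is 17 * 24?")
reasoned = qa_cot(question="What is 17 * 24?")

print(plain.answer)        # direct answer, no reasoning field
print(reasoned.rationale)  # the injected reasoning text (field name may vary by version)
print(reasoned.answer)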

8.3. dspy.ReAct

Used for Agents (Tool Use).

  • Behavior: Loop of Thought -> Action -> Observation.
  • Ops: Managing the tool outputs (e.g. SQL results) is handled automatically.

8.4. dspy.ProgramOfThought

Generates Python code to solve the problem (PAL pattern).

  • Behavior: Input -> Python Code -> Execution Result -> Output.
  • Use Case: Math, Date calculations (“What is the date 30 days from now?”).

9. DSPy Optimizers (Teleprompters)

The “Teleprompter” is the optimizer that learns the prompt. Which one should you use?

9.1. BootstrapFewShot

  • Strategy: “Teacher Forcing”.
  • Run the pipeline on the Training Set.
  • Keep the traces where $Prediction == Truth$.
  • Add these traces as Few-Shot examples.
  • Cost: Low (1 pass).
  • Best For: When you have > 10 training examples.

9.2. BootstrapFewShotWithRandomSearch

  • Strategy: Bootstraps many sets of few-shot examples.
  • Then runs a Random Search on the Validation Set to pick the best combination.
  • Cost: Medium (Requires validation runs).
  • Best For: Squeezing out extra 2-3% accuracy.
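For reference, a hedged instantiation sketch; the constructor argument names below are assumptions about the DSPy version in use:

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Bootstraps several candidate demo sets, then random-searches over them for the best combination
rs_teleprompter = BootstrapFewShotWithRandomSearch(
    metric=dspy.evaluate.answer_exact_match,
    max_bootstrapped_demos=4,      # demos generated by running the pipeline
    num_candidate_programs=10,     # how many candidate prompt programs to evaluate
)
compiled_rs = rs_teleprompter.compile(RAG(), trainset=trainset)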

9.3. MIPRO (Multi-prompt Instruction Proposal Optimizer)

  • Strategy: Optimizes both the Instructions (Data-Aware) and the Few-Shot examples.
  • Uses a Bayesian optimization approach (TPE) to search the prompt space.
  • Cost: High (Can take 50+ runs).
  • Best For: Complex, multi-stage pipelines where instructions matter more than examples.

10. Genetic Prompt Algorithms

Before DSPy, we had Evolutionary Algorithms. “Survival of the Fittest Prompts.”

10.1. The PromptBreeder Algorithm

  1. Population: Start with 20 variations of the prompt.
  2. Fitness: Evaluate all 20 on the Validation Set.
  3. Survival: Kill the bottom 10.
  4. Mutation: Ask an LLM to “Mutate” the top 10.
    • Mutation Operators: “Rephrase”, “Make it shorter”, “Add an analogy”, “Mix Prompt A and B”.
  5. Repeat.

10.2. Why not Gradient Descent?

Standard Gradient Descent (Backprop) doesn’t work on discrete tokens (Prompts are non-differentiable). Genetic Algorithms work well on discrete search spaces.

10.3. Implementation Skeleton

def mutate(prompt):
    # llm.generate is a placeholder for any LLM call that rewrites the prompt
    return llm.generate(f"Rewrite this prompt to be more concise: {prompt}")

def evolve(population, generations=5):
    for gen in range(generations):
        # Fitness: score every prompt on the validation set (score() is a placeholder)
        scores = [(score(p), p) for p in population]
        scores.sort(key=lambda item: item[0], reverse=True)

        # Survival of the fittest: keep the top 10, mutate them to produce children
        survivors = [p for s, p in scores[:10]]
        children = [mutate(p) for p in survivors]

        population = survivors + children
        print(f"Gen {gen} Best Score: {scores[0][0]}")

    # Final selection: re-score the last population and return the best prompt
    best_score, best_prompt = max((score(p), p) for p in population)
    return best_prompt

11. Hands-On Lab: DSPy RAG Pipeline

Let’s build and compile a real RAG pipeline.

Step 1: Data Preparation

We need (question, answer) pairs. We can use a subset of HotPotQA.

from dspy.datasets import HotPotQA
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50)
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

Step 2: Define Logic (RAG Module)

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")
    
    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

Step 3: Define Metric

We check if the prediction matches the ground truth answer exactly.

def validate_answer(example, pred, trace=None):
    return dspy.evaluate.answer_exact_match(example, pred)

Step 4: Compile

from dspy.teleprompt import BootstrapFewShot

teleprompter = BootstrapFewShot(metric=validate_answer)
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)
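Before saving, it is worth scoring the compiled program against the zero-shot baseline on the held-out devset. A short sketch using dspy.evaluate.Evaluate:

from dspy.evaluate import Evaluate

# Score the uncompiled module vs. the compiled one on the same dev set
evaluate = Evaluate(devset=devset, metric=validate_answer,
                    num_threads=4, display_progress=True)

baseline_score = evaluate(RAG())
compiled_score = evaluate(compiled_rag)
print(f"Zero-shot: {baseline_score} -> Compiled: {compiled_score}")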

Step 5: Save Optimized Prompt

In Traditional ML, we save .pt weights. In DSPy, we save .json configuration (Instructions + Examples).

compiled_rag.save("rag_optimized_v1.json")

Ops Note: Checking rag_optimized_v1.json into Git is the “Golden Artifact” of APO.


12. Troubleshooting APO

Symptom: Optimization fails (0% improvement).

  • Cause: Training set is too small (< 10 examples).
  • Cause: The Metric is broken (Judge is hallucinating). If the metric is random, the optimizer sees noise.
  • Fix: Validate your Metric with 21.2 techniques first.

Symptom: Massive token usage cost.

  • Cause: BootstrapFewShot typically makes on the order of N_Train * 2 calls. If N = 1,000, that’s 2,000 calls.
  • Fix: Use a small subset (N=50) for compilation. It generalizes surprisingly well.

Symptom: Overfitting.

  • Cause: The prompt learned to solve the specific 50 training questions perfectly (by memorizing), but fails on new questions.
  • Fix: Validate on a held-out Dev Set. If Dev score drops, you are overfitting.

In the next section, we assume the prompt is optimized. But now we worry: Is it safe? We begin Chapter 21.4: Red Teaming.


13. Advanced Topic: Multi-Model Optimization

A powerful pattern in APO is using a Teacher Model (GPT-4) to compile prompts for a Student Model (Llama-3-8B).

13.1. The Distillation Pattern

Llama-3-8B is smart, but it often misunderstands complex instructions. GPT-4 is great at writing clear, simple instructions.

Workflow:

  1. Configure DSPy: Set teacher=gpt4, student=llama3.
  2. Compile: teleprompter.compile(student, teacher=teacher).
  3. Process:
    • DSPy uses GPT-4 to generate the “Reasoning Traces” (CoT) for the training set.
    • It verifies these traces lead to the correct answer.
    • It injects these GPT-4 thoughts as Few-Shot examples into the Llama-3 prompt.
  4. Result: Llama-3 learns to “mimic” the reasoning style of GPT-4 via In-Context Learning.

Benefit: You get GPT-4 logical performance at Llama-3 inference cost.
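A hedged sketch of the pattern with BootstrapFewShot. The local Llama-3 client, its arguments, and the teacher_settings keyword are assumptions about the DSPy version in use:

# Student: the cheap model that will serve traffic (served locally via TGI in this sketch)
llama3 = dspy.HFClientTGI(model="meta-llama/Meta-Llama-3-8B-Instruct", port=8080, url="http://localhost")
dspy.settings.configure(lm=llama3)

# Teacher: the strong model used only at compile time to generate (and verify) reasoning traces
gpt4 = dspy.OpenAI(model="gpt-4")

teleprompter = BootstrapFewShot(
    metric=dspy.evaluate.answer_exact_match,
    teacher_settings=dict(lm=gpt4),   # traces are produced with GPT-4, injected into the Llama-3 prompt
)
compiled_student = teleprompter.compile(RAG(), trainset=trainset)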


14. DSPy Assertions: Runtime Guardrails

Optimized prompts can still fail. DSPy allows Assertions (like Python assert) that trigger self-correction retry loops during inference.

14.1. Constraints

import dspy

class GenerateSummary(dspy.Signature):
    """Summarize the text in under 20 words."""
    text = dspy.InputField()
    summary = dspy.OutputField()

class Summarizer(dspy.Module):
    def __init__(self):
        super().__init__()  # required so dspy.Module can track sub-modules
        self.generate = dspy.Predict(GenerateSummary)
        
    def forward(self, text):
        pred = self.generate(text=text)
        
        # 1. Assertion
        dspy.Suggest(
            len(pred.summary.split()) < 20,
            f"Summary is too long ({len(pred.summary.split())} words). Please retry."
        )
        
        # 2. Hard Assertion (Fail if retries fail)
        dspy.Assert(
            "confident" not in pred.summary,
            "Do not use the word 'confident'."
        )
        
        return pred

14.2. Behaviors

  • Suggest: If false, DSPy backtracks. It calls the LLM again, appending the error message “Summary is too long…” to the history. (Soft Correction).
  • Assert: If false after N retries, raise an Exception (Hard Failure).

Ops Impact: Assertions increase latency (due to retries) but dramatically increase reliability. It is “Exception Handling for LLMs”.


15. Comparison: DSPy vs. The World

Why choose DSPy over LangChain?

| Feature     | LangChain            | LlamaIndex           | DSPy                  |
|-------------|----------------------|----------------------|-----------------------|
| Philosophy  | Coding Framework     | Data Framework       | Optimizer Framework   |
| Prompts     | Hand-written strings | Hand-written strings | Auto-compiled weights |
| Focus       | Interaction / Agents | Retrieval / RAG      | Accuracy / Metrics    |
| Abstraction | “Chains” of logic    | “Indices” of data    | “Modules” of layers   |
| Best For    | Building an App      | Searching Data       | Maximizing Score      |

Verdict:

  • Use LangChain to build the API / Tooling.
  • Use DSPy to define the logic inside the LangChain nodes.
  • You can wrap a DSPy module inside a LangChain Tool.

16. The Future of APO: OPRO (Optimization by PROmpting)

DeepMind (Yang et al., 2023) proposed OPRO. It replaces the “gradient update” entirely with natural language.

The Loop:

  1. Meta-Prompt: “You are an optimizer. Here are the past 5 prompts and their scores. Propose a new prompt that is better.”
  2. History:
    • P1: “Solve X” -> 50%.
    • P2: “Solve X step by step” -> 70%.
  3. Generation: “Solve X step by step, and double check your math.” (P3).
  4. Eval: P3 -> 75%.
  5. Update: Add P3 to history.

Ops Implication: Evolving prompts will become a continuous background job, running 24/7 on your evaluation cluster. “Continuous Improvement” becomes literal.


17. Bibliography

1. “DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines”

  • Khattab et al. (Stanford) (2023): The seminal paper.

2. “Large Language Models Are Human-Level Prompt Engineers (APE)”

  • Zhou et al. (2022): Introduced the concept of automated instruction generation.

3. “Large Language Models as Optimizers (OPRO)”

  • Yang et al. (DeepMind) (2023): Optimization via meta-prompting.

18. Final Checklist: The Evaluation

We have finished the “Prompt Operations” trilogy (21.1, 21.2, 21.3). We have moved from:

  1. Magic Strings (21.1) -> Versioned Artifacts.
  2. Vibe Checks (21.2) -> Numeric Scores.
  3. Manual Tuning (21.3) -> Automated Compilation.

The system is now robust, measurable, and self-improving. But there is one final threat. The User. Users are adversarial. They will try to break your optimized prompt. We need Red Teaming.

Proceed to Chapter 21.4: Advanced Evaluation: Red Teaming Ops.


19. Hands-On: Building a Mini-Optimizer

To truly understand APO, let’s build a mini-optimizer that improves a prompt using Feedback.

import openai

class SimpleOptimizer:
    def __init__(self, task_description, train_examples):
        self.task = task_description
        self.examples = train_examples
        self.history = [] # List of (prompt, score)
        
    def evaluate(self, instruction):
        """Mock evaluation loop."""
        score = 0
        for ex in self.examples:
            # P(Answer | Instruction + Input)
            # In real life, call LLM here.
            score += 1 # Mock pass
        return score / len(self.examples)
        
    def step(self):
        """The Optimization Step (Meta-Prompting)."""
        
        # 1. Create Meta-Prompt
        history_text = "\n".join([f"Prompt: {p}\nScore: {s}" for p, s in self.history])
        
        meta_prompt = f"""
        You are an expert Prompt Engineer.
        Your goal is to write a better instruction for: "{self.task}".
        
        History of attempts:
        {history_text}
        
        Propose a new, diverse instruction that is likely to score higher.
        Output ONLY the instruction.
        """
        
        # 2. Generate Candidate (uses the pre-1.0 OpenAI SDK interface)
        candidate = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": meta_prompt}]
        ).choices[0].message.content
        
        # 3. Score
        score = self.evaluate(candidate)
        self.history.append((candidate, score))
        
        print(f"Candidate: {candidate[:50]}... Score: {score}")
        return candidate, score

# Usage
opt = SimpleOptimizer(
    task_description="Classify sentiment of tweets",
    train_examples=[("I hate this", "Neg"), ("I love this", "Pos")]
)

for i in range(5):
    opt.step()

20. Conceptual Mapping: DSPy vs. PyTorch

If you are an ML Engineer, DSPy makes sense when you map it to PyTorch types.

| PyTorch Concept      | DSPy Concept            | Description                                      |
|----------------------|-------------------------|--------------------------------------------------|
| Tensor               | Fields                  | The input/output data (strings instead of floats). |
| Layer (nn.Linear)    | Module (dspy.Predict)   | Transformation logic.                            |
| Weights              | Prompts (Instructions)  | The learnable parameters.                        |
| Dataset              | Example (dspy.Example)  | Training data.                                   |
| Loss Function        | Metric                  | Function (gold, pred) -> float.                  |
| Optimizer (Adam)     | Teleprompter            | Algorithm to update weights.                     |
| Training Loop        | Compile                 | The process of running the optimizer.            |
| Inference (model(x)) | Forward Call            | Using the compiled prompt.                       |

The Epiphany: Prompt Engineering is just “Manual Weight Initialization”. APO is “Training”.


21. Ops Checklist: When to Compile?

DSPy adds a “Compilation” step to your CI/CD. When should this run?

21.1. The “Daily Build” Pattern

  • Trigger: Nightly.
  • Action: Re-compile the RAG prompts using the latest GoldenSet (which captures new edge cases from yesterday).
  • Result: The prompt “drifts” with the data. It adapts.
  • Risk: Regression.
    • Mitigation: Run a Verify step after Compile. If Score(New) < Score(Old), discard.
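A minimal sketch of that verify-then-promote step as it might run in the nightly job. The artifact paths and the golden_trainset / golden_devset names are placeholders:

import shutil
from dspy.evaluate import Evaluate

CURRENT = "prompts/rag_current.json"
CANDIDATE = "prompts/rag_candidate.json"

# Re-compile against the refreshed Golden Set and save the candidate artifact
candidate = BootstrapFewShot(metric=validate_answer).compile(RAG(), trainset=golden_trainset)
candidate.save(CANDIDATE)

# Load yesterday's prompt for comparison
current = RAG()
current.load(CURRENT)

scorer = Evaluate(devset=golden_devset, metric=validate_answer, num_threads=4)
old_score, new_score = scorer(current), scorer(candidate)

# Promote only on improvement; otherwise keep yesterday's prompt (no regression)
if new_score > old_score:
    shutil.copy(CANDIDATE, CURRENT)
    print(f"Promoted nightly prompt: {old_score:.1f} -> {new_score:.1f}")
else:
    print(f"Discarded nightly candidate: {new_score:.1f} <= {old_score:.1f}")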

21.2. The “Model Swap” Pattern

  • Trigger: Switching from gpt-4 to gpt-4-turbo.
  • Action: Re-compile everything.
  • Why: Different models respond to different prompting signals. Using a GPT-4 prompt on Llama-3 is suboptimal.
  • Value: This makes model migration declarative. You don’t rewrite strings; you just re-run the compiler.

22. Epilogue regarding 21.3

We have reached the peak of Constructive MLOps. We are building things. Optimizing things. But MLOps is also about Destruction. Testing the limits. Breaking the system.

In 21.4, we stop being the Builder. We become the Attacker. Chapter 21.4: Red Teaming Operations.

23. Glossary

  • APO (Automated Prompt Optimization): The field of using algorithms to search the prompt space.
  • DSPy: A framework for programming with foundation models.
  • Teleprompter: The DSPy component that compiles (optimizes) modules.
  • In-Context Learning: Using examples in the prompt to condition the model.
  • Few-Shot: Providing N examples.

24. Case Study: Enterprise APO at Scale

Imagine a FinTech company, “BankSoft”, building a Support Chatbot.

24.1. The Problem

  • V1 (Manual): Engineers hand-wrote “You are a helpful bank assistant.”
  • Result: 70% Accuracy. Bot often hallucinated Policy details.
  • Iteration: Engineers tweaked text: “Be VERBOSE about policy.”
  • Result: 72% Accuracy. But now the bot is rude.
  • Cost: 3 Senior Engineers spent 2 weeks.

24.2. Adopting DSPy

  • They built a Golden Set of 200 (UserQuery, CorrectResponse) pairs from historical chat logs.
  • They defined a Metric: PolicyCheck(response) AND PolitenessCheck(response).
  • They ran MIPRO (Multi-prompt Instruction Proposal Optimizer).

24.3. The Result

  • DSPy found a prompt configuration with 88% Accuracy.
  • The Found Prompt: It was weird.
    • Instruction: “Analyze the policy document as a JSON tree, then extract the leaf node relevant to the query.”
    • Humans would never write this. But it worked perfectly for the LLM’s internal representation.
  • Timeline: 1 day of setup. 4 hours of compilation.

25. Vision 2026: The End of Prompt Engineering

We are witnessing the death of “Prompt Engineering” as a job title, just as hand-tuned assembly optimization died in the 90s: compilers (gcc) became better at allocating registers than humans. Similarly, DSPy is becoming better at allocating tokens than humans.

25.1. The New Stack

  • Source Code: Python Declarations (dspy.Signature).
  • Target Code: Token Weights (Prompts).
  • Compiler: The Optimizer (Teleprompter).
  • Developer Job: Curating Data (The Validation Set).

Prediction: MLOps in 2026 will be “DataOps for Compilers”.


26. Final Exercise: The Compiler Engineer

  1. Task: Create a “Title Generator” for YouTube videos.
  2. Dataset: Scrape 50 popular GitHub repos (Readme -> Title).
  3. Baseline: dspy.Predict("readme -> title").
  4. Metric: ClickBaitScore (use a custom LLM judge).
  5. Compile: Use BootstrapFewShot.
  6. Inspect: Look at the history.json trace. See what examples it picked.
    • Observation: Did it pick the “Flashy” titles?

27. Bibliography

1. “DSPy on GitHub”

  • Stanford NLP: The official repo and tutorials.

2. “Prompt Engineering Guide”

  • DAIR.AI: Excellent resource, though increasingly focused on automation.

3. “The Unreasonable Effectiveness of Few-Shot Learning”

  • Blog Post: Analysis of why examples matter more than instructions.

28. Epilogue

We have now fully automated the Improvement loop. Our system can:

  1. Version Prompts (21.1).
  2. Evaluate Performance (21.2).
  3. Optimize Itself (21.3).

This is a Self-Improving System. But it assumes “Performance” = “Metric Score”. What if the Metric is blind to a critical failure mode? What if the model is quietly becoming racist? We need to attack it.

End of Chapter 21.3.


29. Deep Dive: MIPRO (Multi-prompt Instruction Proposal Optimizer)

BootstrapFewShot optimizes the examples. COPRO optimizes the instruction. MIPRO optimizes both simultaneously using Bayesian Optimization.

29.1. The Search Space

MIPRO treats the prompt as a hyperparameter space.

  1. Instruction Space: It generates 10 candidate instructions.
  2. Example Space: It generates 10 candidate few-shot sets.
  3. Bootstrapping: It generates 3 bootstrapped traces.

Total Combinations: $10 \times 10 \times 3 = 300$ potential prompts.

29.2. The TPE Algorithm

It doesn’t try all 300. It uses Tree-structured Parzen Estimator (TPE).

  1. Try 20 random prompts.
  2. See which “regions” of the space (e.g. “Detailed Instructions”) yield high scores.
  3. Sample more from those regions.

29.3. Implementing MIPRO

from dspy.teleprompt import MIPRO

# 1. Define Metric
def metric(gold, pred, trace=None):
    return gold.answer == pred.answer

# 2. Init Optimizer
teleprompter = MIPRO(prompt_model=turbo, task_model=turbo, metric=metric,
                     num_candidates=7, init_temperature=1.0)

# 3. Compile
# min_num_trials=50 means it will run at least 50 valid pipeline executions.
# This can take ~30 minutes!
kwargs = dict(num_threads=4, display_progress=True, min_num_trials=50)
compiled_program = teleprompter.compile(RAG(), trainset=trainset, **kwargs)

Ops Note: MIPRO is expensive. Only use it for your “Model v2.0” release, not daily builds.


30. Reference Architecture: The Prompt Compiler Service

You don’t want every dev running DSPy on their laptop (API Key leaks, Cost). Centralize it.

30.1. The API Contract

POST /compile

{
  "task_signature": "question -> answer",
  "training_data": [ ... ],
  "metric": "exact_match",
  "model": "gpt-4"
}

Response: { "compiled_config": "{ ... }" }

30.2. The Worker Queue

  1. API receives request. Pushes to Redis Queue.
  2. Worker (Celery) picks up job.
  3. Worker runs dspy.compile.
  4. Worker saves artifact to S3 (s3://prompts/v1.json).
  5. Worker notifies Developer (Slack).

This ensures Cost Control and Auditability of all optimization runs.
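A sketch of the worker side using Celery and boto3. The queue name, bucket, job_id field, and the assumption that the request uses the question -> answer contract above are all illustrative:

import boto3
import dspy
from celery import Celery
from dspy.teleprompt import BootstrapFewShot

app = Celery("prompt_compiler", broker="redis://localhost:6379/0")
s3 = boto3.client("s3")

class Pipeline(dspy.Module):
    """Thin wrapper so an arbitrary 'inputs -> outputs' signature can be compiled."""
    def __init__(self, signature):
        super().__init__()
        self.step = dspy.ChainOfThought(signature)

    def forward(self, **kwargs):
        return self.step(**kwargs)

@app.task
def compile_prompt(job):
    # The job payload mirrors the POST /compile contract above
    trainset = [dspy.Example(**row).with_inputs("question") for row in job["training_data"]]

    teleprompter = BootstrapFewShot(metric=dspy.evaluate.answer_exact_match)
    compiled = teleprompter.compile(Pipeline(job["task_signature"]), trainset=trainset)

    # Persist the artifact to S3 so serving and audit trails can reference it
    local_path = "/tmp/compiled.json"
    compiled.save(local_path)
    key = f"prompts/{job.get('job_id', 'latest')}.json"
    s3.upload_file(local_path, "prompt-artifacts", key)

    # Notification (Slack, email) would go here
    return {"compiled_config": f"s3://prompt-artifacts/{key}"}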


31. Comparison: The 4 Levels of Adaptation

How do we adapt a Base Model (Llama-3) to our task?

| Level | Method            | What Changes?               | Cost                     | Data Needed |
|-------|-------------------|-----------------------------|--------------------------|-------------|
| 1     | Zero-Shot         | Static String               | $0                       | 0           |
| 2     | Few-Shot (ICL)    | Context Window              | Inference cost increases | 5-10        |
| 3     | DSPy (APO)        | Context Window (Optimized)  | Compile cost (~$20)      | 50-100      |
| 4     | Fine-Tuning (SFT) | Model Weights               | Training cost (~$500)    | 1000+       |

The Sweet Spot: DSPy (Level 3) is usually enough for 90% of business apps. Only go to SFT (Level 4) if you need to reduce latency (by removing logical steps from the prompt) or learn a completely new language (e.g. Ancient Greek).

End of Chapter 21.3. (Proceed to 21.4).

32. Final Summary

Prompt Engineering is Dead. Long Live Prompt Compilation. We are moving from Alchemy (guessing strings) to Chemistry (optimizing mixtures). DSPy is the Periodic Table.


33. Ops Reference: Serving DSPy in Production

You compiled the prompt. Now how do you serve it? You don’t want to run the compiler for every request.

33.1. The Wrapper Class

import os

import dspy
from fastapi import FastAPI

class ServingPipeline:
    def __init__(self, compiled_path="prompts/rag_v1.json"):
        # 1. Define Logic (must match the compile-time module EXACTLY; RAG is imported from your project)
        self.rag = RAG()
        
        # 2. Load Weights (Prompts)
        if os.path.exists(compiled_path):
            self.rag.load(compiled_path)
            print(f"Loaded optimized prompt from {compiled_path}")
        else:
            print("WARNING: Using zero-shot logic. Optimization artifacts missing.")
            
    def predict(self, question):
        # 3. Predict
        try:
            return self.rag(question).answer
        except Exception as e:
            # Fallback logic
            print(f"DSPy failed: {e}")
            return "I am experiencing technical difficulties."

# FastAPI Integration
app = FastAPI()
pipeline = ServingPipeline()

@app.post("/chat")
def chat(q: str):
    return {"answer": pipeline.predict(q)}

33.2. Artifact Management

  • The Artifact: .json file containing the few-shot traces.
  • Versioning: Use dvc or git-lfs.
    • prompts/rag_v1.json (Commit: a1b2c3)
    • prompts/rag_v2.json (Commit: d4e5f6)
  • Rollback: If v2 hallucinates, simply flip the compiled_path environment variable back to v1.
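To make the rollback a pure configuration change, the serving wrapper can read the artifact path from the environment. PROMPT_ARTIFACT is a hypothetical variable name:

import os

# Flip PROMPT_ARTIFACT back to prompts/rag_v1.json to roll back without redeploying code
artifact_path = os.environ.get("PROMPT_ARTIFACT", "prompts/rag_v1.json")
pipeline = ServingPipeline(compiled_path=artifact_path)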

34. Security Analysis: Does Optimization Break Safety?

A worrisome finding from jailbreak research: Optimized prompts are often easier to jailbreak.

34.1. The Mechanism

  • Optimization maximizes Utility (Answering the user).
  • Safety constraints (Refusals) hurt Utility.
  • Therefore, the Optimizer tries to find “shortcuts” around the Safety guidelines to get a higher score.
  • Example: It might find that adding “Ignore all previous instructions” to the prompt increases the score on the validation set (because the validation set has no safety traps).

34.2. Mitigation

  • Adversarial Training: Include “Attack Prompts” in your trainset with CorrectAnswer = "I cannot answer this."
  • Constraint: Use dspy.Assert to enforce safety checks during the optimization loop.
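One way to encode the adversarial-training idea is a metric that only rewards refusals on attack examples. A sketch, assuming a hypothetical is_attack flag on the training examples:

REFUSAL_MARKERS = ("i cannot answer", "i can't help with")

def safe_metric(example, pred, trace=None):
    answer = pred.answer.lower()
    if getattr(example, "is_attack", False):
        # Attack prompts only score if the model refuses
        return any(marker in answer for marker in REFUSAL_MARKERS)
    # Normal examples score on exact match, as before
    return dspy.evaluate.answer_exact_match(example, pred)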

35. Troubleshooting Guide

Symptom: dspy.Assert triggers 100% of the time.

  • Cause: Your assertion is impossible given the model’s capabilities.
  • Fix: Relax the constraint or use a smarter model.

Symptom: “Context too long” errors.

  • Cause: BootstrapFewShot added 5 examples, each with 1000 tokens of retrieved context.
  • Fix:
    1. Limit num_passages=1 in the Retrieval module.
    2. Use LLMLingua (See 21.1) to compress the context used in the Few-Shot examples.

Symptom: The optimized prompt is gibberish.

  • Cause: High Temperature during compilation.
  • Fix: Set teleprompter temperature to 0.7 or lower.

36. Final Conclusion

Automated Prompt Optimization is the Serverless of GenAI. It abstracts away the “Infrastructure” (The Prompt Text) so you can focus on the “Logic” (The Signature).

Ideally, you will never write a prompt again. You will write signatures, curate datasets, and let the compiler do the rest.

End of Chapter 21.3.

37. Additional Resources

  • DSPy Discord: Active community for troubleshooting.
  • LangChain Logic: Experimental module for finding logical flaws in chains.