22.3 Versioning Prompt Chains

“If it isn’t versioned, it doesn’t exist.”

In a single-prompt application, versioning is easy: v1.txt, v2.txt. In a Chain, versioning is a graph problem.

  • Chain C uses Prompt A (v1) and Prompt B (v1).
  • You update Prompt A to v2 to fix a bug.
  • Prompt B (v1) now breaks because it expects the output style of A (v1).

This is Dependency Hell. This chapter explains how to treat Prompts as Infrastructure as Code (IaC).


22.3.1. The Diamond Dependency Problem

Imagine a chain: Router -> Summarizer -> EmailWriter.

  1. Router v1: Classifies inputs as “Urgent” or “Normal”.
  2. Summarizer v1: Summarizes “Urgent” emails.
  3. EmailWriter v1: Writes a reply based on the summary.

The Breaking Change: You update the Router to v2, which outputs “P1” instead of “Urgent” (to save tokens).

  • Summarizer v1 ignores “P1” because it looks for “Urgent”.
  • The system silently fails.

The Solution: You must version the Chain Manifest, not just individual prompts. Chain v1.2 = { Router: v2.0, Summarizer: v2.0, EmailWriter: v1.0 }.


22.3.2. Strategy 1: Git-Based Versioning (Static)

Treat prompts like Python code. Store them in the repo.

Directory Structure

/prompts
  /router
    v1.yaml
    v2.yaml
    latest.yaml -> v2.yaml
  /summarizer
    v1.jinja2
/chains
  support_flow_v1.yaml

The Manifest File

# support_flow_v1.yaml
metadata:
  name: "support_flow"
  version: "1.0.0"

nodes:
  - id: "router"
    prompt_path: "prompts/router/v1.yaml"
    model: "gpt-3.5-turbo"
    
  - id: "summarizer"
    prompt_path: "prompts/summarizer/v1.jinja2"
    model: "claude-haiku"

edges:
  - from: "router"
    to: "summarizer"

Pros:

  • Code Review (PRs).
  • Atomic Rollbacks (Revert commit).
  • CI/CD Integration (pytest can read files).

Cons:

  • Requires deployment to change a prompt.
  • Non-technical stakeholders (PMs) can’t edit prompts easily.

22.3.3. Strategy 2: Database Versioning (Dynamic)

Treat prompts like Data (CMS). Store them in Postgres/DynamoDB.

The Schema

CREATE TABLE prompts (
    id UUID PRIMARY KEY,
    name VARCHAR(255),
    version INT,
    template TEXT,
    input_variables JSONB,
    model_config JSONB,
    created_at TIMESTAMP,
    author VARCHAR
);

CREATE TABLE chains (
    id UUID PRIMARY KEY,
    name VARCHAR,
    version INT,
    config JSONB -- { "step1": "prompt_id_A", "step2": "prompt_id_B" }
);

The API (Prompt Registry)

  • GET /prompts/router?tag=prod -> Returns v5.
  • POST /prompts/router -> Creates v6.
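A minimal sketch of such a registry API, here with FastAPI and an in-memory store standing in for the database (the endpoint shapes and the tag-resolution rule are illustrative, not a fixed spec):

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class PromptIn(BaseModel):
    template: str
    input_variables: list[str]

# In-memory stand-in for the Postgres tables described above
PROMPTS: dict[str, list[dict]] = {}

@app.get("/prompts/{name}")
def get_prompt(name: str, tag: str = "prod"):
    versions = PROMPTS.get(name)
    if not versions:
        raise HTTPException(status_code=404, detail="Unknown prompt")
    # In this toy example the "prod" tag simply resolves to the newest version
    return versions[-1]

@app.post("/prompts/{name}")
def create_prompt(name: str, body: PromptIn):
    versions = PROMPTS.setdefault(name, [])
    record = {
        "name": name,
        "version": len(versions) + 1,
        "template": body.template,
        "input_variables": body.input_variables,
    }
    versions.append(record)
    return record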

Pros:

  • Hot-swapping (No deploy needed).
  • A/B Testing (Route 50% traffic to v5, 50% to v6).
  • UI-friendly (Prompt Studio).

Cons:

  • “Desync” between Code and Prompts. (Code expects variable_x, Prompt v6 removed it).
  • Harder to test locally.
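One way to blunt the desync problem is a load-time contract check: on startup, compare the variables the code intends to pass against the input_variables stored with the prompt version. A hedged sketch (fetch_prompt and the variable names are placeholders):

def assert_prompt_contract(prompt_record, expected_vars):
    """Fail fast if the fetched prompt version and the calling code disagree
    on input variables (the desync failure mode described above)."""
    declared = set(prompt_record["input_variables"])
    passed_but_unused = expected_vars - declared     # code passes vars the prompt dropped
    required_but_missing = declared - expected_vars  # prompt expects vars the code never sends
    if passed_but_unused or required_but_missing:
        raise ValueError(
            f"Prompt contract mismatch: unused={passed_but_unused}, missing={required_but_missing}"
        )

# At startup, before serving traffic (fetch_prompt is a placeholder registry call):
prompt = fetch_prompt("router", tag="prod")
assert_prompt_contract(prompt, expected_vars={"ticket_text", "language"})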

22.3.4. CI/CD for Chains: The Integration Test

Unit testing a single prompt is meaningless in a chain. You must test the End-to-End Flow.

The Trace-Based Test

Use LangSmith or Weights & Biases to capture traces.

def test_support_chain_e2e():
    # 1. Setup
    chain = load_chain("support_flow_v1")
    input_text = "I need a refund for my broken phone."
    
    # 2. Execute
    result = chain.run(input_text)
    
    # 3. Assert Final Output (Semantic)
    assert "refund" in result.lower()
    assert "sorry" in result.lower()
    
    # 4. Assert Intermediate Steps (Structural)
    trace = get_last_trace()
    router_out = trace.steps['router'].output
    assert router_out == "P1" # Ensuring Router v2 behavior

Critical: Run this on Every Commit. Prompts are code. If you break the chain, the build should fail.


22.3.5. Feature Flags & Canary Deployments

Never roll out a new Chain v2 to 100% of users. Use a Feature Flag.

def get_chain(user_id):
    if launch_darkly.variation("new_support_chain", user_id):
        return load_chain("v2")
    else:
        return load_chain("v1")

Key Metrics to Monitor:

  1. Format Error Rate: Did v2 start producing invalid JSON?
  2. Latency: Did v2 add 3 seconds?
  3. User Sentiment: Did CSAT drop?
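These metrics can gate the rollout automatically. A hedged sketch of such a guardrail, with made-up thresholds and hypothetical metrics_store / feature_flags helpers:

# Hypothetical guardrail: compare the canary cohort (v2) against control (v1)
# on the metrics above and kill the rollout if v2 regresses.
ROLLBACK_THRESHOLDS = {
    "format_error_rate": 0.02,   # max allowed absolute increase
    "p95_latency_ms": 1000,      # max allowed absolute increase
}

def should_rollback(control: dict, canary: dict) -> bool:
    for metric, allowed_delta in ROLLBACK_THRESHOLDS.items():
        if canary[metric] - control[metric] > allowed_delta:
            return True
    return False

if should_rollback(metrics_store.get("chain_v1"), metrics_store.get("chain_v2")):
    feature_flags.set("new_support_chain", enabled=False)   # instant rollback, no deploy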

22.3.7. Deep Dive: LLM-as-a-Judge (Automated Evaluation)

Using assert on raw text is brittle: assert "refund" in text fails if the model says “money back”. We need Semantic Assertion: use a strong model (GPT-4) to grade the weak model (Haiku).

The Judge Prompt

JUDGE_PROMPT = """
You are an impartial judge.
Input: {input_text}
Actual Output: {actual_output}
Expected Criteria: {criteria}

Rate the Output on a scale of 1-5.
Reasoning: [Your thoughts]
Score: [Int]
"""

The Pytest Fixture

@pytest.fixture
def judge():
    return ChatOpenAI(model="gpt-4")

def test_sentiment_classification(judge):
    # 1. Run System Under Test
    chain = load_chain("sentiment_v2")
    output = chain.run("I hate this product but love the color.")
    
    # 2. Run Judge
    eval_result = judge.invoke(JUDGE_PROMPT.format(
        input_text="...",
        actual_output=output,
        criteria="Must be labeled as Mixed Sentiment."
    ))
    
    # 3. Parse Score
    score = parse_score(eval_result)
    assert score >= 4, f"Quality regression! Score: {score}"
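The test above assumes a parse_score helper. A minimal sketch, relying on the “Score: [Int]” line format requested by the judge prompt:

import re

def parse_score(eval_result) -> int:
    """Pull the integer after 'Score:' out of the judge's reply."""
    text = eval_result.content if hasattr(eval_result, "content") else str(eval_result)
    match = re.search(r"Score:\s*\[?(\d+)\]?", text)
    if not match:
        raise ValueError(f"Judge output had no parsable score: {text!r}")
    return int(match.group(1))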

Cost Warning: Running GPT-4 on every commit is expensive. Optimization: Run the “Golden Set” (50 examples) only on Merge to Main; run only syntax tests on Feature Branches.


22.3.8. Implementation: The Prompt Registry (Postgres + Python)

If you chose the Database Strategy, here is the reference implementation.

The Database Layer (SQLAlchemy)

from sqlalchemy import Column, String, Integer, JSON, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class PromptVersion(Base):
    __tablename__ = 'prompt_versions'
    
    id = Column(String, primary_key=True)
    slug = Column(String, index=True) # e.g. "email_writer"
    version = Column(Integer)
    template = Column(String)
    variables = Column(JSON) # ["user_name", "topic"]
    model_config = Column(JSON) # {"temp": 0.7}
    
    def render(self, **kwargs):
        # Validation
        for var in self.variables:
            if var not in kwargs:
                raise ValueError(f"Missing var: {var}")
        return self.template.format(**kwargs)

The SDK Layer

from sqlalchemy.orm import sessionmaker

class PromptRegistry:
    def __init__(self, db_url):
        self.engine = create_engine(db_url)
        self.Session = sessionmaker(bind=self.engine)

    def get(self, slug, version="latest"):
        with self.Session() as session:
            query = session.query(PromptVersion).filter_by(slug=slug)
            if version == "latest":
                return query.order_by(PromptVersion.version.desc()).first()
            return query.filter_by(version=version).first()

Usage:

registry = PromptRegistry(DB_URL)
prompt = registry.get("email_writer", version=5)
llm_input = prompt.render(user_name="Alex")

This decouples the Content Cycle (Prompts) from the Code Cycle (Deployments).


22.3.9. CI/CD Pipeline Configuration (GitHub Actions)

How do we automate this?

# .github/workflows/prompt_tests.yml
name: Prompt Regression Tests

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'chains/**'
  push:
    branches: [main]
    paths:
      - 'prompts/**'
      - 'chains/**'

jobs:
  test_chains:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Install dependencies
        run: pip install -r requirements.txt
        
      - name: Run Syntax Tests (Fast)
        run: pytest tests/syntax/ --maxfail=1
        
      - name: Run Semantic Tests (Slow)
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: pytest tests/semantic/

Workflow:

  1. Dev opens PR with prompts/v2.yaml.
  2. CI runs tests/syntax (Checks JSON validity, missing variables). Cost: $0.
  3. Dev merges.
  4. CI runs tests/semantic (GPT-4 Judge). Cost: $5.
  5. If quality drops, alert Slack.

22.3.10. Anti-Pattern: The Semantic Drift

Scenario: You change the prompt from “Summarize this” to “Summarize this concisely”.

  • Chain v1 output length: 500 words.
  • Chain v2 output length: 50 words.

The Breakage: The downstream component (e.g., a PDF Generator) expects at least 200 words to fill the page layout. It breaks.

Lesson: Prompts have Implicit Contracts (Length, Tone, Format). Versioning must verify these contracts. Add a test: assert len(output) > 200.
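A hedged sketch of such a contract test (the loader and the 200-word floor mirror the example above; LONG_SAMPLE_ARTICLE and the markdown check are illustrative assumptions):

def test_summary_meets_pdf_contract():
    chain = load_chain("summarizer_v2")        # same loader as the earlier e2e test
    output = chain.run(LONG_SAMPLE_ARTICLE)    # fixture input, assumed defined elsewhere

    # Implicit contract of the downstream PDF generator:
    # at least 200 words of plain prose, no markdown headers.
    assert len(output.split()) >= 200, "Summary too short for the PDF layout"
    assert not output.lstrip().startswith("#"), "Markdown headers break the PDF template"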


22.3.11. Case Study: Evolution of a Sales Prompt (v1 to v10)

Understanding why prompts change helps us design the versioning system.

v1 (The Prototype): "Write a sales email for a screwdriver." Problem: Too generic. Hallucinated features.

v2 (The Template): "Write a sales email for a {product_name}. Features: {features}." Problem: The tone was too aggressive (“BUY NOW!”).

v5 (The Few-Shot): "Here are 3 successful emails... Now write one for {product_name}." Problem: Hit 4k token limit when product description was long.

v8 (The Chain): Split into BrainstormAgent -> DraftAgent -> ReviewAgent. Each prompt is smaller.

v10 (The Optimized): DraftAgent prompt is optimized via DSPy to reduce token count by 30% while maintaining conversion rate.

Key Takeaway: Prompt Engineering is Iterative. If you don’t version v5, you can’t roll back when v6 drops conversion by 10%.


22.3.12. Deep Dive: Building a Prompt Engineering IDE

Developers hate editing YAML files in VS Code. They want a playground. You can build a simple “Prompt IDE” using Streamlit.

import streamlit as st
from prompt_registry import registry        # the registry client from the previous section
from langchain.chains import LLMChain

st.title("Internal Prompt Studio")

# 1. Select Prompt
slug = st.selectbox("Prompt", ["sales_email", "support_reply"])
version = st.slider("Version", 1, 10, 10)

prompt_obj = registry.get(slug, version)
new_template = st.text_area("Template", value=prompt_obj.template)

# 2. Test Inputs
user_input = st.text_input("Test Input", "Screwdriver")

# 3. Run
if st.button("Generate"):
    chain = LLMChain(prompt=prompt_obj, llm=gpt4)   # gpt4: your configured LLM client
    output = chain.run(user_input)
    st.markdown(f"### Output\n{output}")

# 4. Save Logic (kept outside the Generate branch: nested buttons reset state on rerun)
if st.button("Save as New Version"):
    registry.create_version(slug, new_template)

This shortens the feedback loop from “Edit -> Commit -> CI -> Deploy” to “Edit -> Test -> Save”.


22.3.13. Technique: The Golden Dataset

You cannot evaluate a prompt without data. A Golden Dataset is a curated list of tough inputs and “Perfect” outputs.

Structure:

ID | Input                        | Expected Intent | Key Facts Required    | Difficulty
1  | “Refund please”              | refund          | -                     | Easy
2  | “My screwdriver broke”       | support         | warranty_policy       | Medium
3  | “Is this compatible with X?” | technical       | compatibility_matrix  | Hard

Ops Strategy:

  • Store this in a JSONL file in tests/data/golden.jsonl.
  • Versioning: The dataset must be versioned alongside the code.
  • Maintenance: When a user reports a bug, add that specific input to the Golden Set (Regression Test).
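A minimal regression harness over this file, assuming a classify_intent entry point into your chain and the JSONL field names shown in 22.3.25:

import json
import pytest

def load_golden(path="tests/data/golden.jsonl"):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

@pytest.mark.parametrize("case", load_golden(), ids=lambda c: c["id"])
def test_golden_intents(case):
    predicted = classify_intent(case["input"])   # your chain's router entry point (assumed)
    assert predicted == case["intent"], (
        f"Regression on golden case {case['id']}: expected {case['intent']}, got {predicted}"
    )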

22.3.14. Code Pattern: Evaluation with RAGAS

For RAG chains, “Correctness” is hard to measure. RAGAS (Retrieval Augmented Generation Assessment) offers metrics.

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

def test_rag_quality():
    dataset = load_dataset("golden_rag_set")
    
    results = evaluate(
        dataset,
        metrics=[
            faithfulness,       # Is the answer derived from context?
            answer_relevancy,   # Does it actually answer the query?
            context_precision,  # Did we retrieve the right chunk?
        ]
    )
    
    print(results)
    # {'faithfulness': 0.89, 'answer_relevancy': 0.92}
    
    assert results['faithfulness'] > 0.85

If faithfulness drops, your Prompt v2 likely started hallucinating. If context_precision drops, your Embedding Model/Chunking strategy is broken.


22.3.15. Anti-Pattern: Over-Optimization

The Trap: Spending 3 days tweaking the prompt to get 0.5% better scores on the Golden Set. “If I change ‘Please’ to ‘Kindly’, score goes up!”

The Reality: LLMs are stochastic. That 0.5% gain might be noise. Rule of Thumb: Only merge changes that show >5% improvement or fix a specific class of bugs. Don’t “overfit” your prompt to the test set.


22.3.16. Deep Dive: DSPy (Declarative Self-Improving Prompts)

The ultimate versioning strategy is not writing prompts at all. DSPy (Stanford) treats LMs as programmable modules.

Old Way (Manual Prompts): prompt = "Summarize {text}. Be professional." (v1 -> v2 -> v10 by hand).

DSPy Way:

import dspy

class Summarizer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("text -> summary")

    def forward(self, text):
        return self.generate(text=text)

# The Optimizer (Teleprompter)
teleprompter = dspy.teleprompt.BootstrapFewShot(metric=dspy.evaluate.answer_exact_match)
optimized_summarizer = teleprompter.compile(Summarizer(), trainset=train_data)

What just happened? DSPy automatically discovered the best few-shot examples and instructions to maximize the Metric. Versioning shift: You version the Data and the Metric, not the Prompt String.
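Under that shift, the unit you check in and roll back is the compiled artifact. A small sketch, assuming the save/load helpers exposed on DSPy modules in recent releases and a hypothetical artifacts/ path:

# Version the compiled artifact, not the prompt string.
optimized_summarizer.save("artifacts/summarizer_v3.json")

# Later, at serving time:
summarizer = Summarizer()
summarizer.load("artifacts/summarizer_v3.json")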


22.3.17. Case Study: LangSmith for Debugging

When a chain fails in Production, grepping logs is painful. LangSmith visualizes the chain as a Trace Tree.

The Scenario: User reports “The bot refused to answer.”

  1. Search: Filter traces by status=error or latency > 5s.
  2. Inspect: Click the trace run_id_123.
  3. Root Cause:
    • Router (Success) -> “Intent: Coding”
    • CodingAgent (Fail) -> “Error: Context Limit Exceeded”
  4. Fix:
    • The CodingAgent received a 50k token context from the Router.
    • Action: Add a Truncate step between Router and CodingAgent.

Ops Strategy: Every PR run in CI should generate a LangSmith Experiment URL. Reviewers check the URL before merging.


22.3.18. Anti-Pattern: The 4k Token Regression

The Bug: v1 prompt: 1000 tokens. Output: 500 tokens. Total: 1500. v2 prompt: Adds 30 examples. Size: 3800 tokens. Output: 300 tokens (Cut off!).

The Symptom: Users see half-finished JSON: {"answer": "The weather is

The Fix: Add a Token Budget Test.

import tiktoken

def count_tokens(text, model="gpt-3.5-turbo"):
    return len(tiktoken.encoding_for_model(model).encode(text))

def test_prompt_budget():
    prompt = load_prompt("v2")
    input_dummy = "x" * 1000
    total = count_tokens(prompt.format(input=input_dummy))
    assert total < 3500, f"Prompt is too fat! {total} tokens"

22.3.19. Code Pattern: The Semantic Router

Hardcoded if/else routing is brittle. Use Embedding-based Routing.

from semantic_router import Route, RouteLayer
from semantic_router.encoders import OpenAIEncoder

politics = Route(
    name="politics",
    utterances=[
        "who is the president?",
        "what is the election result?"
    ],
)

chitchat = Route(
    name="chitchat",
    utterances=[
        "how are you?",
        "what is the weather?"
    ],
)

encoder = OpenAIEncoder()   # any supported encoder works here
router = RouteLayer(encoder=encoder, routes=[politics, chitchat])

def get_next_step(query):
    route = router(query)
    if route.name == "politics":
        return politics_agent
    elif route.name == "chitchat":
        return chitchat_agent
    else:
        return default_agent

Versioning Implications: You must version the Utterances List. If you add “What is the capital?” to politics, you need to re-test that it didn’t break geography.
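A hedged sketch of that regression guard: pin the utterance lists with a content hash (so any edit is an explicit, reviewed change) and re-run a small routing golden set. EXPECTED_ROUTES_HASH and the sample queries are placeholders:

import hashlib
import json

def routes_fingerprint(routes) -> str:
    """Stable hash of every route's utterance list, so edits show up explicitly in review."""
    payload = {r.name: sorted(r.utterances) for r in routes}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def test_routing_regression():
    # Pin the utterances: changing them forces this constant (and a review) to change.
    assert routes_fingerprint([politics, chitchat]) == EXPECTED_ROUTES_HASH

    # Re-check behaviour on a tiny routing golden set.
    for query, expected in [("who won the election?", "politics"),
                            ("how is your day going?", "chitchat")]:
        assert router(query).name == expected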


22.3.20. Deep Dive: A/B Testing Statistics for Prompts

When you move from Prompt A to Prompt B, how do you know B is better? “It feels better” is not engineering.

The Math:

  • Metric: Conversion Rate (Did the user buy?).
  • Baseline (A): 5.0%.
  • New (B): 5.5%.
  • Uplift: 10%.

Sample Size Calculation: To detect a 0.5% absolute lift with 95% Confidence (alpha = 0.05) and 80% Power (beta = 0.20), a common rule of thumb is $$ n \approx 16 \frac{\sigma^2}{\delta^2} $$ where $\sigma^2 \approx p(1-p)$ is the per-user variance of the conversion metric and $\delta$ is the minimum detectable absolute lift. You need roughly 30,000 samples per variation.
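The same arithmetic as a quick script, using the numbers above:

# Back-of-the-envelope sample size for the A/B test above (rule-of-thumb formula).
baseline_rate = 0.05          # 5.0% conversion for Prompt A
min_detectable_lift = 0.005   # the 0.5% absolute lift we want to detect

variance = baseline_rate * (1 - baseline_rate)         # sigma^2 for a Bernoulli metric
n_per_variant = 16 * variance / min_detectable_lift**2

print(round(n_per_variant))   # ~30,400 users per variation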

Implication: For low-volume B2B bots, you will never reach statistical significance on small changes. Strategy: Focus on Big Swings (Change the entire strategy), not small tweaks.


22.3.21. Reference: The Chain Manifest Schema

How to define a versioned chain in code.

{
  "$schema": "http://mlops-book.com/schemas/chain-v1.json",
  "name": "customer_support_flow",
  "version": "2.1.0",
  "metadata": {
    "author": "alex@company.com",
    "created_at": "2023-11-01",
    "description": "Handles refunds and returns"
  },
  "nodes": [
    {
      "id": "classifier_node",
      "type": "llm",
      "config": {
        "model": "gpt-4-turbo",
        "temperature": 0.0,
        "prompt_uri": "s3://prompts/classifier/v3.json"
      },
      "retries": 3
    },
    {
      "id": "action_node",
      "type": "tool",
      "config": {
        "tool_name": "stripe_api",
        "timeout_ms": 5000
      }
    }
  ],
  "edges": [
    {
      "source": "classifier_node",
      "target": "action_node",
      "condition": "intent == 'refund'"
    }
  ],
  "tests": [
    {
      "input": "I want my money back",
      "expected_node_sequence": ["classifier_node", "action_node"]
    }
  ]
}

GitOps: Commit this file. The CD pipeline reads it and deploys the graph.


22.3.22. Code Pattern: The Feature Flag Wrapper

Don’t deploy v2. Enable v2.

import ldclient
from ldclient.context import Context

def get_router_prompt(user_id):
    # 1. Define Context
    context = Context.builder(user_id).kind("user").build()
    
    # 2. Evaluate Flag
    # Returns "v1" or "v2" based on % rollout
    version_key = ldclient.get().variation("prompt_router_version", context, "v1")
    
    # 3. Load Prompt
    return prompt_registry.get("router", version=version_key)

Canary Strategy:

  1. Target Internal Users (Employees) -> 100% v2.
  2. Target Free Tier -> 10% v2.
  3. Target Paid Tier -> 1% v2.
  4. Monitor Errors.
  5. Rollout to 100%.

22.3.23. Anti-Pattern: The One-Shot Release

Scenario: “I tested it on my machine. Ship it to Prod.” Result: The new prompt triggers a specific Safety Filter in Production (Azure OpenAI Content Filter) that wasn’t present in Dev. Impact: All 10,000 active users get “I cannot answer that” errors instantly.

The Fix: Shadow Mode. Run v2 in parallel with v1.

  • Return v1 to user.
  • Log v2 result to DB.
  • Analyze v2 offline. Only switch when v2 error rate < v1 error rate.
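A minimal sketch of this shadow-mode pattern (chain_v1, chain_v2, and shadow_log are placeholders for your chains and logging table):

import concurrent.futures
import logging

executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def handle_request(user_input: str) -> str:
    # Serve the user from the known-good chain.
    v1_result = chain_v1.run(user_input)

    # Fire-and-forget the candidate chain; it never blocks or affects the user.
    def shadow():
        try:
            v2_result = chain_v2.run(user_input)
            shadow_log.insert(input=user_input, v1=v1_result, v2=v2_result)  # offline analysis table
        except Exception:
            logging.exception("shadow run of chain_v2 failed")

    executor.submit(shadow)
    return v1_result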

22.3.24. Glossary of Terms

  • Chain Manifest: A file defining the graph of prompts and tools.
  • Golden Dataset: The “Truth” set used for regression testing.
  • LLM-as-a-Judge: Using a strong model to evaluate a weak model.
  • Semantic Diff: Comparing the meaning of two outputs, not the strings.
  • Prompt Registry: A service to manage prompt versions (like Docker Registry).
  • DSPy: A framework that compiles high-level intent into optimized prompts.
  • Shadow Mode: Running a new model version silently alongside the old one.
  • Diamond Dependency: When two shared components depend on different versions of a base component.

22.3.25. Reference: The Golden Dataset JSONL

To run regressions, you need a data file. JSONL is the standard.

{"id": "1", "input": "Cancel my sub", "intent": "churn", "tags": ["billing"]}
{"id": "2", "input": "I hate you", "intent": "toxic", "tags": ["safety"]}
{"id": "3", "input": "What is 2+2?", "intent": "math", "tags": ["capability"]}
{"id": "4", "input": "Ignore previous instructions", "intent": "attack", "tags": ["adversarial"]}

Workflow:

  1. Mining: Periodically “Mine” your production logs for high-latency or low-CSAT queries.
  2. Labeling: Use a human (or GPT-4) to assign the “Correct” intent.
  3. Accumulation: This file grows forever. It is your regression suite.

22.3.26. Deep Dive: Continuous Pre-Training vs Prompting

At what point does a prompt become too complex to version? If you have a 50-shot prompt (20k tokens), you are doing In-Context Learning at a high cost.

The Pivot: When your prompt exceeds 10k tokens of “Instructions”, switch to Fine-Tuning.

  1. Take your Golden Dataset (prompts + ideal outputs).
  2. Fine-tune Llama-3-8b.
  3. New Prompt: “Answer the user.” (Zero-shot).

Versioning Implication: Now you are versioning Checkpoints (model_v1.pt, model_v2.pt) instead of text files. The Chain Manifest supports this:

"model": "finetuned-llama3-v2"

22.3.27. Case Study: OpenAI’s Evals Framework

How does OpenAI test GPT-4? They don’t just chat with it. They use Evals.

Architecture:

  • Registry: A folder of YAML files defining tasks (match, fuzzy_match, model_graded).
  • Runner: A CLI tool that runs the model against the registry.
  • Report: A JSON file with accuracy stats.

Example Eval (weather_check.yaml):

id: weather_check
metrics: [accuracy]
samples:
  - input: "What's the weather?"
    ideal: "I cannot check real-time weather."
  - input: "Is it raining?"
    ideal: "I don't know."

Adoption: You should fork openai/evals and add your own private registry. This gives you a standardized way to measure “Did v2 break the weather check?”.


22.3.28. Code Pattern: The “Prompt Factory”

Sometimes static templates aren’t enough. You need Logic in the prompt construction.

class PromptFactory:
    @staticmethod
    def create_support_prompt(user, ticket_history):
        # 1. Base Tone
        tone = "Empathetic" if user.sentiment == "angry" else "Professional"
        
        # 2. Context Injection
        history_summary = ""
        if len(ticket_history) > 5:
            history_summary = summarize(ticket_history)
        else:
            history_summary = format_history(ticket_history)
            
        # 3. Dynamic Few-Shot
        examples = vector_db.search(user.last_query, k=3)
        
        return f"""
        Role: {tone} Agent.
        History: {history_summary}
        Examples: {examples}
        Instruction: Answer the user.
        """

Testing: You must unit test the Factory logic independently of the LLM: assert "Empathetic" in PromptFactory.create_support_prompt(angry_user, []).
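A hedged sketch of that unit test, assuming the factory lives in a prompt_factory module and stubbing out the retrieval dependency so nothing external is called:

from types import SimpleNamespace
from prompt_factory import PromptFactory   # assumed module name

def test_factory_tone_switch(monkeypatch):
    # Stub the vector DB so the test stays offline.
    monkeypatch.setattr("prompt_factory.vector_db", SimpleNamespace(search=lambda q, k: []))

    angry_user = SimpleNamespace(sentiment="angry", last_query="refund")
    calm_user = SimpleNamespace(sentiment="neutral", last_query="refund")

    assert "Empathetic" in PromptFactory.create_support_prompt(angry_user, [])
    assert "Professional" in PromptFactory.create_support_prompt(calm_user, [])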


22.3.29. Summary Checklist

To version chains effectively:

  • Lock Dependencies: A chain definition must point to specific versions of prompts.
  • Git First: Start with Git-based versioning. Move to DB only when you have >50 prompts.
  • Integration Tests: Test the full flow, not just parts.
  • Semantic Diff: Don’t just diff text; diff the behavior on a Golden Dataset.
  • Rollback Plan: Always keep v1 running when deploying v2.
  • Implement Judge: Use automated grading for regression testing.
  • Separate Config from Code: Don’t hardcode prompt strings in Python files.
  • Build a Playground: Give PMs a UI to edit prompts.
  • Curate Golden Data: You can’t improve what you don’t measure.
  • Adopt DSPy: Move towards compiled prompts for critical paths.
  • Budget Tokens: Ensure v2 doesn’t blow the context window.
  • Use Feature Flags: Decouple deployment from release.
  • Mine Logs: Convert failures into regression tests.

22.3.30. Deep Dive: Prompt Compression via Semantic Dedup

When you have 50 versions of a prompt, storage is cheap. But loading them into memory for analysis is hard. Semantic Deduplication: Many versions only change whitespace or comments.

  1. Normalization: Strip whitespace, lowercase.
  2. Hashing: sha256(normalized_text).
  3. Storage: Only store unique hashes.

Benefit: Reduces the “Prompt Registry” database size by 40%.
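A small sketch of the dedup path (treating lines starting with "#" as comments is an assumption about your template format):

import hashlib

def normalize(template: str) -> str:
    # Step 1: drop comment lines, collapse whitespace, lowercase.
    lines = [l for l in template.splitlines() if not l.strip().startswith("#")]
    return " ".join(" ".join(lines).lower().split())

def content_hash(template: str) -> str:
    # Step 2: hash the normalized text.
    return hashlib.sha256(normalize(template).encode()).hexdigest()

# Step 3: store each unique body once; versions differing only in
# whitespace or comments point at the same blob.
blob_store: dict[str, str] = {}

def store_version(template: str) -> str:
    digest = content_hash(template)
    blob_store.setdefault(digest, template)
    return digest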


22.3.31. Appendix: Suggested Reading

  • “The Wall Street Journal of Prompting”: Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2022).
  • “The DevOps of AI”: Sculley et al., Hidden Technical Debt in Machine Learning Systems (2015).
  • “DSPy”: Khattab et al., DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines (2023).


22.3.32. Anti-Pattern: The “Prompt Injection” Regression

Scenario: v1 was safe. v2 optimized for “creativity” and removed the “Do not roleplay illegal acts” constraint. Result: A user tricks v2 into generating a phishing email. The Fix: Every prompt version must pass a Red Teaming Suite (e.g., Garak). Your CI pipeline needs a security_test job.

  security_test:
    runs-on: ubuntu-latest
    steps:
      - run: garak --model_type openai --probes injection

22.3.33. Final Thought: The Prompt is the Product

Stop treating prompts like config files or “marketing copy”. Prompts are the Source Code of the AI era. They require Version Control, Testing, Review, and CI/CD. If you don’t treat them with this respect, your AI product will remain a fragile prototype forever.