
21.1 Prompt Versioning: Git vs. Database

In the early days of LLMs, prompts were hardcoded strings in Python files: response = openai.Completion.create(prompt=f"Summarize {text}")

This is the “Magic String” Anti-Pattern. It leads to:

  1. No History: “Who changed the prompt yesterday? Why is the bot rude now?”
  2. No Rollbacks: “V2 is broken, how do I go back to V1?”
  3. Engineering Bottleneck: Product Managers want to iterate on text, but they need to file a Pull Request to change a Python string.

This chapter solves the Prompt Lifecycle Management problem.


1. The Core Debate: Code vs. Data

Is a prompt “Source Code” (Logic) or “Config” (Data)?

1.1. Strategy A: Prompts as Code (Git)

Treat prompts like functions. Store them in .yaml or .jinja2 files in the repo.

  • Pros:
    • Versioning is free (Git).
    • Code Review (PRs) is built-in.
    • CI/CD runs automatically on change.
  • Cons:
    • Non-technical people (Subject Matter Experts) cannot edit them easily.
    • Release velocity is tied to App deployment velocity.

1.2. Strategy B: Prompts as Data (Database/CMS)

Store prompts in a Postgres DB or a SaaS (PromptLayer, W&B). Fetch them at runtime.

  • Pros:
    • Decoupled Deployment: Update prompt without re-deploying the app.
    • UI for PMs/SMEs.
    • A/B Testing is easier (Traffic splitting features).
  • Cons:
    • Latency (Network call to fetch prompt).
    • “Production Surprise”: Someone changes the prompt in the UI, breaking the live app.

1.3. The Hybrid Consensus

“Git for Logic, DB for Content.”

  • Structure (Chain of Thought, Few-Shot Logic) stays in Git.
  • Wording (Tone, Style, Examples) lives in DB/CMS.
  • Or better: Sync Strategy. Edit in UI -> Commit to Git -> Deploy to DB.

2. Strategy A: The GitOps Workflow

If you choose Git (Recommended for Engineering-heavy teams).

2.1. File Structure

Organize by domain/model/version.

/prompts
  /customer_support
    /triage
      v1.yaml
      v2.yaml
      latest.yaml -> symlink to v2.yaml

2.2. The Prompts.yaml Standard

Do not use .txt. Use structured YAML to capture metadata.

id: support_triage_v2
version: 2.1.0
model: gpt-4-turbo
parameters:
  temperature: 0.0
  max_tokens: 500
input_variables: ["ticket_body", "user_tier"]
template: |
  You are a triage agent.
  User Tier: {{user_tier}}
  Ticket: {{ticket_body}}
  
  Classify urgency (High/Medium/Low).
tests:
  - inputs: { "ticket_body": "My server is on fire", "user_tier": "Free" }
    assert_contains: "High"
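
To make that tests block actionable, here is a minimal sketch of a CI runner for it, assuming the YAML layout above; run_llm() is a placeholder for whatever model client you use.

import yaml
from jinja2 import Template

def run_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder: call your provider here

def run_prompt_tests(path: str) -> None:
    with open(path) as f:
        spec = yaml.safe_load(f)
    for case in spec.get("tests", []):
        rendered = Template(spec["template"]).render(**case["inputs"])
        output = run_llm(rendered)
        assert case["assert_contains"] in output, (
            f"{spec['id']}: expected {case['assert_contains']!r} in output"
        )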

2.3. Loading in Python

Write a simple PromptLoader that caches these files.

import os
import yaml
from jinja2 import Template

class PromptLoader:
    def __init__(self, prompt_dir="./prompts"):
        self.cache = {}
        self.load_all(prompt_dir)

    def load_all(self, prompt_dir):
        # Walk the prompts directory once and cache every YAML spec by its id
        for root, _, files in os.walk(prompt_dir):
            for name in files:
                if name.endswith(".yaml"):
                    with open(os.path.join(root, name)) as f:
                        spec = yaml.safe_load(f)
                        self.cache[spec["id"]] = spec

    def get(self, prompt_id, **kwargs):
        p = self.cache[prompt_id]
        t = Template(p["template"])
        return t.render(**kwargs)

3. Strategy B: The Database Registry

If you need dynamic updates (e.g., A/B tests), you need a DB.

3.1. Schema Design (Postgres)

We need to support Immutability. Never UPDATE a prompt. Only INSERT.

CREATE TABLE prompt_definitions (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL, -- e.g. "checkout_flow"
    version INT NOT NULL,
    template TEXT NOT NULL,
    model_config JSONB, -- { "temp": 0.7 }
    created_at TIMESTAMP DEFAULT NOW(),
    author VARCHAR(100),
    is_active BOOLEAN DEFAULT FALSE,
    UNIQUE (name, version)
);

-- Index for fast lookup of "latest"
CREATE INDEX idx_prompt_name_ver ON prompt_definitions (name, version DESC);

3.2. Caching Layer (Redis)

You cannot hit Postgres on every LLM call; the extra network round-trip adds latency to every request.

  • Write Path: New Prompt -> Postgres -> Redis Pub/Sub -> App instances clear their in-memory cache (sketched below).
  • Read Path: App Memory -> Redis -> Postgres.
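
A minimal sketch of the write path with redis-py; the channel name and the db_insert_prompt() helper are illustrative, and each app instance runs the listener in a background thread.

import json
import redis

r = redis.from_url("redis://localhost:6379/0")
INVALIDATION_CHANNEL = "prompt_cache_invalidation"  # illustrative channel name

def publish_new_version(name: str, version: int, template: str):
    # db_insert_prompt(name, version, template)  # assumed Postgres helper
    r.set(f"prompt:{name}:{version}", json.dumps({"template": template}))
    r.publish(INVALIDATION_CHANNEL, json.dumps({"name": name}))

def invalidation_listener(local_cache: dict):
    # Each app instance drops its in-memory copy when a new version lands
    pubsub = r.pubsub()
    pubsub.subscribe(INVALIDATION_CHANNEL)
    for msg in pubsub.listen():
        if msg["type"] == "message":
            local_cache.pop(json.loads(msg["data"])["name"], None)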

3.3. The “Stale Prompt” Safety Mechanism

What if the DB is down?

  • Pattern: Bake the “Last Known Good” version into the Container Image as a fallback.
  • If reg.get("checkout") fails, load ./fallbacks/checkout.yaml.

4. Hands-On Lab: Building the Registry Client

Let’s build a production-grade Python client that handles Versioning and Fallbacks.

4.1. The Interface

class PromptRegistryClient:
    def get_prompt(self, name: str, version: str = "latest", tags: list = None) -> PromptObject:
        pass

4.2. Implementation

import redis
import json

class Registry:
    def __init__(self, redis_url):
        self.redis = redis.from_url(redis_url)

    def get(self, name, version="latest"):
        cache_key = f"prompt:{name}:{version}"

        # 1. Try cache
        data = self.redis.get(cache_key)
        if data:
            return json.loads(data)

        # 2. Try DB (mocked here); on success, write back to Redis with a TTL, e.g.:
        # prompt = db.fetch(name, version)
        # if prompt:
        #     self.redis.setex(cache_key, 300, json.dumps(prompt))
        #     return prompt

        # 3. Fall back to the "Last Known Good" file baked into the image
        try:
            with open(f"prompts/{name}.json") as f:
                print("⚠️ Serving local fallback")
                return json.load(f)
        except FileNotFoundError:
            raise Exception(f"Prompt {name} missing in DB and on disk.")

    def render(self, name, variables, version="latest"):
        p = self.get(name, version)
        return p['template'].format(**variables)
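
Example wiring against a local Redis; the prompt name and fallback file are illustrative:

reg = Registry("redis://localhost:6379/0")
text = reg.render("checkout", {"user_name": "Alice"})  # served from Redis if cached, else from prompts/checkout.json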

5. Migration Strategy: Git to DB

How do you move a team from Git files to a DB Registry?

5.1. The Deployment Hook

Do not make devs manually insert SQL. Add a step in CI/CD (GitHub Actions).

  1. Developer: Edits prompts/login.yaml. Pushes to Git.
  2. CI/CD:
    • Parses YAML.
    • Checks if content differs from “latest” in DB.
    • If changed, INSERT INTO prompts ... (New Version).
    • Tags it sha-123.

This gives you the Best of Both Worlds:

  • Git History for Blame/Review.
  • DB for dynamic serving and tracking.

6. A/B Testing Prompts

The main reason to use a DB is traffic splitting. “Is the ‘Polite’ prompt better than the ‘Direct’ prompt?”

6.1. The Traffic Splitter

In the Registry, define a “Split Config”.

{
  "name": "checkout_flow",
  "strategies": [
    { "variant": "v12", "weight": 0.9 },
    { "variant": "v13", "weight": 0.1 }
  ]
}

6.2. Deterministic Hashing

Use the user_id to determine the variant. Do not use random(). If User A sees “Variant B” today, they must see “Variant B” tomorrow.

import hashlib

def get_variant(user_id, split_config):
    # Hash user_id to 0-100
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    
    cumulative = 0
    for strat in split_config['strategies']:
        cumulative += strat['weight'] * 100
        if hash_val < cumulative:
            return strat['variant']
            
    return split_config['strategies'][0]['variant']
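
With the split config from 6.1, roughly 90% of users land on v12, and the assignment is stable across calls:

split_config = {
    "name": "checkout_flow",
    "strategies": [
        {"variant": "v12", "weight": 0.9},
        {"variant": "v13", "weight": 0.1},
    ],
}
assert get_variant("user_42", split_config) == get_variant("user_42", split_config)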

In the next section, we will discuss how to Evaluate these variants to decide if V13 is actually better than V12.


7. Unit Testing for Prompts

How do you “Test” a prompt? You can’t assert the exact output string because LLMs are probabilistic. But you can assert:

  1. Format: Is it valid JSON?
  2. Determinism: Does the template render correctly?
  3. Safety: Does it leak PII?

7.1. Rendering Tests

Before sending to OpenAI, test the Jinja2 template.

import pytest
from jinja2 import Environment, StrictUndefined
from jinja2.exceptions import UndefinedError

def test_prompt_rendering():
    # Ensure no {{variable}} is left unrendered: StrictUndefined makes Jinja2
    # raise instead of silently emitting an empty string.
    env = Environment(undefined=StrictUndefined)
    template = env.from_string("Hello {{name}}")

    with pytest.raises(UndefinedError):
        template.render()  # 'name' is missing, so rendering must fail

Ops Rule: Your CI pipeline must fail if a prompt variable is renamed in Python but not in the YAML.
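
One way to enforce that rule: a pytest check that extracts the variables actually used in the template (via the Jinja2 AST) and compares them against input_variables. A sketch assuming the YAML layout from 2.2; the file path is illustrative.

import yaml
from jinja2 import Environment, meta

def declared_vs_used(path: str):
    with open(path) as f:
        spec = yaml.safe_load(f)
    env = Environment()
    used = meta.find_undeclared_variables(env.parse(spec["template"]))
    declared = set(spec.get("input_variables", []))
    return declared, used

def test_variables_in_sync():
    declared, used = declared_vs_used("prompts/customer_support/triage/v2.yaml")
    assert declared == used, f"Drift between YAML and template: {declared ^ used}"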

7.2. Assertions (The “Vibe Check” Automator)

Use a library like pytest combined with lightweight LLM checks.

# test_prompts.py
import pytest
from llm_client import run_llm
from prompts import load_prompt       # assumed project helpers for loading templates
from evals import classify_tone       # and for lightweight tone classification

@pytest.mark.parametrize("name", ["Alice", "Bob"])
def test_greeting_tone(name):
    prompt_template = load_prompt("greeting_v2")
    prompt = prompt_template.format(name=name)
    
    response = run_llm(prompt, temperature=0)
    
    # 1. Structure Check
    assert len(response) < 100
    
    # 2. Semantic Check (Simple)
    assert "Polite" in classify_tone(response)
    
    # 3. Negative Constraint
    assert "I hate you" not in response

8. Localization (I18N) for Prompts

If your app supports 20 languages, do you write 20 prompts? No. Strategy: English Logic, Localized Content.

8.1. The “English-First” Pattern

LLMs reason best in English, even if the user asks in Japanese. The flow (sketched after the list below):

  1. User (JP): “Konnichiwa…”
  2. App: Translate to English.
  3. LLM (EN): Reason about the query. Generate English response.
  4. App: Translate to Japanese.
  • Pros: Consistent logic. Easier debugging.
  • Cons: Latency (2x translation steps). Loss of nuance.
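
A sketch of that flow; translate() is a hypothetical MT helper and run_llm() is the model client used in earlier sections.

def answer_in_user_language(user_message: str, user_lang: str) -> str:
    english_query = translate(user_message, source=user_lang, target="en")        # Step 2
    english_answer = run_llm(f"Answer the customer question:\n{english_query}")   # Step 3
    return translate(english_answer, source="en", target=user_lang)               # Step 4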

8.2. The “Native Template” Pattern

Use Jinja2 to swap languages.

# customer_service.yaml
variants:
  en: "You are a helpful assistant."
  es: "Eres un asistente útil."
  fr: "Vous êtes un assistant utile."

# loader.py
def get_prompt(prompt_id, lang="en"):
    p = registry.get(prompt_id)
    template = p['variants'].get(lang, p['variants']['en'])  # Fall back to EN
    return template

Ops Challenge: Maintaining feature parity. If you update English v2 to include “Ask for email”, you must update es and fr. Tool: Use GPT-4 to auto-translate the diffs in your CI/CD pipeline.


9. Semantic Versioning for Prompts

What is a v1.0.0 vs v2.0.0 prompt change?

9.1. MAJOR (Breaking)

  • Changing input_variables. (e.g., removing {user_name}).
    • Why: Breaks the Python code calling .format().
  • Changing Output Format. (e.g., JSON -> XML).
    • Why: Breaks the response parser.

9.2. MINOR (Feature)

  • Adding a new Few-Shot example.
  • Changing the System Instruction significantly (“Be rude” -> “Be polite”).
  • Why: Logic changes, but code signatures remain compatible.

9.3. PATCH (Tweak)

  • Fixing a typo.
  • Changing whitespace.

Ops Rule: Enforce SemVer in your Registry. A MAJOR change must trigger a new deployment of the App Code. MINOR and PATCH can be hot-swapped via DB.
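
A sketch of how the registry could suggest the bump automatically, assuming prompt specs shaped like the YAML in 2.2 plus an optional output_format field (an assumption). Distinguishing a typo fix (PATCH) from a real wording change (MINOR) still needs a human eye.

def classify_change(old: dict, new: dict) -> str:
    if set(old["input_variables"]) != set(new["input_variables"]):
        return "MAJOR"   # breaks the calling code's .format()/render()
    if old.get("output_format") != new.get("output_format"):
        return "MAJOR"   # breaks the response parser
    if old["template"].split() != new["template"].split():
        return "MINOR"   # wording or logic changed
    return "PATCH"       # only whitespace differs

def next_version(current: str, bump: str) -> str:
    major, minor, patch = map(int, current.split("."))
    if bump == "MAJOR":
        return f"{major + 1}.0.0"
    if bump == "MINOR":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"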


10. Reference: The Registry ORM (SQLAlchemy)

A production-ready ORM definition for the Registry.

from sqlalchemy import Column, Integer, String, JSON, DateTime, UniqueConstraint
from sqlalchemy.orm import declarative_base
from datetime import datetime

Base = declarative_base()

class PromptVersion(Base):
    __tablename__ = 'prompt_versions'
    
    id = Column(Integer, primary_key=True)
    name = Column(String(255), index=True)
    version = Column(Integer)
    
    # content
    template = Column(String) # The Jinja string
    input_variables = Column(JSON) # ["var1", "var2"]
    
    # metadata
    model_settings = Column(JSON) # {"temp": 0.7, "stop": ["\n"]}
    tags = Column(JSON) # ["prod", "experiment-A"]
    
    created_at = Column(DateTime, default=datetime.utcnow)
    
    __table_args__ = (
        UniqueConstraint('name', 'version', name='_name_version_uc'),
    )

    def to_langchain(self):
        from langchain.prompts import PromptTemplate
        return PromptTemplate(
            template=self.template,
            input_variables=self.input_variables
        )

Usage with FastAPI:

@app.post("/prompts/render")
def render_prompt(req: RenderRequest, db: Session = Depends(get_db)):
    # RenderRequest is a Pydantic model with: name, version, variables (dict)
    # 1. Fetch
    prompt = db.query(PromptVersion).filter_by(
        name=req.name,
        version=req.version
    ).first()
    if prompt is None:
        raise HTTPException(404, f"Prompt {req.name} v{req.version} not found")

    # 2. Validate inputs
    missing = set(prompt.input_variables) - set(req.variables.keys())
    if missing:
        raise HTTPException(400, f"Missing variables: {missing}")

    # 3. Render
    return {"text": prompt.template.format(**req.variables)}

11. Cost Ops: Prompt Compression

Managing prompts is also about managing Length. If v1 of a prompt is 4,000 tokens and v2 is 8,000 tokens, you just doubled your cloud bill.

11.1. Compression Strategies

  1. Stop Words Removal: “The”, “A”, “Is”. (Low impact).
  2. Summarization: Use a cheap model (GPT-3.5) to summarize the History context before feeding it to GPT-4.
  3. LLMLingua: A structured compression method (Microsoft).
    • Uses a small language model (LLaMA-7B) to calculate the perplexity of each token.
    • Removes tokens with low perplexity (low information density).
    • Result: 20x compression with minimal accuracy loss.

11.2. Implementation

# pip install llmlingua
from llmlingua import PromptCompressor

compressor = PromptCompressor()
original_prompt = "..." # Long context

compressed = compressor.compress_prompt(
    original_prompt,
    instruction="Summarize this",
    question="What is X?",
    target_token=500
)

print(f"Compressed from {len(original_prompt)} to {len(compressed['compressed_prompt'])}")
# Send compressed['compressed_prompt'] to GPT-4

12. Comparison: Template Engines

Which syntax should your Registry use?

| Engine | Syntax | Pros | Cons | Verdict |
|---|---|---|---|---|
| f-strings | {var} | Python native. Fast. Zero deps. | Security risk (arbitrary code execution if combined with eval). No loops or logic. | Good for prototypes. |
| Mustache | {{var}} | Logic-less. Multi-language support (JS, Go, Py). | No if/else logic. Hard to handle complex few-shot lists. | Good for cross-platform. |
| Jinja2 | {% if x %} | Powerful logic. Loops. Filters. | Python-specific. | The industry standard. |
| LangChain | {var} | Built into the framework. | Proprietary syntax quirks. | Use if you are already on LangChain. |

13. Glossary of Prompt Management

  • Prompt Registry: A centralized database to store, version, and fetch prompts.
  • System Prompt: The initial instruction ("You are a helpful assistant") that sets the behavior.
  • Zero-Shot: Asking for a completion without examples.
  • Few-Shot: Providing examples (input -> output) in the context.
  • Jinja2: The templating engine used to inject variables into prompts.
  • Prompt Injection: A security exploit where user input overrides system instructions.
  • Token: The atomic unit of cost.
  • Context Window: The maximum memory of the model (e.g. 128k tokens).

14. Bibliography

1. “Jinja2 Documentation”

  • Pallets Projects: The reference for templating syntax.

2. “LLMLingua: Compressing Prompts for Accelerated Inference”

  • Jiang et al. (Microsoft) (2023): The paper on token dropping optimization.

3. “The Art of Prompt Engineering”

  • OpenAI Cookbook: Getting started guide.

15. Final Checklist: The “PromptOps” Maturity Model

How mature is your organization?

  • Level 0 (Chaos): Hardcoded string literals in Python code.
  • Level 1 (Structured): Prompts in prompts.py file constants.
  • Level 2 (GitOps): Prompts in generic .yaml files in Git.
  • Level 3 (Registry): Database-backed registry with a UI/CMS.
  • Level 4 (Automated): A/B testing framework automatically promoting the winner.

End of Chapter 21.1.


16. Deep Dive: The Hybrid Git+DB Architecture

We said “Git for Logic, DB for Content”. How do you build that?

16.1. The Sync Script

We need a script that runs on every CI/CD deploy. It reads the /prompts directory and upserts into Postgres.

# sync_prompts.py
import os
import yaml
import hashlib
from sqlalchemy.orm import Session
from database import Engine, PromptVersion  # PromptVersion needs 'hash' and 'author' columns for this script

def calculate_hash(content):
    return hashlib.sha256(content.encode()).hexdigest()

def sync(directory):
    session = Session(Engine)

    for file in os.listdir(directory):
        if not file.endswith(".yaml"):
            continue

        with open(os.path.join(directory, file)) as f:
            data = yaml.safe_load(f)

        content_hash = calculate_hash(data['template'])

        # Skip if this exact content is already registered
        existing = session.query(PromptVersion).filter_by(
            name=data['id'],
            hash=content_hash
        ).first()

        if existing:
            print(f"Skipping {data['id']} (no change)")
            continue

        # Create a new version
        latest = (
            session.query(PromptVersion)
            .filter_by(name=data['id'])
            .order_by(PromptVersion.version.desc())
            .first()
        )
        new_ver = (latest.version + 1) if latest else 1

        pv = PromptVersion(
            name=data['id'],
            version=new_ver,
            template=data['template'],
            hash=content_hash,
            author="system (git)"
        )
        session.add(pv)
        print(f"Deployed {data['id']} v{new_ver}")

    session.commit()

16.2. The UI Overlay

The “Admin Panel” reads from DB. If a PM edits a prompt in the Admin Panel:

  1. We save a new version v2.1 (draft) in the DB.
  2. We allow them to “Test” it in the UI.
  3. We do not promote it to latest automatically.
  4. Option A: The UI generates a Pull Request via the GitHub API to update the YAML file (sketched after this list).
  5. Option B: The UI updates DB, and the App uses DB. Git becomes “Backup”.
    • Recommendation: Option A (Git as Truth).
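
A sketch of Option A using the GitHub REST API via requests; the repository, paths, and token handling are illustrative, and error handling is omitted for brevity.

import base64
import requests

API = "https://api.github.com/repos/acme/llm-app"   # hypothetical repo
HEADERS = {"Authorization": "Bearer <GITHUB_TOKEN>", "Accept": "application/vnd.github+json"}

def open_prompt_pr(prompt_id: str, new_yaml: str, version: int):
    branch = f"prompt/{prompt_id}-v{version}"

    # 1. Branch off main
    main_sha = requests.get(f"{API}/git/ref/heads/main", headers=HEADERS).json()["object"]["sha"]
    requests.post(f"{API}/git/refs", headers=HEADERS,
                  json={"ref": f"refs/heads/{branch}", "sha": main_sha})

    # 2. Commit the edited YAML as a new version file on that branch
    path = f"prompts/{prompt_id}/v{version}.yaml"
    requests.put(f"{API}/contents/{path}", headers=HEADERS, json={
        "message": f"PM edit: {prompt_id} v{version}",
        "content": base64.b64encode(new_yaml.encode()).decode(),
        "branch": branch,
    })

    # 3. Open the PR so the normal review gate applies
    requests.post(f"{API}/pulls", headers=HEADERS,
                  json={"title": f"Prompt update: {prompt_id} v{version}",
                        "head": branch, "base": "main"})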

17. Operational Efficiency: Semantic Caching

If two users ask the same thing, we pay twice. Exact-match caching fails on trivial differences (“Hello” vs “Hello ” with a trailing space). Semantic Caching saves money.

17.1. Architecture

  1. User Query: “How do I reset password?”
  2. Embed: [0.1, 0.2, ...]
  3. Vector Search (Redis VSS): Find neighbors.
  4. Found: “Reset my pass” (Distance 0.1).
  5. Action: Return cached answer.

17.2. Implementation with GPTCache

GPTCache is a popular open-source library for this.

from gptcache import cache, Config
from gptcache.adapter import openai          # drop-in replacement for the openai module
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# 1. Init
onnx = Onnx()
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=get_data_manager(CacheBase("sqlite"),
                                  VectorBase("faiss", dimension=onnx.dimension)),
    similarity_evaluation=SearchDistanceEvaluation(),
    config=Config(similarity_threshold=0.9),  # strict match
)
cache.set_openai_key()

# 2. Use the patched OpenAI adapter: similar queries hit the semantic cache
def cached_completion(prompt):
    return openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )

# 3. Validation
# First call: ~3000 ms (API)
# Second call: ~10 ms (cache)

17.3. The Cache Invalidation Problem

If you update the Prompt Template (v1 -> v2), all cache entries are invalid. Ops Rule: Cache Key must include prompt_version_hash. Key = Embed(UserQuery) + Hash(SystemPrompt).
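
A minimal sketch of that rule: derive the cache namespace from a hash of the active prompt template, so shipping v2 automatically starts writing (and reading) a fresh namespace.

import hashlib

def cache_namespace(prompt_name: str, prompt_template: str) -> str:
    version_hash = hashlib.sha256(prompt_template.encode()).hexdigest()[:12]
    return f"semcache:{prompt_name}:{version_hash}"   # old entries simply stop being hit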


18. Governance: RBAC for Prompts

Who controls the brain of the AI?

18.1. Roles

  1. Developer: Full access to Code and YAML.
  2. Product Manager: Can Edit Content in UI. Cannot deploy to Prod without approval.
  3. Legal/Compliance: Read-Only. Can flag prompts as “Unsafe”.
  4. System: CI/CD bot.

18.2. Approval Workflow

Implementing “Prompt Review” gates.

  • Trigger: Any change to prompts/legal/*.yaml.
  • Gate: CI fails unless CODEOWNERS (@legal-team) approves PR.
  • Why: You don’t want a dev accidentally changing the liability waiver.

19. Case Study: Migration from “Magic Strings”

You joined a startup. They have 50 files with f"Translate {x}". How do you fix this?

Phase 1: Discovery (Grep)

Run grep -r "openai.Chat" . to build an inventory; suppose it turns up 32 call sites.

Phase 2: Refactor (The “Proxy”)

Create registry.py with a simple mapping. Don’t move to DB yet. Just move strings to one file.

# prompts.py
PROMPTS = {
    "translation": "Translate {x}",
    "summary": "Summarize {x}"
}

# In app code, replace literal with:
# prompt = PROMPTS["translation"].format(x=...)

Phase 3: Externalize (YAML)

Move dictionary to prompts.yaml. Ops Team can now see them.

Phase 4: Instrumentation (W&B)

Add W&B Tracing. Discover that “Summary” fails 20% of the time.

Phase 5: Optimization

Now you can iterate on “Summary” in the YAML without touching the App Code. Result: You lowered error rate to 5%. Value: You proved MLOps ROI.


20. Tooling: The Magic-String Hunter

A script to hunt down magic strings and propose a refactor.

import ast
import os

class PromptHunter(ast.NodeVisitor):
    def visit_Call(self, node):
        # Look for openai.ChatCompletion.create(...) style calls
        if isinstance(node.func, ast.Attribute) and node.func.attr == 'create':
            print(f"Found OpenAI call at line {node.lineno}")
            # Analyze arguments for 'messages'
            for keyword in node.keywords:
                if keyword.arg == 'messages':
                    print("  Inline 'messages' argument found. Manual review needed.")
        self.generic_visit(node)  # keep walking into nested calls

def scan(directory):
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".py"):
                with open(os.path.join(root, file)) as f:
                    try:
                        tree = ast.parse(f.read())
                        print(f"Scanning {file}...")
                        PromptHunter().visit(tree)
                    except SyntaxError:
                        pass  # skip files that don't parse

if __name__ == "__main__":
    scan("./src")

21. Summary

We have built a System of Record for our prompts. No more magic strings. No more “Who changed that?”. No more deploying code to fix a typo.

We have Versioned, Tested, and Localized our probabilistic logic. Now, we need to know if our prompts are any good. Metrics like “Accuracy” are fuzzy in GenAI. In the next chapter, we build the Evaluation Framework (21.2).


22. Architecture Patterns: The Prompt Middleware

We don’t just call the registry. We often need “Interceptors”.

22.1. The Chain of Responsibility Pattern

A request goes through layers:

  1. Auth Layer: Checks JWT.
  2. Rate Limit Layer: Checks Redis quota.
  3. Prompt Layer: Fetches template from Registry.
  4. Guardrail Layer: Scans input for Injection.
  5. Cache Layer: Checks semantic cache.
  6. Model Layer: Calls Azure/OpenAI.
  7. Audit Layer: Logs result to Data Lake.

Code Skeleton:

class Middleware:
    def __init__(self, next=None):
        self.next = next              # the next layer in the chain

    def process(self, req):
        # pre-hook
        resp = self.next.process(req)
        # post-hook
        return resp

class PromptMiddleware(Middleware):
    def process(self, req):
        prompt = registry.get(req.prompt_id)
        req.rendered_text = prompt.format(**req.vars)
        return self.next.process(req)

22.2. The Circuit Breaker Pattern

If OpenAI is down, or latency exceeds 5 seconds, the breaker trips and traffic shifts to a fallback (a sketch follows the list below).

  • State: Closed (Normal), Open (Failing), Half-Open (Testing).
  • Fallback: If State == Open, switch to Azure or Llama-Local.
  • Registry Implication: Your Registry must store multiple model configs for the same prompt ID.
    • v1 (Primary): gpt-4
    • v1 (Fallback): gpt-3.5-turbo
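
A minimal circuit-breaker sketch around the model call; the threshold, reset timeout, and fallback callable are illustrative.

import time

class CircuitBreaker:
    # Closed = normal, Open = fail fast to the fallback, Half-Open = probe with one call
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None   # None means the breaker is closed

    def call(self, primary, fallback, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                return fallback(*args, **kwargs)      # Open: skip the primary entirely
            # Half-Open: the timeout elapsed, let one request probe the primary
        try:
            result = primary(*args, **kwargs)
            self.failures, self.opened_at = 0, None   # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            return fallback(*args, **kwargs)

# Usage: breaker.call(call_gpt4, call_gpt35_fallback, rendered_prompt)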

23. The Tooling Landscape: Build vs. Buy

You can build the Registry (as we did), or buy it.

23.1. General Purpose (Encouraged)

  • Weights & Biases (W&B):
    • Pros: You likely already use it for Training. “Prompts” are just artifacts. Good visualization.
    • Cons: No real-time serving-latency SLA. Use it for logging, not serving.
  • MLflow:
    • Pros: Open Source. “AI Gateway” feature.
    • Cons: Heavyweight to self-host and operate.

23.2. Specialized PromptOps (Niche)

  • LangSmith:
    • Pros: Essential if using LangChain. “Playground” is excellent.
    • Cons: Vendor lock-in risk.
  • Helicone:
    • Pros: Focus on Caching and Analytics. “Proxy” architecture (change 1 line of URL).
    • Cons: Smaller ecosystem.
  • PromptLayer:
    • Pros: Great visual CMS for PMs.
    • Cons: Another SaaS bill.

Verdict:

  • Start with Git + W&B (Logging).
  • Move to Postgres + Redis (Serving) when you hit 10k users.
  • Use Helicone if you purely want Caching/Monitoring without build effort.

24. Comparison: Configuration Formats

We used YAML. Why not JSON?

| Format | Readability | Comments? | Multi-line strings? | Verdict |
|---|---|---|---|---|
| JSON | Low (quotes everywhere) | No | No (needs \n) | Bad. Hard for humans to write prompts in. |
| YAML | High | Yes | Yes (block scalar) | Best. The chapter's choice. |
| TOML | High | Yes | Yes (using """) | Good. Popular in Rust/Python config. |
| Python | Medium | Yes | Yes | Okay, but dangerous (arbitrary execution). |

Why YAML Wins: The | block operator.

template: |
  You are a helpful assistant.
  You answer in haikus.

This preserves newlines perfectly without ugly \n characters.


25. Final Ops Checklist: The “Prompt Freeze”

Before Black Friday (or Launch Day):

  1. Registry Lock: Revoke “Write” access to the Registry for all non-Admins.
  2. Cache Warmup: Run a script to populate Redis with the top 1,000 queries (see the sketch after this list).
  3. Fallback Verification: Kill the OpenAI connection and ensure the app switches to Azure (or error handles gracefully).
  4. Token Budget: Verify current burn rate projected against traffic spike.
  5. Latency Budget: Verify P99 is under 2s.
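
A sketch of the warm-up step, assuming the cached_completion() helper from 17.2 and an illustrative top_queries.txt file with one query per line.

def warm_semantic_cache(path="top_queries.txt"):
    with open(path) as f:
        queries = [line.strip() for line in f if line.strip()]
    for query in queries:
        cached_completion(query)   # populate the semantic cache before traffic arrives
    print(f"Warmed cache with {len(queries)} queries")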

End of Chapter 21.1. (Proceed to 21.2 for Evaluation Frameworks).


26. Reference Implementation: A File-Backed Registry

A production-grade implementation you can copy-paste.

from typing import List, Optional, Dict, Any
from pydantic import BaseModel, Field, validator
from datetime import datetime
import yaml
import hashlib

# 1. Models
class PromptMetadata(BaseModel):
    author: str
    tags: List[str] = []
    created_at: datetime = Field(default_factory=datetime.utcnow)
    deprecated: bool = False

class ModelConfig(BaseModel):
    provider: str # "openai", "azure"
    model_name: str # "gpt-4"
    parameters: Dict[str, Any] = {} # {"temperature": 0.5}

class PromptVersion(BaseModel):
    id: str # "checkout_flow"
    version: int # 1, 2, 3
    template: str
    input_variables: List[str]
    config: ModelConfig
    metadata: PromptMetadata
    hash: Optional[str] = None
    
    @validator('input_variables')
    def check_template_vars(cls, v, values):
        # Validate that every declared variable appears in the template.
        # Note: a validator only sees fields declared before it, so this check
        # lives on input_variables (template is already present in `values`).
        # Simple string check (in reality use the Jinja AST).
        template = values.get('template', '')
        for i in v:
            token = f"{{{{{i}}}}}"  # {{var}}
            if token not in template:
                raise ValueError(f"Variable {i} declared but not used in template")
        return v
    
    def calculate_hash(self):
        content = f"{self.template}{self.config.json()}"
        self.hash = hashlib.sha256(content.encode()).hexdigest()

# 2. Storage Interface
class RegistryStore:
    def save(self, prompt: PromptVersion):
        raise NotImplementedError
    def get(self, id: str, version: int = None) -> PromptVersion:
        raise NotImplementedError

# 3. File System Implementation
import os
class FileRegistry(RegistryStore):
    def __init__(self, root_dir="./prompts"):
        self.root = root_dir
        os.makedirs(root_dir, exist_ok=True)
        
    def save(self, prompt: PromptVersion):
        prompt.calculate_hash()
        path = f"{self.root}/{prompt.id}_v{prompt.version}.yaml"
        with open(path, 'w') as f:
            yaml.dump(prompt.dict(), f)
            
    def get(self, id: str, version: int = None) -> PromptVersion:
        if version is None:
            # Find latest
            files = [f for f in os.listdir(self.root) if f.startswith(f"{id}_v")]
            if not files:
                raise FileNotFoundError
            # Sort by version number
            version = max([int(f.split('_v')[1].split('.yaml')[0]) for f in files])
            
        path = f"{self.root}/{id}_v{version}.yaml"
        with open(path) as f:
            data = yaml.safe_load(f)
            return PromptVersion(**data)

# 4. Usage
if __name__ == "__main__":
    # Create
    p = PromptVersion(
        id="summarize",
        version=1,
        template="Summarize this: {{text}}",
        input_variables=["text"],
        config=ModelConfig(provider="openai", model_name="gpt-3.5"),
        metadata=PromptMetadata(author="alex")
    )
    
    reg = FileRegistry()
    reg.save(p)
    print("Saved.")
    
    # Load
    p2 = reg.get("summarize")
    print(f"Loaded v{p2.version}: {p2.config.model_name}")

27. Future Architecture: The Prompt Compiler

In 2025, we won’t write prompts. We will write Intent. DSPy (Declarative Self-improving Language Programs) is leading this.

  • You write: Maximize(Accuracy).
  • Compiler: Automatically tries 50 variations of the prompt (“Think step by step”, “Act as expert”) and selects the best one based on your validation set.
  • Ops: The “Prompt Registry” becomes a “Program Registry”. The artifacts are optimized weights/instructions, not human-readable text.
  • Constraint: Requires a labeled validation set (Golden Data).

28. Epilogue

Chapter 21.1 has transformed the “Magic String” into a “Managed Artifact”. But a managed artifact is useless if it’s bad. How do we know if v2 is better than v1? We cannot just “eyeball” it. We need Metrics. Proceed to Chapter 21.2: Evaluation Frameworks.

Further reading:

  • The Pragmatic Programmer: For the ‘Don’t Repeat Yourself’ (DRY) principle applied to prompts.
  • Site Reliability Engineering (Google): For the ‘Error Budget’ concept applied to hallucinations.
  • LangChain Handbook (Pinecone): Excellent patterns for prompt management.