42.4. Observability for Non-Deterministic Agents

Status: Draft | Version: 1.0.0 | Tags: #Observability, #Tracing, #OpenTelemetry, #Rust, #LLM | Author: MLOps Team


Table of Contents

  1. Why “Logs” are Dead for Agents
  2. The Anatomy of a Trace: Chain, Span, Event
  3. OpenTelemetry (OTEL) for LLMs
  4. Rust Implementation: Distributed Agent Tracing
  5. Measuring Hallucinations: The “Eval” Span
  6. Feedback Loops: User Thumbs Up/Down
  7. Infrastructure: ClickHouse for Traces
  8. Troubleshooting: Debugging a Runaway Agent
  9. Future Trends: Standardization (OpenLLMTelemetry)
  10. MLOps Interview Questions
  11. Glossary
  12. Summary Checklist

Why “Logs” are Dead for Agents

In traditional Microservices, logs are linear. Request -> Process -> Response.

In Agents, logs are a Graph. Goal -> Thought 1 -> Action 1 -> Obs 1 -> Thought 2 -> Action 2 -> Obs 2. A single “Run” might trigger 50 LLM calls and 20 Tool calls. Grepping logs for “Error” is useless if you don’t know the Input Prompt that caused the error 10 steps ago.

We need Distributed Tracing. Tracing preserves the Causal Chain. If Action 2 failed, we can walk up the tree to see that Obs 1 returned “Access Denied”, which caused Thought 2 to panic.


The Anatomy of a Trace: Chain, Span, Event

  • Trace (Run): The entire execution session. ID: run-123.
  • Span (Step): A logical unit of work.
    • Span: LLM Call (Duration: 2s, Cost: $0.01).
    • Span: Tool Exec (Duration: 500ms).
  • Event (Log): Point-in-time info inside a span. “Retrying connection”.
  • Attributes: Metadata. model="gpt-4", temperature=0.7.

Visualization:

[ Trace: "Research Quantum Physics" ]
  |-- [ Span: Planner LLM ]
  |     `-- Attributes: { input: "Research Quantum..." }
  |-- [ Chain: ReAct Loop ]
        |-- [ Span: Thought 1 ]
        |-- [ Span: Action: Search(arXiv) ]
        |-- [ Span: Obs: Result 1, Result 2... ]
        |-- [ Span: Thought 2 ]
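
The same structure can be sketched as a rough data model. This is purely illustrative (not the actual OTEL types); the field names are assumptions for the sake of the example.

use std::collections::HashMap;

/// One Run, e.g. "run-123".
struct Trace {
    id: String,
    spans: Vec<Span>,
}

/// One step: an LLM call, a tool execution, a retrieval...
struct Span {
    id: String,
    parent_id: Option<String>, // links steps into a tree
    name: String,
    duration_ms: f64,
    attributes: HashMap<String, String>, // model="gpt-4", temperature=0.7
    events: Vec<Event>,
}

/// Point-in-time info inside a span, e.g. "Retrying connection".
struct Event {
    timestamp_ms: u64,
    message: String,
}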

OpenTelemetry (OTEL) for LLMs

OTEL is the industry standard for tracing. We map LLM concepts to OTEL Spans.

  • span.kind: CLIENT (External API call).
  • llm.request.model: gpt-4-turbo.
  • llm.token_count.prompt: 150.
  • llm.token_count.completion: 50.
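
A minimal sketch of attaching these attributes via the opentelemetry API directly; the attribute keys are the conventions above, while the span name and tracer name here are arbitrary:

use opentelemetry::trace::{Span, SpanKind, Tracer};
use opentelemetry::{global, KeyValue};

fn record_llm_span() {
    let tracer = global::tracer("agent");
    let mut span = tracer
        .span_builder("llm.chat_completion")
        .with_kind(SpanKind::Client) // external API call
        .with_attributes(vec![
            KeyValue::new("llm.request.model", "gpt-4-turbo"),
            KeyValue::new("llm.token_count.prompt", 150_i64),
            KeyValue::new("llm.token_count.completion", 50_i64),
        ])
        .start(&tracer);
    // ... perform the actual HTTP call to the provider here ...
    span.end();
}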

Rust Implementation: Distributed Agent Tracing

We use the opentelemetry crate, together with tracing and tracing-opentelemetry, to instrument our Agent.

Project Structure

agent-tracing/
├── Cargo.toml
└── src/
    └── lib.rs

Cargo.toml:

[package]
name = "agent-tracing"
version = "0.1.0"
edition = "2021"

[dependencies]
opentelemetry = "0.20"
opentelemetry-otlp = "0.13"
opentelemetry-semantic-conventions = "0.13"
tracing = "0.1"
tracing-opentelemetry = "0.20"
tracing-subscriber = "0.3"
tokio = { version = "1", features = ["full", "rt-multi-thread"] }
async-openai = "0.14"

src/lib.rs:

use opentelemetry_otlp::WithExportConfig;
use tracing::{field, info, instrument, span, Level};
use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::util::SubscriberInitExt;

pub fn init_tracer(endpoint: &str) {
    // Basic OTLP Pipeline setup.
    // Exports traces via gRPC to a collector (e.g. Jaeger, Honeycomb, SigNoz).
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint(endpoint),
        )
        .install_batch(opentelemetry::runtime::Tokio)
        .expect("io error");

    // Connect `tracing` crate to OpenTelemetry
    tracing_subscriber::registry()
        .with(tracing_subscriber::EnvFilter::from_default_env())
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .try_init()
        .expect("tracing init failed");
}

pub struct Agent {
    pub model: String,
}

impl Agent {
    /// Instrument the LLM Call.
    /// Uses `tracing::instrument` macro to automatically create a span.
    #[instrument(skip(self), fields(llm.model = %self.model))]
    pub async fn llm_call(&self, prompt: &str) -> String {
        // Manually create a child span for finer granularity if needed.
        // Fields must be declared up front (as Empty) so `record` can fill them in later.
        let span = span!(
            Level::INFO,
            "llm_request",
            llm.prompt_tokens = field::Empty,
            llm.completion_tokens = field::Empty,
            llm.finish_reason = field::Empty,
        );
        let _enter = span.enter();

        // Add Attributes specific to LLMs.
        // These keys should follow OpenLLMTelemetry conventions.
        span.record("llm.prompt_tokens", &100); // In a real app, run the tokenizer here
        
        info!("Sending request to OpenAI...");
        
        // Mock API Call
        tokio::time::sleep(std::time::Duration::from_millis(500)).await;
        
        let response = "The answer is 42";
        
        // Record output tokens
        span.record("llm.completion_tokens", &5);
        span.record("llm.finish_reason", &"stop");
        
        response.to_string()
    }

    /// Instrument the Chain.
    /// This span will be the Parent of `llm_call`.
    #[instrument(skip(self))]
    pub async fn run_chain(&self, input: &str) {
        info!("Starting Chain for input: {}", input);
        
        // This call happens INSIDE the `run_chain` span context.
        // The Tracer automatically links them.
        let thought = self.llm_call(input).await;
        info!("Agent Thought: {}", thought);
        
        // Use Thought to call Tool
        let tool_span = span!(Level::INFO, "tool_execution", tool.name = "calculator");
        let _guard = tool_span.enter();
        info!("Executing Calculator...");
        // ... tool logic ...
    }
}
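
To see a trace end-to-end, a minimal driver could look like this. It is a sketch: it assumes a src/main.rs next to lib.rs and an OTLP collector (e.g. Jaeger or SigNoz) listening on localhost:4317.

// src/main.rs
use agent_tracing::{init_tracer, Agent};

#[tokio::main]
async fn main() {
    init_tracer("http://localhost:4317");

    let agent = Agent { model: "gpt-4-turbo".to_string() };
    agent.run_chain("Research Quantum Physics").await;

    // Flush spans still buffered by the batch exporter before exiting.
    opentelemetry::global::shutdown_tracer_provider();
}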

Measuring Hallucinations: The “Eval” Span

Observability isn’t just about latency; it is also about Quality. We can run an “Eval” Span asynchronously after the trace completes.

Self-Check GPT:

  1. Agent outputs trace.
  2. Observer Agent (GPT-4) reads the trace.
  3. Observer asks: “Did the Agent follow the User Instruction?”
  4. Observer outputs score: 0.8.
  5. We ingest this score as a metric linked to the trace_id.

Evaluating the Eval (Meta-Eval): How do we know the Observer is right? Cohen’s Kappa: Measure agreement between Human Labelers and LLM Labelers. If Kappa > 0.8, we trust the LLM.
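
A hedged sketch of both ideas: an async eval pass that records the Observer’s score against the original trace_id, and a Cohen’s Kappa helper for the meta-eval. Here judge_with_llm is a placeholder, not a real API.

use tracing::{info, span, Level};

/// Placeholder for the Observer Agent call (e.g. GPT-4 as a judge).
async fn judge_with_llm(_trace_text: &str) -> f64 {
    0.8
}

/// Runs after the trace finishes; emits the score as its own span,
/// tagged with the original trace_id so the backend can join them.
pub async fn eval_trace(trace_id: &str, trace_text: &str) {
    let score = judge_with_llm(trace_text).await;
    let eval_span = span!(Level::INFO, "eval.instruction_following", trace.id = trace_id, eval.score = score);
    let _guard = eval_span.enter();
    info!("eval score recorded");
}

/// Cohen's Kappa for binary labels: kappa = (p_o - p_e) / (1 - p_e),
/// where p_o is observed agreement and p_e is agreement expected by chance.
pub fn cohens_kappa(human: &[bool], llm: &[bool]) -> f64 {
    assert_eq!(human.len(), llm.len());
    let n = human.len() as f64;
    let p_o = human.iter().zip(llm).filter(|(h, l)| h == l).count() as f64 / n;
    let p_h = human.iter().filter(|&&x| x).count() as f64 / n;
    let p_l = llm.iter().filter(|&&x| x).count() as f64 / n;
    let p_e = p_h * p_l + (1.0 - p_h) * (1.0 - p_l);
    (p_o - p_e) / (1.0 - p_e)
}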


Feedback Loops: User Thumbs Up/Down

The ultimate signal is the User. When a user clicks “Thumbs Down” on the UI:

  1. Frontend sends API call POST /feedback { trace_id: "run-123", score: 0 }.
  2. Backend updates the Trace in ClickHouse with feedback_score = 0.
  3. Hard Negative Mining: We filter for traces with Score 0 to collect “Gold Negative Examples” for fine-tuning.
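
A minimal sketch of the feedback endpoint, assuming an axum backend, serde for the JSON body, and a hypothetical ClickHouse write behind it (the crate choices and the feedback_score column are assumptions, not part of the stack above):

use axum::{http::StatusCode, routing::post, Json, Router};
use serde::Deserialize;

#[derive(Deserialize)]
struct Feedback {
    trace_id: String,
    score: u8, // 1 = thumbs up, 0 = thumbs down
}

async fn submit_feedback(Json(fb): Json<Feedback>) -> StatusCode {
    // In a real backend: update feedback_score for fb.trace_id in ClickHouse,
    // or insert a separate feedback row keyed by trace_id.
    tracing::info!(trace.id = %fb.trace_id, feedback.score = fb.score as i64, "user feedback received");
    StatusCode::NO_CONTENT
}

pub fn feedback_router() -> Router {
    Router::new().route("/feedback", post(submit_feedback))
}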

Infrastructure: ClickHouse for Traces

Elasticsearch is too expensive for high-volume spans. ClickHouse (Columnar DB) is standard for Logs/Traces.

Schema:

CREATE TABLE traces (
    trace_id String,
    span_id String,
    parent_span_id String,
    name String,
    start_time DateTime64(9),
    duration_ms Float64,
    tags Map(String, String),
    prompt String,    -- Heavy data, compress with LZ4
    completion String -- Heavy data
) ENGINE = MergeTree()
ORDER BY (start_time, trace_id);

Tools:

  • LangFuse / LangSmith: SaaS wrapping ClickHouse/Postgres.
  • Arize Phoenix: Local OSS solution.

Troubleshooting: Debugging a Runaway Agent

Scenario 1: The Token Burner

  • Symptom: Bill spikes to $500/hour.
  • Observability: Group Traces by trace_id and Sum total_tokens.
  • Cause: One user triggered a loop that ran for 10,000 steps.
  • Fix: Alerting. IF sum(tokens) > 5000 AND duration < 5m THEN Kill.
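
A sketch of that kill-switch logic; sum_tokens_for_run and kill_run are hypothetical helpers standing in for a ClickHouse aggregation and your runtime's cancel API:

use std::time::Duration;

// Stub: in reality, SUM the token counts over all spans of this run in ClickHouse.
async fn sum_tokens_for_run(_run_id: &str) -> u64 { 12_000 }

// Stub: in reality, cancel the agent task / revoke its API budget.
async fn kill_run(run_id: &str) {
    tracing::warn!(run.id = run_id, "run killed");
}

pub async fn enforce_budget(run_id: &str, elapsed: Duration) {
    let tokens = sum_tokens_for_run(run_id).await;
    if tokens > 5_000 && elapsed < Duration::from_secs(300) {
        kill_run(run_id).await;
    }
}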

Scenario 2: The Lost Span

  • Symptom: “Parent Span not found”.
  • Cause: Async Rust code dropped the tracing::Context.
  • Fix: Use .in_current_span() when spawning Tokio tasks to propagate the Context.
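
For example, using the tracing::Instrument extension trait:

use tracing::Instrument;

async fn spawn_tool_task() {
    tokio::spawn(
        async {
            tracing::info!("tool call runs inside the parent span");
        }
        .in_current_span(), // without this, the spawned work loses its parent
    );
}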

Scenario 3: The Trace Explosion (Sampling)

  • Symptom: Trace ingest costs > LLM costs.
  • Cause: You are tracing every “heartbeat” or “health check”.
  • Fix: Head-Based Sampling. Decide at the root span and keep only 1% of all requests.
  • Better Fix: Tail-Based Sampling. Buffer traces in memory. If Error, send 100%. If Success, send 1%.
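
A head-based sampling sketch for the Rust SDK (assuming the same opentelemetry 0.20 / opentelemetry-otlp 0.13 pairing as above); tail-based sampling is usually configured in the collector (e.g. the OTEL Collector's tail sampling processor), not in the application:

use opentelemetry::sdk::trace::{self, Sampler};
use opentelemetry_otlp::WithExportConfig;

pub fn init_sampled_tracer(endpoint: &str) {
    let _tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(opentelemetry_otlp::new_exporter().tonic().with_endpoint(endpoint))
        .with_trace_config(
            // Keep ~1% of traces, decided at the root span and honored by children.
            trace::config().with_sampler(Sampler::ParentBased(Box::new(Sampler::TraceIdRatioBased(0.01)))),
        )
        .install_batch(opentelemetry::runtime::Tokio)
        .expect("failed to install sampled tracer");
}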

Future Trends: Standardization (OpenLLMTelemetry)

Currently, every vendor (LangChain, LlamaIndex) has its own custom trace format. OpenLLMTelemetry is a working group defining standard semantic conventions, for example:

  • Standardizing context_retrieved vs chunk_retrieved.
  • Standardizing rag.relevance_score.

MLOps Interview Questions

  1. Q: What is “High Cardinality” in tracing? A: Tags with infinite unique values (e.g., User ID, Prompt Text). Traditional metrics (Prometheus) die with high cardinality. Tracing systems (ClickHouse) handle it well.

  2. Q: How do you obscure PII in traces? A: Middleware. Regex scan every prompt and completion for SSN/CreditCards. Replace with [REDACTED] before sending to the Trace Collector.

  3. Q: Difference between “Spans” and “Attributes”? A: Span is time-bound (“Do work”). Attribute is key-value metadata attached to that work (“User=123”).

  4. Q: Why sample traces? A: Cost. Storing 100% of LLM inputs/outputs is massive (Terabytes). Sample 100% of Errors, but only 1% of Successes.

  5. Q: What is “Waterfall view”? A: A visualization where spans are shown as horizontal bars, indented by parent-child relationship. Critical for spotting serial vs parallel bottlenecks.


Glossary

  • OTEL: OpenTelemetry.
  • Span: A single unit of work (e.g., one DB query).
  • Trace: A tree of spans representing a request.
  • Cardinality: The number of unique values in a dataset.
  • Sampling: Storing only a subset of traces to save cost.

Summary Checklist

  1. Tag Everything: Tag spans with environment (prod/dev) and version (git commit).
  2. Propagate Context: Ensure traceparent headers are sent between microservices if the Agent calls external APIs.
  3. Alert on Error Rate: If > 5% of spans are status=ERROR, wake up the on-call.
  4. Monitor Latency P99: LLMs are slow. P99 Latency matters more than Average.
  5. PII Scrubbing: Automate PII removal in the collector pipeline.