Production Hardening

Observability and Tracing for Agents

Why Observability Is Different for Agents

Traditional software observability — CPU, memory, request latency — is necessary but not sufficient for AI agents. An agent can be performant by all standard metrics and still produce wrong answers, use inefficient reasoning paths, or silently degrade when a model update changes output distributions.

Agent observability requires capturing a new kind of signal: the semantic trace — the full reasoning chain, every tool call with its inputs and outputs, token counts, latency at each step, and the model's internal deliberation. Without these signals, debugging a production agent failure is guesswork.
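
At its simplest, a semantic trace is one structured record per agent step. A minimal sketch of such a record (the field names here are illustrative, not a standard schema):

```python
import json
import time
from dataclasses import dataclass, field, asdict


@dataclass
class TraceStep:
    """One step of a semantic trace: what the agent did and what it cost."""
    step_type: str       # e.g. "llm_call", "tool_call", "reasoning"
    inputs: dict
    outputs: dict
    prompt_tokens: int = 0
    completion_tokens: int = 0
    latency_ms: float = 0.0
    started_at: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self))


step = TraceStep(
    step_type="tool_call",
    inputs={"query": "SELECT count(*) FROM orders"},
    outputs={"rows": 1},
    latency_ms=42.0,
)
print(step.to_json())
```

A list of such records per run is already enough to answer "which step went wrong, and what did it see?"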


The Observability Stack for Agents

┌──────────────────────────────────────────────────────────────┐
│                  Agent Observability Stack                   │
│                                                              │
│  ┌──────────────────────┐   ┌─────────────────────────────┐  │
│  │ LangSmith            │   │ Phoenix / Arize             │  │
│  │ (LangChain native)   │   │ (model-agnostic)            │  │
│  │ - Trace UI           │   │ - Embedding drift           │  │
│  │ - Dataset mgmt       │   │ - Hallucination detection   │  │
│  │ - Eval integration   │   │ - Response quality          │  │
│  └──────────────────────┘   └─────────────────────────────┘  │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐    │
│  │                 OpenTelemetry (OTEL)                 │    │
│  │ Standard spans/traces exportable to any backend:     │    │
│  │ Jaeger, Zipkin, Datadog, Honeycomb, Grafana Tempo    │    │
│  └──────────────────────────────────────────────────────┘    │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐    │
│  │           Structured Logging (JSON lines)            │    │
│  │ Agent-specific: LDD format, trajectory capture       │    │
│  └──────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────┘

Setting Up LangSmith Tracing

LangSmith is the most integrated option for LangChain and LangGraph agents. Every chain invocation, tool call, and LLM response is automatically captured.

Installation and Configuration

pip install langsmith langchain langchain-openai

import os

# Set environment variables before any LangChain imports
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_..."     # From smith.langchain.com
os.environ["LANGCHAIN_PROJECT"] = "my-agent-prod"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"

# That's it — all LangChain/LangGraph calls are now traced automatically
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o")
# All calls to llm.invoke() are now captured in LangSmith

Adding Custom Metadata to Traces

from langsmith import traceable, Client
from langchain_core.tracers.context import tracing_v2_enabled


@traceable(name="process_user_request", tags=["production", "v2"])
def process_request(user_id: str, task: str) -> str:
    """
    Wrapping with @traceable captures this function's execution as a
    named span in LangSmith, with custom tags for filtering in the UI.
    """
    with tracing_v2_enabled(
        project_name="my-agent-prod",
        metadata={
            "user_id": user_id,
            "task_type": classify_task(task),
            "session_id": generate_session_id(),
        }
    ):
        return agent.invoke({"input": task})


# Manual trace creation for non-LangChain code
from langsmith import trace


def run_custom_tool(query: str) -> list:
    """Manually log a custom span for a non-LangChain tool."""
    with trace(
        name="custom_database_lookup",
        run_type="tool",
        inputs={"query": query},
        tags=["database", "rag"],
    ) as run:
        result = db.execute(query)
        run.end(outputs={"result": result, "row_count": len(result)})
        return result

Querying Traces Programmatically

from langsmith import Client
from datetime import datetime, timedelta, timezone

client = Client()

# Fetch recent failed runs
failed_runs = client.list_runs(
    project_name="my-agent-prod",
    error=True,
    start_time=datetime.now(timezone.utc) - timedelta(hours=24),
    limit=50,
)

for run in failed_runs:
    print(f"Run {run.id}: {run.name}")
    print(f"  Error: {run.error}")
    print(f"  Inputs: {run.inputs}")
    print(f"  Duration: {run.end_time - run.start_time}")
    print()

# Fetch runs with high latency (30+ seconds)
slow_runs = client.list_runs(
    project_name="my-agent-prod",
    filter='gt(latency, "30s")',
    limit=20,
)

# Export to DataFrame for analysis
import pandas as pd
runs_data = [
    {
        "id": str(r.id),
        "name": r.name,
        "latency_s": (r.end_time - r.start_time).total_seconds()
        if r.end_time else None,
        "total_tokens": r.total_tokens,
        "error": r.error,
        "tags": r.tags,
    }
    for r in client.list_runs(project_name="my-agent-prod", limit=500)
]
df = pd.DataFrame(runs_data)
print(df.describe())
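
Once the runs are in a DataFrame, triage aggregates are one-liners. A sketch over the same column shape as runs_data above, with sample rows inlined so it stands alone:

```python
import pandas as pd

# Sample rows in the shape produced by the runs_data export above
df = pd.DataFrame([
    {"name": "agent_run", "latency_s": 2.1, "total_tokens": 900, "error": None},
    {"name": "agent_run", "latency_s": 35.4, "total_tokens": 7200, "error": "timeout"},
    {"name": "agent_run", "latency_s": 4.8, "total_tokens": 1500, "error": None},
])

error_rate = df["error"].notna().mean()       # fraction of runs that errored
p95_latency = df["latency_s"].quantile(0.95)  # tail latency in seconds
avg_tokens = df["total_tokens"].mean()

print(f"error rate: {error_rate:.1%}, p95 latency: {p95_latency:.1f}s, "
      f"avg tokens: {avg_tokens:.0f}")
```

Tracking these three numbers over time is usually the first alerting signal that a model or prompt change has degraded the agent.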

OpenTelemetry for Agents

For teams not on LangChain, or for integrating agent traces into an existing observability platform, OpenTelemetry (OTEL) is the standard.

Setup with OTEL

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# Configure OTEL tracer
resource = Resource.create({
    "service.name": "agent-service",
    "service.version": "2.1.0",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(endpoint="http://jaeger:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent.tracer")

Creating Agent-Specific Spans

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
import json


tracer = trace.get_tracer("agent.tracer")


def traced_tool_call(tool_name: str, tool_fn, **tool_inputs) -> dict:
    """
    Execute a tool call within an OTEL span.
    Captures inputs, outputs, and errors as span attributes.
    """
    with tracer.start_as_current_span(
        f"tool.{tool_name}",
        kind=trace.SpanKind.CLIENT,
    ) as span:
        # Record tool inputs as span attributes
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.inputs", json.dumps(tool_inputs)[:1000])

        try:
            result = tool_fn(**tool_inputs)
            span.set_attribute("tool.output_length", len(str(result)))
            span.set_attribute("tool.success", True)
            span.set_status(Status(StatusCode.OK))
            return result
        except Exception as exc:
            span.set_attribute("tool.success", False)
            span.set_attribute("tool.error", str(exc))
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            span.record_exception(exc)
            raise


def traced_llm_call(model: str, prompt: str, llm_fn) -> str:
    """Record an LLM call as an OTEL span with token tracking."""
    with tracer.start_as_current_span(
        "llm.completion",
        kind=trace.SpanKind.CLIENT,
    ) as span:
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.prompt_tokens", estimate_tokens(prompt))

        response = llm_fn(prompt)

        span.set_attribute("llm.completion_tokens", estimate_tokens(response))
        span.set_attribute("llm.total_tokens",
                           estimate_tokens(prompt) + estimate_tokens(response))
        return response


class AgentTracer:
    """
    High-level tracing wrapper for an entire agent run.
    Creates a root span and attaches all child spans (tool calls, LLM calls).
    """

    def __init__(self, agent_name: str, user_id: str | None = None):
        self.agent_name = agent_name
        self.user_id = user_id

    def __enter__(self):
        self.span = tracer.start_span(
            f"agent.{self.agent_name}.run",
            kind=trace.SpanKind.SERVER,
            attributes={
                "agent.name": self.agent_name,
                "user.id": self.user_id or "anonymous",
            },
        )
        self.ctx = trace.use_span(self.span, end_on_exit=False)
        self.ctx.__enter__()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_val:
            self.span.set_status(Status(StatusCode.ERROR, str(exc_val)))
            self.span.record_exception(exc_val)
        else:
            self.span.set_status(Status(StatusCode.OK))
        self.ctx.__exit__(exc_type, exc_val, exc_tb)   # Detach span from context
        self.span.end()
        return False   # Don't suppress exceptions

    def record_step(self, step_type: str, content: str, **metadata):
        """Add a structured event to the current agent span."""
        self.span.add_event(
            step_type,
            attributes={
                "content_preview": content[:500],
                **{k: str(v) for k, v in metadata.items()},
            },
        )


# Usage
with AgentTracer("customer-support", user_id="user_123") as tracer_ctx:
    tracer_ctx.record_step("task_received", task_input)
    result = run_agent(task_input)
    tracer_ctx.record_step("task_completed", result)
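
The LLM span examples above call an estimate_tokens helper that is not defined here. A rough sketch using the common ~4-characters-per-token heuristic (a real tokenizer such as tiktoken gives exact counts):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.
    Good enough for span attributes; use a real tokenizer for billing."""
    return max(1, len(text) // 4)


print(estimate_tokens("Explain agent observability in one paragraph."))
```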

Phoenix / Arize for LLM Quality Monitoring

Phoenix (open-source, from Arize) adds semantic quality metrics on top of traces — hallucination detection, embedding drift, output toxicity.

import phoenix as px
from phoenix.trace.langchain import LangChainInstrumentor

# Launch Phoenix (runs locally on port 6006)
session = px.launch_app()

# Instrument LangChain automatically
LangChainInstrumentor().instrument()

# Now all LangChain calls appear in Phoenix UI at http://localhost:6006
# including:
#   - LLM input/output pairs
#   - Tool call traces
#   - Embedding distances for RAG retrieval
#   - Latency and token histograms

Phoenix query example to find hallucinated responses:

from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    run_evals,
)

# Pull recent traces from Phoenix
traces_df = px.active_session().get_spans_dataframe()
llm_spans = traces_df[traces_df["span_kind"] == "LLM"].copy()

# Run hallucination eval on all LLM calls
# (run_evals returns one DataFrame per evaluator)
eval_model = OpenAIModel(model="gpt-4o-mini")
[hallucination_results] = run_evals(
    dataframe=llm_spans,
    evaluators=[HallucinationEvaluator(eval_model)],
    provide_explanation=True,
)

# Flag suspicious spans
flagged = hallucination_results[hallucination_results["label"] == "hallucinated"]
print(f"Flagged {len(flagged)} potentially hallucinated responses")

Structured Logging for Agent Trajectories

For teams that prefer logs over distributed tracing, structured JSON logging combined with the LDD format provides excellent debuggability:

import logging
import json
import time
from typing import Any


class AgentStructuredLogger:
    """
    Emits structured JSON log lines for every agent action.
    Designed for ingestion by ELK, Loki, or CloudWatch.
    """

    def __init__(self, agent_id: str, run_id: str):
        self.agent_id = agent_id
        self.run_id = run_id
        self.logger = logging.getLogger("agent")
        self.step = 0

    def _emit(self, level: str, event: str, importance: int, **fields):
        self.step += 1
        log_line = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime())
                  + f".{int(time.time() * 1000) % 1000:03d}Z",
            "level": level,
            "agent_id": self.agent_id,
            "run_id": self.run_id,
            "step": self.step,
            "event": event,
            "importance": importance,
            **fields,
        }
        # Emit at the matching stdlib level so handlers can filter correctly
        self.logger.log(getattr(logging, level, logging.INFO), json.dumps(log_line))

    def tool_call(self, tool: str, inputs: dict):
        self._emit("INFO", "tool_call", 7,
                   tool=tool, inputs=inputs)

    def tool_result(self, tool: str, result: Any, latency_ms: float):
        self._emit("INFO", "tool_result", 7,
                   tool=tool,
                   result_preview=str(result)[:300],
                   latency_ms=round(latency_ms, 2))

    def tool_error(self, tool: str, error: str, latency_ms: float):
        self._emit("ERROR", "tool_error", 9,
                   tool=tool, error=error, latency_ms=round(latency_ms, 2))

    def reasoning_step(self, thought: str):
        self._emit("DEBUG", "reasoning", 4, thought=thought[:500])

    def final_answer(self, answer: str, total_tokens: int):
        self._emit("INFO", "final_answer", 9,
                   answer_preview=answer[:300],
                   total_tokens=total_tokens)

Debugging Production Agent Failures

When a production agent fails, follow this playbook:

  1. Locate the trace in LangSmith or your OTEL backend using the run ID (always log this).
  2. Find the first error span — the root cause is usually the earliest failure, not the last.
  3. Inspect tool inputs at the error step — is the input malformed? Was context lost?
  4. Check token counts — did the context window fill up and truncate crucial information?
  5. Compare against a passing trace for the same task type — what did the agent do differently?

Tip: Add a run_id to every agent response so users can report it when something goes wrong. One line in your response header — Run ID: abc-123 — saves hours of debugging by giving you an exact trace to look at.
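
Step 2 of the playbook, finding the first error rather than the last, is mechanical once spans carry start times. A standalone sketch (the span dicts are illustrative, not a specific backend's schema):

```python
from typing import Optional


def first_error_span(spans: list) -> Optional[dict]:
    """Return the earliest-starting span that recorded an error.
    Later errors are usually cascades from this one."""
    errored = [s for s in spans if s.get("error")]
    return min(errored, key=lambda s: s["start_time"]) if errored else None


spans = [
    {"name": "llm.completion", "start_time": 3.0, "error": "context overflow"},
    {"name": "tool.search", "start_time": 1.0, "error": None},
    {"name": "tool.lookup", "start_time": 2.0, "error": "malformed query"},
]
print(first_error_span(spans)["name"])  # → tool.lookup
```

Here the LLM's context-overflow error is a downstream symptom; the malformed lookup at t=2.0 is the root cause to investigate.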


Summary

Production agent observability requires three layers: LangSmith or Phoenix for semantic trace capture (LLM inputs/outputs, tool calls, reasoning steps), OpenTelemetry for integration with your existing infrastructure, and structured logging for lightweight, searchable audit trails. The goal is that any production failure should be diagnosable in under 10 minutes — you should be able to replay the exact trace, identify the first wrong decision, and understand why it happened.