Evaluation and Benchmarking

Trajectory-Level Evaluation: Judging the Reasoning Path

Beyond Final Answers

Most evaluation frameworks ask a single question: did the agent get the right answer? This is necessary but not sufficient. An agent that gets the right answer via a broken reasoning path is brittle — it succeeded by luck and will fail on slightly different inputs.

Trajectory-level evaluation examines the entire path from task to answer: every thought, every tool call, every observation. This reveals whether the agent:

  • Used the minimum number of steps (efficiency)
  • Called the right tools in the right order (precision)
  • Avoided hallucinating tool outputs (faithfulness)
  • Recovered gracefully from intermediate errors (resilience)
  • Followed expected reasoning patterns (alignment with gold trajectories)

What Is a Trajectory?

A trajectory is the complete, ordered sequence of steps an agent took during a single task execution:

from dataclasses import dataclass, field
from typing import Literal, Any
from datetime import datetime, timezone


@dataclass
class TrajectoryStep:
    """A single step in an agent's execution trajectory."""
    step_id: int
    step_type: Literal["thought", "tool_call", "observation", "answer"]
    content: str                        # The thought text, tool input, or observation
    tool_name: str | None = None        # For tool_call steps
    tool_input: dict | None = None      # Parsed tool arguments
    tool_output: Any = None             # For observation steps
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    latency_ms: float = 0.0
    tokens_used: int = 0
    error: str | None = None            # If tool call raised an exception


@dataclass
class AgentTrajectory:
    """Complete execution record for one agent task."""
    task_id: str
    task_input: str
    steps: list[TrajectoryStep] = field(default_factory=list)
    final_answer: str | None = None
    succeeded: bool = False
    total_tokens: int = 0
    total_latency_ms: float = 0.0
    metadata: dict = field(default_factory=dict)

    def tool_calls(self) -> list[TrajectoryStep]:
        return [s for s in self.steps if s.step_type == "tool_call"]

    def unique_tools_used(self) -> set[str]:
        return {s.tool_name for s in self.tool_calls() if s.tool_name}

    def error_steps(self) -> list[TrajectoryStep]:
        return [s for s in self.steps if s.error is not None]

Logging Trajectories in Practice

To evaluate trajectories, you must first capture them. Here is a LangChain callback handler that logs every agent step:

from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.outputs import LLMResult
import json
import time


class TrajectoryLogger(BaseCallbackHandler):
    """
    LangChain callback handler that captures full agent trajectories.
    Attach to an AgentExecutor to log every step automatically.
    """

    def __init__(self, task_id: str):
        self.task_id = task_id
        self.trajectory = AgentTrajectory(task_id=task_id, task_input="")
        self._step_counter = 0
        self._tool_start_time: float | None = None

    def _next_step_id(self) -> int:
        self._step_counter += 1
        return self._step_counter

    def on_agent_action(self, action, **kwargs):
        """Called when the agent decides to call a tool."""
        self._tool_start_time = time.monotonic()
        step = TrajectoryStep(
            step_id=self._next_step_id(),
            step_type="tool_call",
            content=str(action.tool_input),
            tool_name=action.tool,
            tool_input=action.tool_input
            if isinstance(action.tool_input, dict)
            else {"input": action.tool_input},
        )
        self.trajectory.steps.append(step)

    def on_tool_end(self, output: str, **kwargs):
        """Called when a tool returns its result."""
        latency = 0.0
        if self._tool_start_time is not None:
            latency = (time.monotonic() - self._tool_start_time) * 1000
            self._tool_start_time = None

        step = TrajectoryStep(
            step_id=self._next_step_id(),
            step_type="observation",
            content=output,
            tool_output=output,
            latency_ms=latency,
        )
        self.trajectory.steps.append(step)

    def on_tool_error(self, error: Exception, **kwargs):
        """Called when a tool raises an exception."""
        step = TrajectoryStep(
            step_id=self._next_step_id(),
            step_type="observation",
            content=f"ERROR: {error}",
            error=str(error),
        )
        self.trajectory.steps.append(step)

    def on_agent_finish(self, finish, **kwargs):
        """Called when the agent produces its final answer."""
        self.trajectory.final_answer = finish.return_values.get("output", "")
        step = TrajectoryStep(
            step_id=self._next_step_id(),
            step_type="answer",
            content=self.trajectory.final_answer,
        )
        self.trajectory.steps.append(step)

    def save(self, path: str):
        """Persist trajectory to JSON for later analysis."""
        import dataclasses
        with open(path, "w") as f:
            json.dump(dataclasses.asdict(self.trajectory), f, indent=2, default=str)

Usage with an AgentExecutor:

logger = TrajectoryLogger(task_id="task_001")
result = agent_executor.invoke(
    {"input": "What is the current price of AAPL?"},
    config={"callbacks": [logger]},
)
logger.save("trajectories/task_001.json")
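The saved JSON can also be inspected without reconstructing the dataclasses when you only need quick aggregate checks. A minimal sketch (the function names here are illustrative, not part of any library):

```python
import json


def load_trajectory_dict(path: str) -> dict:
    """Read a saved trajectory back as a plain dict."""
    with open(path) as f:
        return json.load(f)


def quick_stats(traj: dict) -> dict:
    """Cheap health check before running the full metric suite."""
    steps = traj.get("steps", [])
    return {
        "n_steps": len(steps),
        "n_tool_calls": sum(1 for s in steps if s["step_type"] == "tool_call"),
        "n_errors": sum(1 for s in steps if s.get("error")),
    }
```

A trajectory with many steps but few tool calls, or a high error count, is usually worth reading by hand before trusting any aggregate metric.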

Step Correctness Metrics

Step correctness measures how well each step aligns with what an expert would have done. You need a gold trajectory (produced by a human or a stronger reference agent) for comparison.

import json


def step_precision(
    predicted: AgentTrajectory,
    gold: AgentTrajectory,
) -> float:
    """
    Fraction of the agent's tool calls that appear in the gold trajectory.
    Measures: did the agent avoid unnecessary tool calls?
    """
    pred_calls = [(s.tool_name, json.dumps(s.tool_input, sort_keys=True))
                  for s in predicted.tool_calls()]
    gold_calls = {(s.tool_name, json.dumps(s.tool_input, sort_keys=True))
                  for s in gold.tool_calls()}

    if not pred_calls:
        return 1.0 if not gold_calls else 0.0

    correct = sum(1 for call in pred_calls if call in gold_calls)
    return correct / len(pred_calls)


def step_recall(
    predicted: AgentTrajectory,
    gold: AgentTrajectory,
) -> float:
    """
    Fraction of gold tool calls that the agent also made.
    Measures: did the agent gather all necessary information?
    """
    pred_calls = {(s.tool_name, json.dumps(s.tool_input, sort_keys=True))
                  for s in predicted.tool_calls()}
    gold_calls = [(s.tool_name, json.dumps(s.tool_input, sort_keys=True))
                  for s in gold.tool_calls()]

    if not gold_calls:
        return 1.0

    correct = sum(1 for call in gold_calls if call in pred_calls)
    return correct / len(gold_calls)


def step_f1(predicted: AgentTrajectory, gold: AgentTrajectory) -> float:
    """Harmonic mean of step precision and recall."""
    p = step_precision(predicted, gold)
    r = step_recall(predicted, gold)
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)
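The canonical-key matching these metrics rely on can be sanity-checked on hand-built call tuples. A small worked example with hypothetical tool names:

```python
import json


def _key(tool: str, args: dict) -> tuple[str, str]:
    # Canonicalise exactly as the metrics do: sorting keys means
    # equal arguments always serialise to the same string
    return (tool, json.dumps(args, sort_keys=True))


# Hypothetical predicted and gold tool calls for one task
pred = [
    _key("search_web", {"q": "AAPL price"}),
    _key("search_web", {"q": "AAPL price"}),   # duplicate call hurts precision
    _key("fetch_quote", {"ticker": "AAPL"}),
]
gold = {_key("fetch_quote", {"ticker": "AAPL"})}

precision = sum(1 for c in pred if c in gold) / len(pred)    # 1/3
recall = sum(1 for c in gold if c in set(pred)) / len(gold)  # 1.0
```

Note that precision counts duplicates individually, so a looping agent is penalised twice: once here and again by the redundancy metric below.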

Tool-Use Efficiency Scoring

Efficiency measures how economically the agent reached its goal. An agent that makes 10 tool calls to answer a question that requires only 2 is inefficient, even if it succeeds.

import json


def tool_efficiency_score(trajectory: AgentTrajectory, gold: AgentTrajectory) -> float:
    """
    Score in [0, 1] representing how efficiently the agent used tools.
    1.0 = same number of calls as gold; lower = more redundant calls.
    """
    agent_calls = len(trajectory.tool_calls())
    gold_calls = len(gold.tool_calls())

    if gold_calls == 0:
        return 1.0 if agent_calls == 0 else 0.0

    # Penalise extra tool calls but cap the penalty
    ratio = gold_calls / max(agent_calls, gold_calls)
    return ratio


def redundancy_rate(trajectory: AgentTrajectory) -> float:
    """
    Fraction of tool calls that are exact duplicates of a previous call.
    A non-zero value indicates the agent is looping or failing to use cached results.
    """
    seen = set()
    duplicates = 0
    for step in trajectory.tool_calls():
        key = (step.tool_name, json.dumps(step.tool_input, sort_keys=True))
        if key in seen:
            duplicates += 1
        seen.add(key)

    total = len(trajectory.tool_calls())
    return duplicates / total if total > 0 else 0.0


def error_recovery_rate(trajectory: AgentTrajectory) -> float:
    """
    Fraction of tool errors that were followed by at least one further
    non-error tool call or a final answer. This is a proxy for recovery:
    it credits the agent for continuing past a failure, not for proving
    the alternative action actually succeeded.
    """
    error_steps = trajectory.error_steps()
    if not error_steps:
        return 1.0   # No errors to recover from

    recovered = 0
    step_list = trajectory.steps
    for err_step in error_steps:
        # Any later non-error tool call or answer counts as "moving on"
        after_error = [
            s for s in step_list
            if s.step_id > err_step.step_id
            and s.step_type in ("tool_call", "answer")
            and s.error is None
        ]
        if after_error:
            recovered += 1

    return recovered / len(error_steps)
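On a toy call sequence the two scores behave as expected. A short sketch (the tool names and the gold call count are made up):

```python
import json

# Hypothetical call sequence: one exact duplicate out of three calls
calls = [
    ("search_web", {"q": "AAPL stock price"}),
    ("search_web", {"q": "AAPL stock price"}),
    ("fetch_quote", {"ticker": "AAPL"}),
]

seen: set[tuple[str, str]] = set()
duplicates = 0
for name, args in calls:
    key = (name, json.dumps(args, sort_keys=True))
    if key in seen:
        duplicates += 1
    seen.add(key)

redundancy = duplicates / len(calls)                  # 1/3
gold_call_count = 2                                   # suppose gold needs 2 calls
efficiency = gold_call_count / max(len(calls), gold_call_count)   # 2/3
```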

Trajectory Similarity Scores

When you have many trajectories, you may want to cluster them or compare pairs without needing gold trajectories. Trajectory similarity can reveal when two agents took fundamentally different approaches:

from difflib import SequenceMatcher


def trajectory_similarity(t1: AgentTrajectory, t2: AgentTrajectory) -> float:
    """
    Sequence similarity between two tool-call sequences.
    Returns 1.0 for identical tool sequences, 0.0 for completely different.
    """
    seq1 = [s.tool_name for s in t1.tool_calls()]
    seq2 = [s.tool_name for s in t2.tool_calls()]
    return SequenceMatcher(None, seq1, seq2).ratio()


def tool_distribution(trajectories: list[AgentTrajectory]) -> dict[str, float]:
    """
    Compute the frequency distribution of tool usage across a set of trajectories.
    Useful for identifying over-reliance on a single tool.
    """
    counts: dict[str, int] = {}
    total = 0
    for traj in trajectories:
        for step in traj.tool_calls():
            if step.tool_name is None:
                continue   # Skip malformed steps with no tool name
            counts[step.tool_name] = counts.get(step.tool_name, 0) + 1
            total += 1

    if total == 0:
        return {}
    return {tool: count / total for tool, count in sorted(
        counts.items(), key=lambda x: -x[1]
    )}
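For tool sequences, `SequenceMatcher.ratio()` returns 2M / (len(seq1) + len(seq2)), where M is the number of matching elements, so partially overlapping sequences score between 0 and 1. For example, with hypothetical tools:

```python
from difflib import SequenceMatcher

seq_a = ["search_web", "fetch_quote", "summarize"]
seq_b = ["search_web", "summarize"]

sim = SequenceMatcher(None, seq_a, seq_b).ratio()
# M = 2 matched tools, so ratio = 2 * 2 / (3 + 2) = 0.8
```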

Putting It Together: A Trajectory Evaluation Report

def evaluate_trajectory_batch(
    predicted_trajectories: list[AgentTrajectory],
    gold_trajectories: list[AgentTrajectory],
) -> dict:
    """
    Compute a full evaluation report for a batch of agent trajectories.
    """
    assert len(predicted_trajectories) == len(gold_trajectories)

    results = []
    for pred, gold in zip(predicted_trajectories, gold_trajectories):
        results.append({
            "task_id": pred.task_id,
            "succeeded": pred.succeeded,
            "step_precision": step_precision(pred, gold),
            "step_recall": step_recall(pred, gold),
            "step_f1": step_f1(pred, gold),
            "efficiency": tool_efficiency_score(pred, gold),
            "redundancy": redundancy_rate(pred),
            "error_recovery": error_recovery_rate(pred),
            "total_tokens": pred.total_tokens,
            "total_latency_ms": pred.total_latency_ms,
        })

    n = len(results)
    return {
        "n": n,
        "pass_rate": sum(r["succeeded"] for r in results) / n,
        "avg_step_f1": sum(r["step_f1"] for r in results) / n,
        "avg_efficiency": sum(r["efficiency"] for r in results) / n,
        "avg_redundancy": sum(r["redundancy"] for r in results) / n,
        "avg_error_recovery": sum(r["error_recovery"] for r in results) / n,
        "avg_tokens_per_task": sum(r["total_tokens"] for r in results) / n,
        "avg_latency_ms": sum(r["total_latency_ms"] for r in results) / n,
        "per_task": results,
    }
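The per_task rows make failure triage straightforward; for example, sorting failed and low-F1 tasks to the top (the values below are made up):

```python
# Hypothetical per-task rows in the shape produced by the report above
per_task = [
    {"task_id": "t1", "succeeded": True,  "step_f1": 0.9},
    {"task_id": "t2", "succeeded": False, "step_f1": 0.2},
    {"task_id": "t3", "succeeded": True,  "step_f1": 0.7},
]

# Failed tasks first, then lowest step F1: the ones to read by hand
worst_first = sorted(per_task, key=lambda r: (r["succeeded"], r["step_f1"]))
```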

Interpreting Trajectory Metrics

Metric         | High Score Means                         | Low Score Means
---------------|------------------------------------------|------------------------------------------------------
Step F1        | Agent follows the gold path closely      | Agent takes a different approach (may still be valid)
Efficiency     | Agent uses minimal tool calls            | Agent over-explores or loops
Redundancy     | Agent repeats the same call (low is better) | Agent avoids duplicate calls
Error recovery | Agent handles failures gracefully        | Agent gets stuck after the first error
Tip: High step F1 is not always desirable. If your gold trajectories are from a weaker reference agent, a high-quality model may correctly find a shorter path. Always pair trajectory metrics with final-answer correctness — a trajectory is "correct" if it leads to the right answer efficiently, regardless of whether it matches the gold path exactly.
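One way to operationalise this tip is a gating rule that requires a correct final answer plus reasonable efficiency but deliberately ignores gold-path overlap. A sketch with an illustrative threshold:

```python
def combined_pass(answer_correct: bool, efficiency: float,
                  eff_floor: float = 0.5) -> bool:
    """Pass a task only if the answer is right AND the path was not
    wildly inefficient. Step F1 is deliberately excluded: a short,
    novel path to the right answer should still pass."""
    return answer_correct and efficiency >= eff_floor
```

Step F1 then becomes a diagnostic signal for reading trajectories rather than a hard acceptance criterion.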


Summary

Trajectory evaluation gives you X-ray vision into your agent's reasoning process. By logging every tool call, measuring step precision and recall against gold trajectories, and computing efficiency scores, you can identify exactly where agents go wrong — not just that they produce a wrong final answer. Combine trajectory metrics with final-answer evaluation for a complete picture of agent quality.