
Evaluating Agents: Benchmarks and What They Actually Measure

Why Benchmarks Matter

Building an agent is the easy part. Knowing whether it actually works — and whether it works better than the previous version — is the hard part. Benchmarks provide standardised, reproducible tests that let you compare agent performance across models, architectures, and prompt strategies without relying on vibes.

But benchmarks are not oracles. They measure specific capabilities under controlled conditions, and those conditions may not match your production workload. Understanding what each benchmark measures — and what it misses — is as important as understanding its scores.


The Major Agent Benchmarks

SWE-bench

What it tests: Real-world software engineering — resolving GitHub issues in open-source Python repositories.

Each instance in SWE-bench is a real issue paired with its gold patch (the pull request that fixed it). The agent receives the issue description and the repository codebase, and must produce a patch that makes the repository's test suite pass.

| Property | Value |
| --- | --- |
| Release year | 2023 |
| Task count | 2,294 (SWE-bench Verified: 500) |
| Success metric | % of issues with passing tests after agent patch |
| Difficulty | Very high — top agents score ~20–50% on Verified |
| Required capabilities | Code reading, multi-file editing, test execution |

# Install and run SWE-bench
pip install swebench

# Evaluate a set of agent predictions
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Verified \
    --predictions_path ./my_agent_predictions.jsonl \
    --max_workers 4 \
    --run_id my_experiment_01

Note: SWE-bench Verified is the recommended subset. The full SWE-bench contains some issues with ambiguous or incorrect gold patches. The Verified split was manually reviewed to ensure issue quality.
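The `--predictions_path` file the harness consumes is JSONL: one JSON object per line, with an `instance_id`, a `model_name_or_path`, and the `model_patch` diff. A minimal sketch of writing one (the instance id and diff below are illustrative placeholders, not real benchmark data):

```python
import json

# One JSON object per line, following the SWE-bench predictions format.
# The instance_id and diff here are placeholders; real ids come from the dataset.
predictions = [
    {
        "instance_id": "astropy__astropy-12907",      # example id; check the dataset
        "model_name_or_path": "my-agent-v1",           # your agent's identifier
        "model_patch": "diff --git a/... b/...\n",     # unified diff produced by the agent
    },
]

with open("my_agent_predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```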


HumanEval

What it tests: Python function synthesis from docstrings. The model reads a function signature + docstring and must complete the implementation.

# Example HumanEval problem
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """
    Check if in given list of numbers, are any two numbers closer to each
    other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # Agent must implement this

| Property | Value |
| --- | --- |
| Release year | 2021 |
| Task count | 164 |
| Success metric | pass@k (probability that at least one of k samples passes all unit tests) |
| Difficulty | Moderate — frontier models score 85–95%+ |
| Limitation | Code-only; no tool use, no multi-turn interaction |

HumanEval is better suited for evaluating base language model coding ability than agentic behaviour. It lacks tool use, multi-file context, or iterative debugging.
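In practice, pass@k is computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c that pass, and estimate pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n total (c of them correct) passes the tests."""
    if n - c < k:
        # Fewer failures than draws: at least one success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples per problem, 3 correct -> pass@1 estimate of 0.3
print(pass_at_k(10, 3, 1))  # → 0.3
```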


GAIA

What it tests: General AI Assistant — questions that require browsing the web, reading files, running code, and multi-step reasoning.

GAIA problems are intentionally designed to be easy for humans but hard for AI. They require using multiple tools in the right order, maintaining context across many steps, and producing precise, verifiable answers.

Example GAIA Question (Level 2):
"What is the ISBN of the first book ever mentioned in the 
'Further reading' section of the Wikipedia article for the 
capital city of the country where the 2020 Summer Olympics 
were held? Answer with the 13-digit ISBN only."

| Property | Value |
| --- | --- |
| Difficulty levels | 1 (simple), 2 (multi-step), 3 (complex) |
| Task count | 466 |
| Success metric | Exact match of final answer |
| Top scores | Level 1: ~90%, Level 2: ~70%, Level 3: ~30% |
| Required capabilities | Web search, file parsing, code execution, multi-hop reasoning |
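Exact-match scoring only works if answers are normalised first (casing, trailing punctuation, thousands separators). A simplified sketch of that kind of normalised comparison; the official GAIA scorer applies more rules than this:

```python
def normalise(answer: str) -> str:
    """Simplified answer normalisation: lowercase, trim, strip a trailing
    full stop, and drop thousands separators in purely numeric answers.
    The official GAIA scorer is stricter and more complete than this."""
    text = answer.strip().lower().rstrip(".")
    if text.replace(",", "").replace(".", "", 1).isdigit():
        text = text.replace(",", "")
    return text

def exact_match(prediction: str, gold: str) -> bool:
    return normalise(prediction) == normalise(gold)

print(exact_match("9,780,123,456,789", "9780123456789"))  # True
print(exact_match(" Paris. ", "paris"))                    # True
```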

WebArena

What it tests: Autonomous web navigation — completing tasks on realistic websites (shopping, forum posting, coding platform, GitLab, etc.).

WebArena provides a fully sandboxed web environment with functioning web applications. The agent must navigate UIs using browser tools (click, type, scroll, navigate) to complete tasks like "find the cheapest product in the 'Keyboards' category on the shopping site and add it to the cart."
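The browser action space can be modelled as a small set of typed commands the agent emits each turn. A hypothetical sketch of such a trajectory; the names below are illustrative, not the exact WebArena action schema (which also supports set-of-marks element tagging via `som`):

```python
from dataclasses import dataclass

@dataclass
class BrowserAction:
    """Illustrative browser action; field names are hypothetical,
    not the exact WebArena action schema."""
    kind: str          # "click" | "type" | "scroll" | "navigate"
    target: str = ""   # element reference or URL
    text: str = ""     # text to type, if any

# A plausible trajectory for the "cheapest keyboard" task
trajectory = [
    BrowserAction("navigate", target="http://shop.local"),
    BrowserAction("click", target="link:Keyboards"),
    BrowserAction("click", target="button:Sort by price"),
    BrowserAction("click", target="button:Add to cart"),
]
print(len(trajectory))  # 4
```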

| Property | Value |
| --- | --- |
| Release year | 2023 |
| Task count | 812 |
| Success metric | Task success rate (functional correctness) |
| Difficulty | High — top agents score 30–50% |
| Infrastructure | Requires running local web servers (Docker) |

# WebArena setup
git clone https://github.com/web-arena-x/webarena
cd webarena
docker compose up -d    # Starts all web environments
pip install -r requirements.txt

# Run evaluation
python run.py \
    --instruction_path config_files/test_webarena.json \
    --result_dir ./results \
    --model gpt-4o \
    --action_set_tag som

AgentBench

What it tests: A unified multi-environment benchmark covering 8 distinct agent tasks including OS navigation, database querying, knowledge graph traversal, and digital card games.

AgentBench is valuable because it tests breadth — an agent that excels at web browsing but fails at database queries may still score well on single-environment benchmarks. The composite score reveals capability gaps.

| Environment | Task Type | Difficulty |
| --- | --- | --- |
| OS | Shell commands, file management | Medium |
| DB | SQL query generation and execution | Medium |
| KG | SPARQL over knowledge graphs | Hard |
| WebShop | E-commerce navigation | Medium |
| Mind2Web | Web task following | Hard |
| AlfWorld | Embodied task completion | Medium |
| Card Game | Competitive game strategy | Hard |
| LaTeX | Document editing | Easy–Medium |
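A composite score across environments makes those capability gaps visible at a glance. A simple unweighted macro-average sketch (the official AgentBench composite applies per-environment weighting; the scores below are illustrative):

```python
def composite_score(env_scores: dict[str, float]) -> float:
    """Unweighted macro-average across environments. The official
    AgentBench composite uses per-environment weights; a plain
    average is still enough to expose capability gaps."""
    return sum(env_scores.values()) / len(env_scores)

# Illustrative per-environment success rates, not real results
scores = {"OS": 0.42, "DB": 0.35, "KG": 0.18, "WebShop": 0.55}
print(composite_score(scores))  # → 0.375
```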

Benchmark Comparison

| Benchmark | Tasks | Tool Use | Multi-step | Real Env | Open Source |
| --- | --- | --- | --- | --- | --- |
| HumanEval | 164 | No | No | No | Yes |
| SWE-bench | 2,294 | Yes | Yes | Yes | Yes |
| GAIA | 466 | Yes | Yes | No | Partial |
| WebArena | 812 | Yes | Yes | Yes | Yes |
| AgentBench | ~1,800 | Yes | Yes | Mixed | Yes |

Running a Simple Benchmark Evaluation

Here is a minimal harness for evaluating your agent on a custom or standard benchmark:

import json
import asyncio
from dataclasses import dataclass, field
from typing import Callable, Any
from datetime import datetime, timezone


@dataclass
class BenchmarkResult:
    task_id: str
    passed: bool
    agent_output: Any
    expected_output: Any
    latency_ms: float
    token_count: int
    error: str | None = None


@dataclass
class BenchmarkReport:
    benchmark_name: str
    model: str
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    results: list[BenchmarkResult] = field(default_factory=list)

    @property
    def pass_rate(self) -> float:
        if not self.results:
            return 0.0
        return sum(1 for r in self.results if r.passed) / len(self.results)

    @property
    def avg_latency_ms(self) -> float:
        if not self.results:
            return 0.0
        return sum(r.latency_ms for r in self.results) / len(self.results)

    def to_dict(self) -> dict:
        return {
            "benchmark": self.benchmark_name,
            "model": self.model,
            "timestamp": self.timestamp,
            "pass_rate": self.pass_rate,
            "avg_latency_ms": self.avg_latency_ms,
            "total_tasks": len(self.results),
            "passed": sum(1 for r in self.results if r.passed),
            "failed": sum(1 for r in self.results if not r.passed),
        }


async def evaluate_agent(
    agent_fn: Callable,
    dataset: list[dict],
    evaluator_fn: Callable,
    benchmark_name: str,
    model: str,
    max_concurrent: int = 4,
) -> BenchmarkReport:
    """
    Evaluate an agent against a benchmark dataset.

    Args:
        agent_fn: async function that takes a task dict and returns an answer
        dataset: list of {"id": ..., "input": ..., "expected": ...} dicts
        evaluator_fn: function(agent_output, expected) -> bool
        benchmark_name: display name for this benchmark run
        model: model identifier string for reporting
        max_concurrent: number of tasks to run in parallel
    """
    report = BenchmarkReport(benchmark_name=benchmark_name, model=model)
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_single(task: dict) -> BenchmarkResult:
        async with semaphore:
            loop = asyncio.get_running_loop()
            start = loop.time()
            try:
                output = await agent_fn(task)
                latency = (loop.time() - start) * 1000
                passed = evaluator_fn(output["answer"], task["expected"])
                return BenchmarkResult(
                    task_id=task["id"],
                    passed=passed,
                    agent_output=output["answer"],
                    expected_output=task["expected"],
                    latency_ms=latency,
                    token_count=output.get("token_count", 0),
                )
            except Exception as exc:
                latency = (loop.time() - start) * 1000
                return BenchmarkResult(
                    task_id=task["id"],
                    passed=False,
                    agent_output=None,
                    expected_output=task["expected"],
                    latency_ms=latency,
                    token_count=0,
                    error=str(exc),
                )

    tasks = [run_single(t) for t in dataset]
    report.results = await asyncio.gather(*tasks)

    # Print progress report
    print(f"\n{'='*50}")
    print(f"Benchmark: {benchmark_name}")
    print(f"Model:     {model}")
    print(f"Pass rate: {report.pass_rate:.1%} ({sum(1 for r in report.results if r.passed)}/{len(report.results)})")
    print(f"Avg latency: {report.avg_latency_ms:.0f}ms")
    print(f"{'='*50}\n")

    # Save results (sanitise the model name so it is filesystem-safe)
    with open(f"results_{benchmark_name}_{model.replace('/', '_')}.json", "w") as f:
        json.dump([
            {
                "task_id": r.task_id,
                "passed": r.passed,
                "latency_ms": r.latency_ms,
                "token_count": r.token_count,
                "error": r.error,
            }
            for r in report.results
        ], f, indent=2)

    return report
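The `evaluator_fn` parameter is where determinism lives: prefer exact comparison over LLM judging wherever the task allows it. Two evaluators you might plug in; these are hypothetical helpers written for this harness, not part of any benchmark's official scorer:

```python
def exact_match(agent_output, expected) -> bool:
    """Strict string equality after trimming whitespace."""
    return str(agent_output).strip() == str(expected).strip()

def numeric_match(tolerance: float = 1e-6):
    """Evaluator factory for numeric answers compared within a tolerance."""
    def evaluate(agent_output, expected) -> bool:
        try:
            return abs(float(agent_output) - float(expected)) <= tolerance
        except (TypeError, ValueError):
            return False
    return evaluate

print(exact_match("  42 ", "42"))            # True
print(numeric_match(0.01)("3.1415", 3.14))   # True
```

Either function can be passed directly as `evaluator_fn` to `evaluate_agent`; the factory pattern for `numeric_match` lets you fix the tolerance per benchmark run.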

Benchmark Limitations and What They Miss

Understanding what benchmarks do not measure is equally important:

| Limitation | Explanation | Mitigation |
| --- | --- | --- |
| Distribution shift | Tasks may not match your domain | Build custom evals for your use case |
| Gaming | Agents can overfit to benchmark patterns | Use held-out test sets; rotate benchmarks |
| Evaluation noise | LLM judges are inconsistent | Use multiple judges; prefer deterministic evals |
| Cost blindness | Pass rate ignores token cost | Track tokens-per-task alongside accuracy |
| Single-turn bias | Many benchmarks don't test multi-turn | Supplement with conversation-level evals |
| English-centric | Most benchmarks are English-only | Evaluate multilingual performance separately |
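Cost blindness in particular is cheap to fix: aggregate token counts next to pass rate in every report. A minimal sketch, assuming per-result dicts like those the harness above serialises and a hypothetical blended token price (substitute your provider's real rates):

```python
def cost_report(results: list[dict], price_per_1k_tokens: float) -> dict:
    """Aggregate accuracy and token cost for a benchmark run.
    `price_per_1k_tokens` is a hypothetical blended price, not a
    real provider rate; plug in your actual pricing."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    tokens = sum(r["token_count"] for r in results)
    cost = tokens / 1000 * price_per_1k_tokens
    return {
        "pass_rate": passed / total,
        "tokens_per_task": tokens / total,
        "cost_per_task": cost / total,
        "cost_per_solved_task": cost / passed if passed else float("inf"),
    }

report = cost_report(
    [  # illustrative run data
        {"passed": True, "token_count": 12_000},
        {"passed": False, "token_count": 30_000},
    ],
    price_per_1k_tokens=0.01,
)
print(report["cost_per_task"])  # → 0.21
```

Cost per *solved* task is often the more honest number: an agent that retries aggressively can raise pass rate while quietly doubling spend.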

Summary

Benchmarks are the empirical foundation of agent development. SWE-bench tests realistic coding work; GAIA tests multi-modal tool use; WebArena tests real UI navigation. No single benchmark covers everything — use a combination that reflects your production use case. Build your own evaluation harness so you can run consistent, automated evaluations on every significant change, and always report alongside pass rate: cost per task, latency, and failure mode distribution.