
Evaluating Agents: Benchmarks and What They Actually Measure

Why Benchmarks Matter

Building an agent is the easy part. Knowing whether it actually works — and whether it works better than the previous version — is the hard part. Benchmarks provide standardised, reproducible tests that let you compare agent performance across models, architectures, and prompt strategies without relying on vibes.

But benchmarks are not oracles. They measure specific capabilities under controlled conditions, and those conditions may not match your production workload. Understanding what each benchmark measures — and what it misses — is as important as understanding its scores.


The Major Agent Benchmarks

SWE-bench

What it tests: Real-world software engineering — resolving GitHub issues in open-source Python repositories.

Each instance in SWE-bench is a real issue paired with its gold patch (the pull request that fixed it). The agent receives the issue description and the repository codebase, and must produce a patch that makes the repository's test suite pass.

| Property | Value |
| --- | --- |
| Release year | 2023 |
| Task count | 2,294 (SWE-bench Verified: 500) |
| Success metric | % of issues with passing tests after agent patch |
| Difficulty | Very high — top agents score ~20–50% on Verified |
| Required capabilities | Code reading, multi-file editing, test execution |

# Install and run SWE-bench
pip install swebench

# Evaluate a set of agent predictions
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Verified \
    --predictions_path ./my_agent_predictions.jsonl \
    --max_workers 4 \
    --run_id my_experiment_01

Note: SWE-bench Verified is the recommended subset. The full SWE-bench contains some issues with ambiguous or incorrect gold patches. The Verified split was manually reviewed to ensure issue quality.
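The `--predictions_path` file the harness consumes is JSONL: one JSON object per line, with an `instance_id`, a `model_name_or_path`, and the `model_patch` diff. A minimal sketch of writing one (the instance id and diff below are illustrative placeholders, not real benchmark data):

```python
import json

# One JSON object per line, following the SWE-bench predictions format.
# The instance_id and diff here are placeholders; real ids come from the dataset.
predictions = [
    {
        "instance_id": "astropy__astropy-12907",      # example id; check the dataset
        "model_name_or_path": "my-agent-v1",           # your agent's identifier
        "model_patch": "diff --git a/... b/...\n",     # unified diff produced by the agent
    },
]

with open("my_agent_predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```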


HumanEval

What it tests: Python function synthesis from docstrings. The model reads a function signature + docstring and must complete the implementation.

# Example HumanEval problem
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """
    Check if in given list of numbers, are any two numbers closer to each
    other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # Agent must implement this

| Property | Value |
| --- | --- |
| Release year | 2021 |
| Task count | 164 |
| Success metric | pass@k (probability that at least one of k samples passes all unit tests) |
| Difficulty | Moderate — frontier models score 85–95%+ |
| Limitation | Code-only; no tool use, no multi-turn interaction |

HumanEval is better suited for evaluating base language model coding ability than agentic behaviour. It lacks tool use, multi-file context, or iterative debugging.
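In practice, pass@k is computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c that pass, and estimate pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n total (c of them correct) passes the tests."""
    if n - c < k:
        # Fewer failures than draws: at least one success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples per problem, 3 correct -> pass@1 estimate of 0.3
print(pass_at_k(10, 3, 1))  # → 0.3
```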


GAIA

What it tests: General AI Assistant — questions that require browsing the web, reading files, running code, and multi-step reasoning.

GAIA problems are intentionally designed to be easy for humans but hard for AI. They require using multiple tools in the right order, maintaining context across many steps, and producing precise, verifiable answers.

Example GAIA Question (Level 2):
"What is the ISBN of the first book ever mentioned in the 
'Further reading' section of the Wikipedia article for the 
capital city of the country where the 2020 Summer Olympics 
were held? Answer with the 13-digit ISBN only."

| Property | Value |
| --- | --- |
| Difficulty levels | 1 (simple), 2 (multi-step), 3 (complex) |
| Task count | 466 |
| Success metric | Exact match of final answer |
| Top scores | Level 1: ~90%, Level 2: ~70%, Level 3: ~30% |
| Required capabilities | Web search, file parsing, code execution, multi-hop reasoning |
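Exact-match scoring only works if answers are normalised first (casing, trailing punctuation, thousands separators). A simplified sketch of that kind of normalised comparison; the official GAIA scorer applies more rules than this:

```python
def normalise(answer: str) -> str:
    """Simplified answer normalisation: lowercase, trim, strip a trailing
    full stop, and drop thousands separators in purely numeric answers.
    The official GAIA scorer is stricter and more complete than this."""
    text = answer.strip().lower().rstrip(".")
    if text.replace(",", "").replace(".", "", 1).isdigit():
        text = text.replace(",", "")
    return text

def exact_match(prediction: str, gold: str) -> bool:
    return normalise(prediction) == normalise(gold)

print(exact_match("9,780,123,456,789", "9780123456789"))  # True
print(exact_match(" Paris. ", "paris"))                    # True
```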

WebArena

What it tests: Autonomous web navigation — completing tasks on realistic websites (shopping, forum posting, coding platform, GitLab, etc.).

WebArena provides a fully sandboxed web environment with functioning web applications. The agent must navigate UIs using browser tools (click, type, scroll, navigate) to complete tasks like "find the cheapest product in the 'Keyboards' category on the shopping site and add it to the cart."
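The browser action space can be modelled as a small set of typed commands the agent emits each turn. A hypothetical sketch of such a trajectory; the names below are illustrative, not the exact WebArena action schema (which also supports set-of-marks element tagging via `som`):

```python
from dataclasses import dataclass

@dataclass
class BrowserAction:
    """Illustrative browser action; field names are hypothetical,
    not the exact WebArena action schema."""
    kind: str          # "click" | "type" | "scroll" | "navigate"
    target: str = ""   # element reference or URL
    text: str = ""     # text to type, if any

# A plausible trajectory for the "cheapest keyboard" task
trajectory = [
    BrowserAction("navigate", target="http://shop.local"),
    BrowserAction("click", target="link:Keyboards"),
    BrowserAction("click", target="button:Sort by price"),
    BrowserAction("click", target="button:Add to cart"),
]
print(len(trajectory))  # 4
```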

| Property | Value |
| --- | --- |
| Release year | 2023 |
| Task count | 812 |
| Success metric | Task success rate (functional correctness) |
| Difficulty | High — top agents score 30–50% |
| Infrastructure | Requires running local web servers (Docker) |

# WebArena setup
git clone https://github.com/web-arena-x/webarena
cd webarena
docker compose up -d    # Starts all web environments
pip install -r requirements.txt

# Run evaluation
python run.py \
    --instruction_path config_files/test_webarena.json \
    --result_dir ./results \
    --model gpt-4o \
    --action_set_tag som

AgentBench

What it tests: A unified multi-environment benchmark covering 8 distinct agent tasks including OS navigation, database querying, knowledge graph traversal, and digital card games.

AgentBench is valuable because it tests breadth — an agent that excels at web browsing but fails at database queries may still score well on single-environment benchmarks. The composite score reveals capability gaps.

| Environment | Task Type | Difficulty |
| --- | --- | --- |
| OS | Shell commands, file management | Medium |
| DB | SQL query generation and execution | Medium |
| KG | SPARQL over knowledge graphs | Hard |
| WebShop | E-commerce navigation | Medium |
| Mind2Web | Web task following | Hard |
| AlfWorld | Embodied task completion | Medium |
| Card Game | Competitive game strategy | Hard |
| LaTeX | Document editing | Easy–Medium |
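A composite score across environments makes those capability gaps visible at a glance. A simple unweighted macro-average sketch (the official AgentBench composite applies per-environment weighting; the scores below are illustrative):

```python
def composite_score(env_scores: dict[str, float]) -> float:
    """Unweighted macro-average across environments. The official
    AgentBench composite uses per-environment weights; a plain
    average is still enough to expose capability gaps."""
    return sum(env_scores.values()) / len(env_scores)

# Illustrative per-environment success rates, not real results
scores = {"OS": 0.42, "DB": 0.35, "KG": 0.18, "WebShop": 0.55}
print(composite_score(scores))  # → 0.375
```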

Benchmark Comparison

| Benchmark | Tasks | Tool Use | Multi-step | Real Env | Open Source |
| --- | --- | --- | --- | --- | --- |
| HumanEval | 164 | No | No | No | Yes |
| SWE-bench | 2,294 | Yes | Yes | Yes | Yes |
| GAIA | 466 | Yes | Yes | No | Partial |
| WebArena | 812 | Yes | Yes | Yes | Yes |
| AgentBench | ~1,800 | Yes | Yes | Mixed | Yes |

Running a Simple Benchmark Evaluation

Here is a minimal harness for evaluating your agent on a custom or standard benchmark:

import json
import asyncio
from dataclasses import dataclass, field
from typing import Callable, Any
from datetime import datetime, timezone


@dataclass
class BenchmarkResult:
    task_id: str
    passed: bool
    agent_output: Any
    expected_output: Any
    latency_ms: float
    token_count: int
    error: str | None = None


@dataclass
class BenchmarkReport:
    benchmark_name: str
    model: str
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    results: list[BenchmarkResult] = field(default_factory=list)

    @property
    def pass_rate(self) -> float:
        if not self.results:
            return 0.0
        return sum(1 for r in self.results if r.passed) / len(self.results)

    @property
    def avg_latency_ms(self) -> float:
        if not self.results:
            return 0.0
        return sum(r.latency_ms for r in self.results) / len(self.results)

    def to_dict(self) -> dict:
        return {
            "benchmark": self.benchmark_name,
            "model": self.model,
            "timestamp": self.timestamp,
            "pass_rate": self.pass_rate,
            "avg_latency_ms": self.avg_latency_ms,
            "total_tasks": len(self.results),
            "passed": sum(1 for r in self.results if r.passed),
            "failed": sum(1 for r in self.results if not r.passed),
        }


async def evaluate_agent(
    agent_fn: Callable,
    dataset: list[dict],
    evaluator_fn: Callable,
    benchmark_name: str,
    model: str,
    max_concurrent: int = 4,
) -> BenchmarkReport:
    """
    Evaluate an agent against a benchmark dataset.

    Args:
        agent_fn: async function that takes a task dict and returns an answer
        dataset: list of {"id": ..., "input": ..., "expected": ...} dicts
        evaluator_fn: function(agent_output, expected) -> bool
        benchmark_name: display name for this benchmark run
        model: model identifier string for reporting
        max_concurrent: number of tasks to run in parallel
    """
    report = BenchmarkReport(benchmark_name=benchmark_name, model=model)
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_single(task: dict) -> BenchmarkResult:
        async with semaphore:
            loop = asyncio.get_running_loop()
            start = loop.time()
            try:
                output = await agent_fn(task)
                latency = (loop.time() - start) * 1000
                passed = evaluator_fn(output["answer"], task["expected"])
                return BenchmarkResult(
                    task_id=task["id"],
                    passed=passed,
                    agent_output=output["answer"],
                    expected_output=task["expected"],
                    latency_ms=latency,
                    token_count=output.get("token_count", 0),
                )
            except Exception as exc:
                latency = (loop.time() - start) * 1000
                return BenchmarkResult(
                    task_id=task["id"],
                    passed=False,
                    agent_output=None,
                    expected_output=task["expected"],
                    latency_ms=latency,
                    token_count=0,
                    error=str(exc),
                )

    tasks = [run_single(t) for t in dataset]
    report.results = await asyncio.gather(*tasks)

    # Print progress report
    print(f"\n{'='*50}")
    print(f"Benchmark: {benchmark_name}")
    print(f"Model:     {model}")
    print(f"Pass rate: {report.pass_rate:.1%} ({sum(1 for r in report.results if r.passed)}/{len(report.results)})")
    print(f"Avg latency: {report.avg_latency_ms:.0f}ms")
    print(f"{'='*50}\n")

    # Save results (sanitise the model name so it is filesystem-safe)
    with open(f"results_{benchmark_name}_{model.replace('/', '_')}.json", "w") as f:
        json.dump([
            {
                "task_id": r.task_id,
                "passed": r.passed,
                "latency_ms": r.latency_ms,
                "token_count": r.token_count,
                "error": r.error,
            }
            for r in report.results
        ], f, indent=2)

    return report
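The `evaluator_fn` parameter is where determinism lives: prefer exact comparison over LLM judging wherever the task allows it. Two evaluators you might plug in; these are hypothetical helpers written for this harness, not part of any benchmark's official scorer:

```python
def exact_match(agent_output, expected) -> bool:
    """Strict string equality after trimming whitespace."""
    return str(agent_output).strip() == str(expected).strip()

def numeric_match(tolerance: float = 1e-6):
    """Evaluator factory for numeric answers compared within a tolerance."""
    def evaluate(agent_output, expected) -> bool:
        try:
            return abs(float(agent_output) - float(expected)) <= tolerance
        except (TypeError, ValueError):
            return False
    return evaluate

print(exact_match("  42 ", "42"))            # True
print(numeric_match(0.01)("3.1415", 3.14))   # True
```

Either function can be passed directly as `evaluator_fn` to `evaluate_agent`; the factory pattern for `numeric_match` lets you fix the tolerance per benchmark run.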

Benchmark Limitations and What They Miss

Understanding what benchmarks do not measure is equally important:

| Limitation | Explanation | Mitigation |
| --- | --- | --- |
| Distribution shift | Tasks may not match your domain | Build custom evals for your use case |
| Gaming | Agents can overfit to benchmark patterns | Use held-out test sets; rotate benchmarks |
| Evaluation noise | LLM judges are inconsistent | Use multiple judges; prefer deterministic evals |
| Cost blindness | Pass rate ignores token cost | Track tokens-per-task alongside accuracy |
| Single-turn bias | Many benchmarks don't test multi-turn | Supplement with conversation-level evals |
| English-centric | Most benchmarks are English-only | Evaluate multilingual performance separately |
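Cost blindness in particular is cheap to fix: aggregate token counts next to pass rate in every report. A minimal sketch, assuming per-result dicts like those the harness above serialises and a hypothetical blended token price (substitute your provider's real rates):

```python
def cost_report(results: list[dict], price_per_1k_tokens: float) -> dict:
    """Aggregate accuracy and token cost for a benchmark run.
    `price_per_1k_tokens` is a hypothetical blended price, not a
    real provider rate; plug in your actual pricing."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    tokens = sum(r["token_count"] for r in results)
    cost = tokens / 1000 * price_per_1k_tokens
    return {
        "pass_rate": passed / total,
        "tokens_per_task": tokens / total,
        "cost_per_task": cost / total,
        "cost_per_solved_task": cost / passed if passed else float("inf"),
    }

report = cost_report(
    [  # illustrative run data
        {"passed": True, "token_count": 12_000},
        {"passed": False, "token_count": 30_000},
    ],
    price_per_1k_tokens=0.01,
)
print(report["cost_per_task"])  # → 0.21
```

Cost per *solved* task is often the more honest number: an agent that retries aggressively can raise pass rate while quietly doubling spend.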

Summary

Benchmarks are the empirical foundation of agent development. SWE-bench tests realistic coding work; GAIA tests multi-modal tool use; WebArena tests real UI navigation. No single benchmark covers everything — use a combination that reflects your production use case. Build your own evaluation harness so you can run consistent, automated evaluations on every significant change, and always report alongside pass rate: cost per task, latency, and failure mode distribution.