# Evaluating Agents: Benchmarks and What They Actually Measure

## Why Benchmarks Matter
Building an agent is the easy part. Knowing whether it actually works — and whether it works better than the previous version — is the hard part. Benchmarks provide standardised, reproducible tests that let you compare agent performance across models, architectures, and prompt strategies without relying on vibes.
But benchmarks are not oracles. They measure specific capabilities under controlled conditions, and those conditions may not match your production workload. Understanding what each benchmark measures — and what it misses — is as important as understanding its scores.
## The Major Agent Benchmarks

### SWE-bench

**What it tests:** Real-world software engineering — resolving GitHub issues in open-source Python repositories.
Each instance in SWE-bench pairs a real issue with a gold patch (the pull request that fixed it). The agent receives the issue description and the repository codebase, and must produce a patch that makes the previously failing tests pass.
| Property | Value |
|---|---|
| Release year | 2023 |
| Task count | 2,294 (SWE-bench Verified: 500) |
| Success metric | % of issues with passing tests after agent patch |
| Difficulty | Very high — top agents score ~20–50% on Verified |
| Required capabilities | Code reading, multi-file editing, test execution |
```bash
# Install the evaluation harness
pip install swebench

# Evaluate a set of agent predictions
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Verified \
    --predictions_path ./my_agent_predictions.jsonl \
    --max_workers 4 \
    --run_id my_experiment_01
```
**Note:** SWE-bench Verified is the recommended subset. The full SWE-bench contains some issues with ambiguous or incorrect gold patches. The Verified split was manually reviewed to ensure issue quality.
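The `--predictions_path` file is JSONL, one prediction per instance. A minimal sketch of producing one, assuming the key names the harness conventionally expects (`instance_id`, `model_name_or_path`, `model_patch`); verify them against the docs for your installed version:

```python
import json

# One prediction per SWE-bench instance. The key names below follow the
# swebench harness convention; check them against your installed version.
predictions = [
    {
        "instance_id": "astropy__astropy-12907",   # a real SWE-bench instance id
        "model_name_or_path": "my-agent-v1",       # free-form label for your run
        "model_patch": "diff --git a/... (unified diff produced by the agent)",
    }
]

with open("my_agent_predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```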
### HumanEval

**What it tests:** Python function synthesis from docstrings. The model reads a function signature and docstring and must complete the implementation.
```python
# Example HumanEval problem
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """
    Check if in given list of numbers, are any two numbers closer to each
    other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # Agent must implement this
```
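For reference, one straightforward solution (our own sketch, not part of the benchmark) sorts the list so only adjacent pairs need comparing:

```python
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    # After sorting, the closest pair is always adjacent: O(n log n)
    ordered = sorted(numbers)
    return any(b - a < threshold for a, b in zip(ordered, ordered[1:]))
```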
| Property | Value |
|---|---|
| Release year | 2021 |
| Task count | 164 |
| Success metric | pass@k (probability that at least one of k samples passes all unit tests) |
| Difficulty | Moderate — frontier models score 85–95%+ |
| Limitation | Code-only; no tool use, no multi-turn interaction |
HumanEval is better suited to evaluating base-model coding ability than agentic behaviour: there is no tool use, no multi-file context, and no iterative debugging.
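When reporting pass@k from n > k samples, use the unbiased estimator from the HumanEval paper, 1 − C(n−c, k)/C(n, k), where c is the number of passing samples:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n samples of which c pass, is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 2 samples of which c = 1 passes, pass@1 comes out to 0.5, as expected.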
### GAIA

**What it tests:** General AI Assistant — questions that require browsing the web, reading files, running code, and multi-step reasoning.
GAIA problems are intentionally designed to be easy for humans but hard for AI. They require using multiple tools in the right order, maintaining context across many steps, and producing precise, verifiable answers.
**Example GAIA question (Level 2):**

```text
What is the ISBN of the first book ever mentioned in the
'Further reading' section of the Wikipedia article for the
capital city of the country where the 2020 Summer Olympics
were held? Answer with the 13-digit ISBN only.
```
| Property | Value |
|---|---|
| Difficulty levels | 1 (simple), 2 (multi-step), 3 (complex) |
| Task count | 466 |
| Success metric | Exact match of final answer |
| Top scores | Level 1: ~90%, Level 2: ~70%, Level 3: ~30% |
| Required capabilities | Web search, file parsing, code execution, multi-hop reasoning |
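Because GAIA scores by exact match, answer normalisation matters. A simplified sketch of the idea (not GAIA's official scoring script, which also handles numbers and comma-separated lists more carefully):

```python
import re

def normalize(answer: str) -> str:
    # Lowercase, trim, collapse internal whitespace, drop a trailing
    # full stop -- a simplified stand-in for GAIA's official scorer.
    text = re.sub(r"\s+", " ", answer.strip().lower())
    return text.rstrip(".")

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)
```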
### WebArena

**What it tests:** Autonomous web navigation — completing tasks on realistic websites (shopping, forum posting, coding platform, GitLab, etc.).
WebArena provides a fully sandboxed web environment with functioning web applications. The agent must navigate UIs using browser tools (click, type, scroll, navigate) to complete tasks like "find the cheapest product in the 'Keyboards' category on the shopping site and add it to the cart."
| Property | Value |
|---|---|
| Release year | 2023 |
| Task count | 812 |
| Success metric | Task success rate (functional correctness) |
| Difficulty | High — top agents score 30–50% |
| Infrastructure | Requires running local web servers (Docker) |
```bash
# WebArena setup
git clone https://github.com/web-arena-x/webarena
cd webarena
docker compose up -d   # Starts all web environments
pip install -r requirements.txt

# Run evaluation
python run.py \
    --instruction_path config_files/test_webarena.json \
    --result_dir ./results \
    --model gpt-4o \
    --action_set_tag som
```
### AgentBench

**What it tests:** A unified multi-environment benchmark covering 8 distinct agent tasks, including OS navigation, database querying, knowledge graph traversal, and digital card games.
AgentBench is valuable because it tests breadth — an agent that excels at web browsing but fails at database queries may still score well on single-environment benchmarks. The composite score reveals capability gaps.
| Environment | Task Type | Difficulty |
|---|---|---|
| OS | Shell commands, file management | Medium |
| DB | SQL query generation and execution | Medium |
| KG | SPARQL over knowledge graphs | Hard |
| WebShop | E-commerce navigation | Medium |
| Mind2Web | Web task following | Hard |
| AlfWorld | Embodied task completion | Medium |
| Card Game | Competitive game strategy | Hard |
| LTP | Lateral thinking puzzles | Hard |
## Benchmark Comparison
| Benchmark | Tasks | Tool Use | Multi-step | Real Env | Open Source |
|---|---|---|---|---|---|
| HumanEval | 164 | No | No | No | Yes |
| SWE-bench | 2,294 | Yes | Yes | Yes | Yes |
| GAIA | 466 | Yes | Yes | No | Partial |
| WebArena | 812 | Yes | Yes | Yes | Yes |
| AgentBench | ~1,800 | Yes | Yes | Mixed | Yes |
## Running a Simple Benchmark Evaluation
Here is a minimal harness for evaluating your agent on a custom or standard benchmark:
```python
import asyncio
import json
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable


@dataclass
class BenchmarkResult:
    task_id: str
    passed: bool
    agent_output: Any
    expected_output: Any
    latency_ms: float
    token_count: int
    error: str | None = None


@dataclass
class BenchmarkReport:
    benchmark_name: str
    model: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    results: list[BenchmarkResult] = field(default_factory=list)

    @property
    def pass_rate(self) -> float:
        if not self.results:
            return 0.0
        return sum(1 for r in self.results if r.passed) / len(self.results)

    @property
    def avg_latency_ms(self) -> float:
        if not self.results:
            return 0.0
        return sum(r.latency_ms for r in self.results) / len(self.results)

    def to_dict(self) -> dict:
        return {
            "benchmark": self.benchmark_name,
            "model": self.model,
            "timestamp": self.timestamp,
            "pass_rate": self.pass_rate,
            "avg_latency_ms": self.avg_latency_ms,
            "total_tasks": len(self.results),
            "passed": sum(1 for r in self.results if r.passed),
            "failed": sum(1 for r in self.results if not r.passed),
        }


async def evaluate_agent(
    agent_fn: Callable,
    dataset: list[dict],
    evaluator_fn: Callable,
    benchmark_name: str,
    model: str,
    max_concurrent: int = 4,
) -> BenchmarkReport:
    """
    Evaluate an agent against a benchmark dataset.

    Args:
        agent_fn: async function that takes a task dict and returns a dict
            with an "answer" key and an optional "token_count" key
        dataset: list of {"id": ..., "input": ..., "expected": ...} dicts
        evaluator_fn: function(agent_output, expected) -> bool
        benchmark_name: display name for this benchmark run
        model: model identifier string for reporting
        max_concurrent: number of tasks to run in parallel
    """
    report = BenchmarkReport(benchmark_name=benchmark_name, model=model)
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_single(task: dict) -> BenchmarkResult:
        async with semaphore:
            start = time.perf_counter()
            try:
                output = await agent_fn(task)
                latency_ms = (time.perf_counter() - start) * 1000
                passed = evaluator_fn(output["answer"], task["expected"])
                return BenchmarkResult(
                    task_id=task["id"],
                    passed=passed,
                    agent_output=output["answer"],
                    expected_output=task["expected"],
                    latency_ms=latency_ms,
                    token_count=output.get("token_count", 0),
                )
            except Exception as exc:
                latency_ms = (time.perf_counter() - start) * 1000
                return BenchmarkResult(
                    task_id=task["id"],
                    passed=False,
                    agent_output=None,
                    expected_output=task["expected"],
                    latency_ms=latency_ms,
                    token_count=0,
                    error=str(exc),
                )

    report.results = await asyncio.gather(*(run_single(t) for t in dataset))

    # Print a summary
    passed = sum(1 for r in report.results if r.passed)
    print(f"\n{'=' * 50}")
    print(f"Benchmark: {benchmark_name}")
    print(f"Model: {model}")
    print(f"Pass rate: {report.pass_rate:.1%} ({passed}/{len(report.results)})")
    print(f"Avg latency: {report.avg_latency_ms:.0f}ms")
    print(f"{'=' * 50}\n")

    # Save per-task results; sanitise the model name for use in a filename
    safe_model = model.replace("/", "-")
    with open(f"results_{benchmark_name}_{safe_model}.json", "w") as f:
        json.dump(
            [
                {
                    "task_id": r.task_id,
                    "passed": r.passed,
                    "latency_ms": r.latency_ms,
                    "token_count": r.token_count,
                    "error": r.error,
                }
                for r in report.results
            ],
            f,
            indent=2,
        )

    return report
```
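A minimal usage sketch, with a hypothetical `toy_agent` and exact-match evaluator (standalone here so the pass/fail logic can be sanity-checked without the full harness):

```python
import asyncio

# Hypothetical toy agent: answers by uppercasing the input.
async def toy_agent(task: dict) -> dict:
    return {"answer": task["input"].upper(), "token_count": 3}

def exact(output, expected) -> bool:
    return output == expected

dataset = [
    {"id": "t1", "input": "abc", "expected": "ABC"},    # will pass
    {"id": "t2", "input": "xyz", "expected": "WRONG"},  # will fail
]

# With the harness above in scope you would run:
#   report = asyncio.run(evaluate_agent(
#       toy_agent, dataset, exact, benchmark_name="toy", model="demo"))
#   # report.pass_rate -> 0.5

# Standalone sanity check of the same pass/fail logic:
outcomes = [exact(asyncio.run(toy_agent(t))["answer"], t["expected"]) for t in dataset]
pass_rate = sum(outcomes) / len(outcomes)
```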
## Benchmark Limitations and What They Miss
Understanding what benchmarks do not measure is equally important:
| Limitation | Explanation | Mitigation |
|---|---|---|
| Distribution shift | Tasks may not match your domain | Build custom evals for your use case |
| Gaming | Agents can overfit to benchmark patterns | Use held-out test sets; rotate benchmarks |
| Evaluation noise | LLM judges are inconsistent | Use multiple judges; prefer deterministic evals |
| Cost blindness | Pass rate ignores token cost | Track tokens-per-task alongside accuracy |
| Single-turn bias | Many benchmarks don't test multi-turn | Supplement with conversation-level evals |
| English-centric | Most benchmarks are English-only | Evaluate multilingual performance separately |
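Cost blindness in particular is cheap to fix: divide spend by solved tasks, not attempts. A sketch over the per-task records the harness above already collects (`passed`, `token_count`), assuming a blended `usd_per_1k_tokens` rate you supply:

```python
def cost_per_solved_task(results: list[dict], usd_per_1k_tokens: float) -> float:
    """Spend per *solved* task rather than per attempt. usd_per_1k_tokens
    is a blended rate you supply (an assumption here; real pricing usually
    differs for input and output tokens)."""
    solved = sum(1 for r in results if r["passed"])
    total_tokens = sum(r["token_count"] for r in results)
    if solved == 0:
        return float("inf")  # no solves: cost per solve is unbounded
    return (total_tokens / 1000) * usd_per_1k_tokens / solved
```

An agent that passes 40% of tasks at half the tokens can beat one that passes 45%; this metric makes that trade-off visible.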
## Summary
Benchmarks are the empirical foundation of agent development. SWE-bench tests realistic coding work; GAIA tests multi-modal tool use; WebArena tests real UI navigation. No single benchmark covers everything — use a combination that reflects your production use case. Build your own evaluation harness so you can run consistent, automated evaluations on every significant change, and always report alongside pass rate: cost per task, latency, and failure mode distribution.