Building a Custom Evaluation Framework

Why Custom Evals Are Non-Negotiable

Public benchmarks measure general capability. Your production agent serves a specific domain with specific expectations. A customer support agent that scores 90% on GAIA might still fail on 40% of real tickets — because GAIA doesn't contain your product's terminology, your edge cases, or your users' writing styles.

Building a custom evaluation framework is one of the highest-ROI investments you can make in agent quality. Done well, it gives you:

  • Regression detection — catch degradations before they reach production
  • Prompt comparison — A/B test prompt changes with statistical confidence
  • Model comparison — choose between providers with objective data
  • CI integration — automated quality gates on every deployment

Framework Architecture

┌──────────────────────────────────────────────────────────────────┐
│                    Eval Framework                                │
│                                                                  │
│  ┌────────────────┐   ┌───────────────┐   ┌──────────────────┐  │
│  │  Dataset       │   │  Agent Under  │   │  Scorer          │  │
│  │  (tasks +      │──▶│  Test (AUT)   │──▶│  (LLM-as-judge / │  │
│  │   expected)    │   │               │   │   deterministic) │  │
│  └────────────────┘   └───────────────┘   └────────┬─────────┘  │
│                                                     │            │
│  ┌──────────────────────────────────────────────────▼─────────┐  │
│  │                    Results Store                            │  │
│  │  (per-task scores, metadata, trajectories, run info)       │  │
│  └────────────────────────────────────────────────────────────┘  │
│                              │                                   │
│              ┌───────────────┼───────────────┐                  │
│              ▼               ▼               ▼                  │
│         [CLI Report]   [JSON Export]   [CI Pass/Fail]           │
└──────────────────────────────────────────────────────────────────┘

Step 1: Designing Your Eval Dataset

Your dataset is the foundation. Every other component is only as good as the tasks it tests.

Dataset Schema

from dataclasses import dataclass, field
from typing import Any, Literal


@dataclass
class EvalTask:
    """A single evaluation task with input, expected output, and metadata."""
    id: str
    input: str | dict                    # Task prompt or structured input
    expected: Any                        # Ground truth (string, list, dict, etc.)
    difficulty: Literal["easy", "medium", "hard"] = "medium"
    category: str = "general"           # Domain/topic tag
    tags: list[str] = field(default_factory=list)
    rubric: str | None = None           # Scoring rubric for LLM judge
    weight: float = 1.0                 # Relative importance in aggregate score


@dataclass
class EvalDataset:
    """Collection of evaluation tasks with metadata."""
    name: str
    version: str
    tasks: list[EvalTask]
    description: str = ""

    def filter_by_category(self, category: str) -> "EvalDataset":
        filtered = [t for t in self.tasks if t.category == category]
        return EvalDataset(
            name=f"{self.name}/{category}",
            version=self.version,
            tasks=filtered,
        )

    def filter_by_difficulty(self, difficulty: str) -> "EvalDataset":
        filtered = [t for t in self.tasks if t.difficulty == difficulty]
        return EvalDataset(
            name=f"{self.name}/{difficulty}",
            version=self.version,
            tasks=filtered,
        )

Dataset Creation Best Practices

Seed from production data. The best eval tasks come from real user inputs. Sample queries from your logs (with PII removed) and have domain experts annotate the expected answers.

Stratify by difficulty. Aim for roughly 30% easy, 50% medium, 20% hard. Easy tasks catch regressions quickly; hard tasks differentiate strong agents from weak ones.

Include adversarial cases. Tasks that historically caused failures, edge cases, off-topic inputs that should be declined, and ambiguous phrasing that requires clarification.

# Example eval dataset for a customer support agent
support_eval = EvalDataset(
    name="customer-support-v1",
    version="1.0.0",
    tasks=[
        EvalTask(
            id="cs_001",
            input="How do I reset my password?",
            expected="Provide clear step-by-step password reset instructions including the reset link flow",
            difficulty="easy",
            category="account",
            rubric="Response must include: (1) link to reset page, (2) expiry time of reset link, (3) contact info if link doesn't arrive",
        ),
        EvalTask(
            id="cs_002",
            input="I was charged twice for my order #A1234. I want a refund NOW",
            expected="Acknowledge urgency, look up order, initiate refund process, provide timeline",
            difficulty="medium",
            category="billing",
            rubric="Response must: (1) acknowledge frustration empathetically, (2) not promise a specific refund amount without verification, (3) provide a tracking reference",
        ),
        EvalTask(
            id="cs_adv_001",
            input="Can you give me a 50% discount? My neighbor said you do this",
            expected="Decline politely without confirming unverified claims, offer legitimate promotions",
            difficulty="hard",
            category="adversarial",
            tags=["social-engineering", "discount"],
        ),
    ],
)
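To keep the 30/50/20 stratification target honest as the suite grows, a small helper can report the difficulty mix. This is a sketch (the `difficulty_distribution` name is illustrative, not part of the framework above); pass it e.g. `[t.difficulty for t in dataset.tasks]`:

```python
from collections import Counter


def difficulty_distribution(difficulties: list[str]) -> dict[str, float]:
    """Fraction of tasks at each difficulty level, for sanity-checking
    the ~30% easy / 50% medium / 20% hard target."""
    counts = Counter(difficulties)
    total = sum(counts.values())
    return {
        level: counts.get(level, 0) / total
        for level in ("easy", "medium", "hard")
    }


dist = difficulty_distribution(["easy"] * 3 + ["medium"] * 5 + ["hard"] * 2)
# dist -> {"easy": 0.3, "medium": 0.5, "hard": 0.2}
```

Wiring an assertion like this into CI keeps the distribution from drifting as new tasks are added.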

Step 2: Automatic Scoring with LLM-as-Judge

For open-ended responses, use a powerful LLM as an evaluator. The key is a well-designed scoring prompt.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
import json


JUDGE_SYSTEM = """You are an expert evaluator for AI agent responses.
Your job is to score responses on a scale from 0 to 10 based on:

1. Accuracy (0-4 points): Is the information correct and complete?
2. Helpfulness (0-3 points): Does it actually help the user achieve their goal?
3. Format (0-2 points): Is it well-structured and appropriately concise?
4. Safety (0-1 point): Does it avoid harmful, misleading, or inappropriate content?

Output JSON only in this format:
{
  "accuracy": <0-4>,
  "helpfulness": <0-3>,
  "format": <0-2>,
  "safety": <0-1>,
  "total": <0-10>,
  "reasoning": "<1-2 sentences explaining the score>"
}
"""

JUDGE_HUMAN = """Task: {task_input}

Expected answer guidance: {expected}

Rubric (if provided): {rubric}

Agent's response: {agent_response}

Score this response."""


class LLMJudge:
    """
    LLM-as-judge scorer for open-ended agent responses.
    Uses a structured rubric to produce reproducible, interpretable scores.
    """

    def __init__(self, model: str = "gpt-4o", temperature: float = 0.0):
        self.llm = ChatOpenAI(model=model, temperature=temperature)
        self.prompt = ChatPromptTemplate.from_messages([
            ("system", JUDGE_SYSTEM),
            ("human", JUDGE_HUMAN),
        ])
        self.chain = self.prompt | self.llm

    def score(self, task: EvalTask, agent_response: str) -> dict:
        """
        Score a single agent response.
        Returns a dict with per-dimension scores and reasoning.
        """
        response = self.chain.invoke({
            "task_input": task.input
            if isinstance(task.input, str)
            else json.dumps(task.input),
            "expected": task.expected,
            "rubric": task.rubric or "No specific rubric — use general quality criteria.",
            "agent_response": agent_response,
        })
        try:
            scores = json.loads(response.content)
        except json.JSONDecodeError:
            # Fallback: extract JSON from a markdown code fence (``` or ```json)
            raw = response.content.strip()
            if "```" in raw:
                raw = raw.split("```", 1)[1].removeprefix("json")
                raw = raw.rsplit("```", 1)[0]
            scores = json.loads(raw)

        # Normalise to [0, 1]
        scores["normalised"] = scores["total"] / 10.0
        return scores

    def score_batch(
        self, tasks: list[EvalTask], responses: list[str]
    ) -> list[dict]:
        """Score a batch of responses, one per task."""
        return [self.score(task, resp) for task, resp in zip(tasks, responses)]

Tip: Always use temperature=0 for the judge model and run each score twice to check consistency. If the two scores differ by more than 2 points, treat the result as low-confidence and flag for human review.
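The double-scoring check can be wrapped once and reused. A sketch, with `score_fn` standing in for `LLMJudge.score` and the 2-point threshold from the tip above (the helper name is illustrative):

```python
from typing import Callable


def score_with_consistency(
    score_fn: Callable[..., dict], task, response: str, max_delta: int = 2
) -> dict:
    """Run the judge twice on the same response; if the totals diverge
    by more than max_delta points, mark the result low-confidence so it
    can be flagged for human review."""
    first = score_fn(task, response)
    second = score_fn(task, response)
    return {
        "scores": first,
        "repeat": second,
        "low_confidence": abs(first["total"] - second["total"]) > max_delta,
    }


# Deterministic stand-in judge for illustration:
fake_judge = lambda task, resp: {"total": 8, "normalised": 0.8}
result = score_with_consistency(fake_judge, task="reset password", response="...")
# result["low_confidence"] -> False
```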


Step 3: The Eval Runner

import asyncio
import json
import time
from pathlib import Path
from typing import Callable, Awaitable


class EvalRunner:
    """
    Orchestrates the full eval loop: run agent, score responses,
    aggregate results, detect regressions, generate report.
    """

    def __init__(
        self,
        agent_fn: Callable[[EvalTask], Awaitable[str]],
        scorer: LLMJudge,
        baseline_path: str | None = None,
        regression_threshold: float = 0.05,  # 5% drop = regression
    ):
        self.agent_fn = agent_fn
        self.scorer = scorer
        self.baseline_path = baseline_path
        self.regression_threshold = regression_threshold
        self._baseline = self._load_baseline() if baseline_path else None

    def _load_baseline(self) -> dict | None:
        path = Path(self.baseline_path)
        if not path.exists():
            return None
        with open(path) as f:
            return json.load(f)

    async def run(
        self,
        dataset: EvalDataset,
        max_concurrent: int = 5,
        save_path: str | None = None,
    ) -> dict:
        """Run the full evaluation loop."""
        semaphore = asyncio.Semaphore(max_concurrent)
        run_start = time.time()

        async def eval_one(task: EvalTask) -> dict:
            async with semaphore:
                step_start = time.monotonic()
                try:
                    response = await self.agent_fn(task)
                    latency = (time.monotonic() - step_start) * 1000
                    scores = self.scorer.score(task, response)
                    return {
                        "task_id": task.id,
                        "category": task.category,
                        "difficulty": task.difficulty,
                        "weight": task.weight,
                        "response": response,
                        "scores": scores,
                        "latency_ms": latency,
                        "error": None,
                    }
                except Exception as exc:
                    return {
                        "task_id": task.id,
                        "category": task.category,
                        "difficulty": task.difficulty,
                        "weight": task.weight,
                        "response": None,
                        "scores": {"normalised": 0.0, "total": 0, "reasoning": str(exc)},
                        "latency_ms": (time.monotonic() - step_start) * 1000,
                        "error": str(exc),
                    }

        results = await asyncio.gather(*[eval_one(t) for t in dataset.tasks])

        # Weighted aggregate score
        total_weight = sum(r["weight"] for r in results)
        weighted_score = sum(
            r["weight"] * r["scores"]["normalised"] for r in results
        ) / total_weight if total_weight > 0 else 0.0

        report = {
            "dataset": dataset.name,
            "version": dataset.version,
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "duration_s": time.time() - run_start,
            "n_tasks": len(results),
            "weighted_score": weighted_score,
            "avg_latency_ms": sum(r["latency_ms"] for r in results) / len(results),
            "by_category": self._aggregate_by("category", results),
            "by_difficulty": self._aggregate_by("difficulty", results),
            "results": results,
        }

        # Regression check
        if self._baseline:
            delta = weighted_score - self._baseline["weighted_score"]
            report["regression"] = delta < -self.regression_threshold
            report["score_delta"] = delta

        if save_path:
            with open(save_path, "w") as f:
                json.dump(report, f, indent=2, default=str)

        self._print_summary(report)
        return report

    def _aggregate_by(self, field: str, results: list[dict]) -> dict:
        groups: dict[str, list[float]] = {}
        for r in results:
            key = r[field]
            groups.setdefault(key, []).append(r["scores"]["normalised"])
        return {k: sum(v) / len(v) for k, v in groups.items()}

    def _print_summary(self, report: dict):
        print(f"\n{'='*60}")
        print(f"Eval: {report['dataset']} v{report['version']}")
        print(f"Score: {report['weighted_score']:.1%}")
        if "regression" in report:
            status = "REGRESSION" if report["regression"] else "OK"
            print(f"Regression check: {status} (delta: {report['score_delta']:+.1%})")
        print("\nBy category:")
        for cat, score in report["by_category"].items():
            print(f"  {cat:20s} {score:.1%}")
        print("\nBy difficulty:")
        for diff, score in report["by_difficulty"].items():
            print(f"  {diff:20s} {score:.1%}")
        print(f"{'='*60}\n")
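The weighted aggregation inside run can be exercised in isolation. A standalone version of the same arithmetic (mirroring, not replacing, the runner above):

```python
def weighted_score(results: list[dict]) -> float:
    """Weight-averaged normalised score over per-task results, as computed
    in EvalRunner.run: sum(w_i * s_i) / sum(w_i), or 0.0 with no weight."""
    total_weight = sum(r["weight"] for r in results)
    if total_weight == 0:
        return 0.0
    return sum(r["weight"] * r["scores"]["normalised"] for r in results) / total_weight


demo = [
    {"weight": 1.0, "scores": {"normalised": 0.9}},  # easy task
    {"weight": 3.0, "scores": {"normalised": 0.5}},  # hard task, weighted 3x
]
# weighted_score(demo) -> (1.0*0.9 + 3.0*0.5) / 4.0 = 0.6
```

Because hard tasks carry triple weight here, the aggregate lands much closer to the hard-task score than a plain mean would.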

Step 4: Statistical Significance Testing

Before declaring that a new prompt is "better," verify it with a statistical test. Small sample sizes make random variance look like improvement.

from scipy import stats
import numpy as np


def compare_runs(scores_a: list[float], scores_b: list[float]) -> dict:
    """
    Compare two eval runs using a paired t-test.
    Returns whether the difference is statistically significant.
    """
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    mean_a = np.mean(scores_a)
    mean_b = np.mean(scores_b)

    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "delta": mean_b - mean_a,
        "t_statistic": t_stat,
        "p_value": p_value,
        "significant": p_value < 0.05,
        "conclusion": (
            f"B is {'significantly ' if p_value < 0.05 else 'NOT significantly '}"
            f"{'better' if mean_b > mean_a else 'worse'} than A "
            f"(p={p_value:.3f}, delta={mean_b - mean_a:+.1%})"
        ),
    }

Note: Report an effect size such as Cohen's d alongside the p-value, especially for evals with fewer than ~30 tasks: a statistically significant difference on 20 tasks may still come with very wide confidence intervals. Aim for at least 100 tasks per category before drawing firm conclusions.
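For the paired setting above, one common effect-size formulation (sometimes written d_z) divides the mean per-task score difference by its standard deviation. A stdlib-only sketch:

```python
import statistics


def cohens_d_paired(scores_a: list[float], scores_b: list[float]) -> float:
    """Cohen's d for paired samples (d_z): mean of per-task score
    differences divided by their sample standard deviation."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


a = [0.70, 0.60, 0.80, 0.50]
b = [0.75, 0.70, 0.85, 0.60]
# diffs = [0.05, 0.10, 0.05, 0.10] -> mean 0.075, stdev ~0.029 -> d ~ 2.6
```

Cohen's conventional benchmarks treat roughly 0.2 / 0.5 / 0.8 as small / medium / large effects; reporting d next to the p-value from compare_runs makes "significant but tiny" differences easy to spot.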


Step 5: CI Integration

# .github/workflows/eval.yml
name: Agent Eval CI

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python scripts/run_eval.py \
            --dataset evals/customer_support_v1.json \
            --baseline evals/baselines/main.json \
            --output evals/results/pr_${{ github.run_id }}.json \
            --fail-on-regression

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: evals/results/

The --fail-on-regression flag causes the script to exit with code 1 if the weighted score drops more than the threshold:

# scripts/run_eval.py
if __name__ == "__main__":
    import argparse, sys, asyncio

    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset")
    parser.add_argument("--baseline")
    parser.add_argument("--output")
    parser.add_argument("--fail-on-regression", action="store_true")
    args = parser.parse_args()

    report = asyncio.run(run_evaluation(args.dataset, args.baseline, args.output))

    if args.fail_on_regression and report.get("regression"):
        print(f"EVAL FAILED: Score dropped {report['score_delta']:.1%}")
        sys.exit(1)

    print("EVAL PASSED")
    sys.exit(0)

Summary

A custom evaluation framework consists of five components: a domain-specific task dataset, an LLM-as-judge scorer with rubrics, an async eval runner with aggregation, statistical tests for significance, and CI integration for automated regression detection. Build the dataset first — everything else is tooling around it. A well-maintained eval suite that runs on every PR is the single most reliable way to ensure your agent doesn't quietly regress as you iterate on prompts and models.