Automated Evaluation Frameworks
Manually reviewing LLM outputs doesn't scale. Once you have more than a few dozen test cases, you need automated scoring. This lesson covers LLM-as-judge evaluation — using a second LLM to score the first — along with established frameworks like Promptfoo and LangSmith.
LLM-as-Judge
The most flexible approach to automated evaluation is using an LLM to assess the quality of another LLM's output. The judge LLM reads the original prompt, the response, and a rubric, then produces a numeric score or categorical rating.
```python
from openai import OpenAI
from pydantic import BaseModel
from typing import Literal

client = OpenAI()

class JudgeVerdict(BaseModel):
    score: int  # 1-5
    reasoning: str
    specific_issues: list[str]
    verdict: Literal["pass", "fail"]

def llm_judge(
    original_question: str,
    ai_response: str,
    rubric: str,
    passing_score: int = 4,
) -> JudgeVerdict:
    """
    Use GPT-4o (a stronger model) to judge outputs from GPT-4o-mini.
    Using a more capable model as judge improves reliability.
    Always use a model at least as capable as the one being evaluated.
    """
    judge_prompt = f"""You are an impartial evaluator. Score the AI response on a 1-5 scale.

RUBRIC:
{rubric}

SCORING:
5 = Excellent: Fully satisfies the rubric with no meaningful issues
4 = Good: Satisfies the rubric with minor issues
3 = Acceptable: Partially satisfies the rubric with notable gaps
2 = Poor: Fails to satisfy most rubric criteria
1 = Unacceptable: Completely fails or is harmful

ORIGINAL QUESTION:
{original_question}

AI RESPONSE TO EVALUATE:
{ai_response}

Evaluate the response against the rubric and provide your verdict."""

    completion = client.beta.chat.completions.parse(
        model="gpt-4o",  # Stronger judge than the model being evaluated
        messages=[
            {"role": "system", "content": "You are an objective evaluator. Be critical and precise."},
            {"role": "user", "content": judge_prompt},
        ],
        response_format=JudgeVerdict,
    )
    verdict = completion.choices[0].message.parsed
    # Derive pass/fail deterministically from the numeric score rather
    # than trusting the judge's own verdict field
    verdict.verdict = "pass" if verdict.score >= passing_score else "fail"
    return verdict
```
````python
# Example usage: evaluating a technical explanation
rubric = """
ACCURACY (2 points): Technical information must be correct. No hallucinations.
CLARITY (1 point): Explanation should be understandable to a junior developer.
COMPLETENESS (1 point): Must address all aspects of the question.
CODE QUALITY (1 point): If code is provided, it must be syntactically correct and idiomatic.
"""

result = llm_judge(
    original_question="Explain Python decorators with an example.",
    ai_response="""Decorators in Python are functions that modify other functions...

```python
def my_decorator(func):
    def wrapper(*args, **kwargs):
        print("Before")
        result = func(*args, **kwargs)
        print("After")
        return result
    return wrapper

@my_decorator
def greet(name):
    print(f"Hello, {name}")

greet("Alice")
```""",
    rubric=rubric,
)

print(f"Score: {result.score}/5 ({result.verdict})")
print(f"Reasoning: {result.reasoning}")
````
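A single verdict is rarely actionable on its own; in practice you run the judge over a whole test set and aggregate the results. A minimal sketch of that aggregation step, using a lightweight dataclass as a stand-in for the `JudgeVerdict` model above (an assumption for illustration, not the Pydantic class itself):

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    # Lightweight stand-in for the JudgeVerdict model above
    score: int    # 1-5
    verdict: str  # "pass" or "fail"

def summarize(verdicts: list[Verdict]) -> dict:
    """Aggregate per-example judge verdicts into suite-level metrics."""
    total = len(verdicts)
    passed = sum(1 for v in verdicts if v.verdict == "pass")
    return {
        "pass_rate": passed / total,
        "mean_score": sum(v.score for v in verdicts) / total,
        "worst_score": min(v.score for v in verdicts),
    }

stats = summarize([Verdict(5, "pass"), Verdict(4, "pass"),
                   Verdict(2, "fail"), Verdict(5, "pass")])
print(stats)  # {'pass_rate': 0.75, 'mean_score': 4.0, 'worst_score': 2}
```

Tracking the worst score alongside the pass rate matters: a prompt change can keep the average flat while introducing a single catastrophic failure.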
Known LLM-Judge Biases
Be aware of these biases when designing judge prompts:
| Bias | Description | Mitigation |
|---|---|---|
| Position bias | Prefers first option in A/B tests | Randomize order, run both orderings |
| Verbosity bias | Prefers longer responses | Include length as an explicit rubric dimension |
| Self-bias | Judge prefers outputs from its own model family | Use a different model family as judge |
| Sycophancy | Judge agrees with whatever is stated | Use explicit rubrics, not open-ended scoring |
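For pairwise (A/B) comparisons, the position-bias mitigation in the table can be made mechanical: call the judge twice with the candidates in both orders and only accept a winner when the two calls agree. A sketch of that wrapper, where `judge` is a placeholder for any pairwise-judge callable returning which position won:

```python
from typing import Callable, Literal

Winner = Literal["first", "second"]

def debiased_compare(
    response_a: str,
    response_b: str,
    judge: Callable[[str, str], Winner],
) -> str:
    """Run a pairwise judge in both orderings to cancel position bias.

    Returns "a", "b", or "tie". A "tie" means the judge contradicted
    itself across orderings, which signals position bias rather than
    a real quality difference.
    """
    first_pass = judge(response_a, response_b)   # A shown first
    second_pass = judge(response_b, response_a)  # B shown first
    if first_pass == "first" and second_pass == "second":
        return "a"  # A won from both positions
    if first_pass == "second" and second_pass == "first":
        return "b"  # B won from both positions
    return "tie"

# A toy judge that always prefers whichever response it sees first
# gets neutralized to a tie:
biased_judge = lambda x, y: "first"
print(debiased_compare("resp A", "resp B", biased_judge))  # tie
```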
Promptfoo: Open-Source Eval Framework
Promptfoo is a popular open-source tool for systematic prompt evaluation:
```shell
npm install -g promptfoo
# or run it without a global install:
npx promptfoo@latest
```
```yaml
# promptfooconfig.yaml
description: "Sentiment classifier evaluation"

providers:
  - id: openai:gpt-4o-mini
    config:
      temperature: 0

prompts:
  - "Classify the sentiment of this text as 'positive', 'negative', 'neutral', or 'mixed'.\n\nText: {{input}}"

tests:
  - vars:
      input: "I absolutely love this product!"
    assert:
      - type: equals
        value: "positive"
      - type: javascript
        value: "output.length < 20"
  - vars:
      input: "Terrible quality, waste of money."
    assert:
      - type: equals
        value: "negative"
  - vars:
      input: "Great features but terrible support."
    assert:
      - type: equals
        value: "mixed"
      - type: llm-rubric
        value: "Response correctly identifies both positive (features) and negative (support) aspects"
```
```shell
promptfoo eval
promptfoo view  # Opens web UI with results
```
LangSmith: Production Eval Pipeline
For LangChain-based applications, LangSmith provides integrated tracing and evaluation:
```python
from langsmith import Client
from langsmith.evaluation import evaluate

langsmith_client = Client()

# Create a dataset
dataset = langsmith_client.create_dataset("sentiment-eval-v1")
langsmith_client.create_examples(
    inputs=[{"text": "Great product!"}, {"text": "Terrible experience."}],
    outputs=[{"sentiment": "positive"}, {"sentiment": "negative"}],
    dataset_id=dataset.id,
)

# Define the target function (what you're evaluating)
def target_function(inputs: dict) -> dict:
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    response = llm.invoke(f"Classify sentiment (positive/negative/neutral/mixed): {inputs['text']}")
    return {"sentiment": response.content.strip().lower()}

# Define an evaluator: a callable that receives the run and the
# reference example and returns a score dict
def exact_match(run, example) -> dict:
    predicted = run.outputs["sentiment"]
    expected = example.outputs["sentiment"]
    return {"key": "exact_match", "score": int(predicted == expected)}

# Run evaluation
results = evaluate(
    target_function,
    data="sentiment-eval-v1",
    evaluators=[exact_match],
)
# Per-example scores and aggregate stats appear in the LangSmith web UI
# under the experiment that evaluate() creates.
```
Automated evals don't replace human review — they scale it. Run automated evals on every prompt change to catch regressions, then do focused human review on the cases that fail.
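The regression-catching workflow described above reduces to a small gate in CI: compare the current run's pass rate against a stored baseline and fail the build on a meaningful drop. A sketch, where the baseline value and tolerance are illustrative:

```python
def regression_gate(
    current_pass_rate: float,
    baseline_pass_rate: float,
    tolerance: float = 0.02,
) -> bool:
    """Return True if the current eval run is acceptable.

    A small tolerance absorbs run-to-run noise from nondeterministic
    judges; any drop beyond it is treated as a real regression.
    """
    return current_pass_rate >= baseline_pass_rate - tolerance

# Example: baseline pass rate 92%
assert regression_gate(0.95, 0.92) is True    # improvement passes
assert regression_gate(0.91, 0.92) is True    # within noise tolerance
assert regression_gate(0.88, 0.92) is False   # real regression, fail CI
```

The tolerance is a judgment call: too tight and flaky judges block every merge, too loose and slow regressions slip through; it is worth tuning against the observed variance of repeated runs on an unchanged prompt.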