Automated Evaluation Frameworks
Manually reviewing LLM outputs doesn't scale. Once you have more than a few dozen test cases, you need automated scoring. This lesson covers LLM-as-judge evaluation — using a second LLM to score the first — along with established frameworks like Promptfoo and LangSmith.
LLM-as-Judge
The most flexible approach to automated evaluation is using an LLM to assess the quality of another LLM's output. The judge LLM reads the original prompt, the response, and a rubric, then produces a numeric score or categorical rating.
```python
from openai import OpenAI
from pydantic import BaseModel
from typing import Literal

client = OpenAI()

class JudgeVerdict(BaseModel):
    score: int  # 1-5
    reasoning: str
    specific_issues: list[str]
    verdict: Literal["pass", "fail"]

def llm_judge(
    original_question: str,
    ai_response: str,
    rubric: str,
    passing_score: int = 4,
) -> JudgeVerdict:
    """
    Use GPT-4o (a stronger model) to judge outputs from GPT-4o-mini.
    Using a more capable model as judge improves reliability.
    Always use a model at least as capable as the one being evaluated.
    """
    judge_prompt = f"""You are an impartial evaluator. Score the AI response on a 1-5 scale.

RUBRIC:
{rubric}

SCORING:
5 = Excellent: Fully satisfies the rubric with no meaningful issues
4 = Good: Satisfies the rubric with minor issues
3 = Acceptable: Partially satisfies the rubric with notable gaps
2 = Poor: Fails to satisfy most rubric criteria
1 = Unacceptable: Completely fails or is harmful

ORIGINAL QUESTION:
{original_question}

AI RESPONSE TO EVALUATE:
{ai_response}

Evaluate the response against the rubric and provide your verdict."""

    completion = client.beta.chat.completions.parse(
        model="gpt-4o",  # Stronger judge than the model being evaluated
        messages=[
            {"role": "system", "content": "You are an objective evaluator. Be critical and precise."},
            {"role": "user", "content": judge_prompt},
        ],
        response_format=JudgeVerdict,
    )
    verdict = completion.choices[0].message.parsed
    # Derive pass/fail deterministically from the numeric score rather
    # than trusting the judge's own verdict field
    verdict.verdict = "pass" if verdict.score >= passing_score else "fail"
    return verdict
```
````python
# Example usage: evaluating a technical explanation
rubric = """
ACCURACY (2 points): Technical information must be correct. No hallucinations.
CLARITY (1 point): Explanation should be understandable to a junior developer.
COMPLETENESS (1 point): Must address all aspects of the question.
CODE QUALITY (1 point): If code is provided, it must be syntactically correct and idiomatic.
"""

result = llm_judge(
    original_question="Explain Python decorators with an example.",
    ai_response="""Decorators in Python are functions that modify other functions...

```python
def my_decorator(func):
    def wrapper(*args, **kwargs):
        print("Before")
        result = func(*args, **kwargs)
        print("After")
        return result
    return wrapper

@my_decorator
def greet(name):
    print(f"Hello, {name}")

greet("Alice")
```""",
    rubric=rubric,
)

print(f"Score: {result.score}/5 ({result.verdict})")
print(f"Reasoning: {result.reasoning}")
````
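A single verdict is rarely actionable on its own; in practice you run the judge over a whole test set and aggregate the results. A minimal sketch of that aggregation step, using a lightweight dataclass as a stand-in for the `JudgeVerdict` model above (an assumption for illustration, not the Pydantic class itself):

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    # Lightweight stand-in for the JudgeVerdict model above
    score: int    # 1-5
    verdict: str  # "pass" or "fail"

def summarize(verdicts: list[Verdict]) -> dict:
    """Aggregate per-example judge verdicts into suite-level metrics."""
    total = len(verdicts)
    passed = sum(1 for v in verdicts if v.verdict == "pass")
    return {
        "pass_rate": passed / total,
        "mean_score": sum(v.score for v in verdicts) / total,
        "worst_score": min(v.score for v in verdicts),
    }

stats = summarize([Verdict(5, "pass"), Verdict(4, "pass"),
                   Verdict(2, "fail"), Verdict(5, "pass")])
print(stats)  # {'pass_rate': 0.75, 'mean_score': 4.0, 'worst_score': 2}
```

Tracking the worst score alongside the pass rate matters: a prompt change can keep the average flat while introducing a single catastrophic failure.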
Known LLM-Judge Biases
Be aware of these biases when designing judge prompts:
| Bias | Description | Mitigation |
|---|---|---|
| Position bias | Prefers first option in A/B tests | Randomize order, run both orderings |
| Verbosity bias | Prefers longer responses | Include length as an explicit rubric dimension |
| Self-bias | Judge prefers outputs from its own model family | Use a different model family as judge |
| Sycophancy | Judge agrees with whatever is stated | Use explicit rubrics, not open-ended scoring |
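For pairwise (A/B) comparisons, the position-bias mitigation in the table can be made mechanical: call the judge twice with the candidates in both orders and only accept a winner when the two calls agree. A sketch of that wrapper, where `judge` is a placeholder for any pairwise-judge callable returning which position won:

```python
from typing import Callable, Literal

Winner = Literal["first", "second"]

def debiased_compare(
    response_a: str,
    response_b: str,
    judge: Callable[[str, str], Winner],
) -> str:
    """Run a pairwise judge in both orderings to cancel position bias.

    Returns "a", "b", or "tie". A "tie" means the judge contradicted
    itself across orderings, which signals position bias rather than
    a real quality difference.
    """
    first_pass = judge(response_a, response_b)   # A shown first
    second_pass = judge(response_b, response_a)  # B shown first
    if first_pass == "first" and second_pass == "second":
        return "a"  # A won from both positions
    if first_pass == "second" and second_pass == "first":
        return "b"  # B won from both positions
    return "tie"

# A toy judge that always prefers whichever response it sees first
# gets neutralized to a tie:
biased_judge = lambda x, y: "first"
print(debiased_compare("resp A", "resp B", biased_judge))  # tie
```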
Promptfoo: Open-Source Eval Framework
Promptfoo is a popular open-source tool for systematic prompt evaluation:
```shell
npm install -g promptfoo
# or run it without a global install:
npx promptfoo@latest
```
```yaml
# promptfooconfig.yaml
description: "Sentiment classifier evaluation"

providers:
  - id: openai:gpt-4o-mini
    config:
      temperature: 0

prompts:
  - "Classify the sentiment of this text as 'positive', 'negative', 'neutral', or 'mixed'.\n\nText: {{input}}"

tests:
  - vars:
      input: "I absolutely love this product!"
    assert:
      - type: equals
        value: "positive"
      - type: javascript
        value: "output.length < 20"
  - vars:
      input: "Terrible quality, waste of money."
    assert:
      - type: equals
        value: "negative"
  - vars:
      input: "Great features but terrible support."
    assert:
      - type: equals
        value: "mixed"
      - type: llm-rubric
        value: "Response correctly identifies both positive (features) and negative (support) aspects"
```
```shell
promptfoo eval
promptfoo view  # Opens web UI with results
```
LangSmith: Production Eval Pipeline
For LangChain-based applications, LangSmith provides integrated tracing and evaluation:
```python
from langsmith import Client
from langsmith.evaluation import evaluate

langsmith_client = Client()

# Create a dataset
dataset = langsmith_client.create_dataset("sentiment-eval-v1")
langsmith_client.create_examples(
    inputs=[{"text": "Great product!"}, {"text": "Terrible experience."}],
    outputs=[{"sentiment": "positive"}, {"sentiment": "negative"}],
    dataset_id=dataset.id,
)

# Define the target function (what you're evaluating)
def target_function(inputs: dict) -> dict:
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    response = llm.invoke(f"Classify sentiment (positive/negative/neutral/mixed): {inputs['text']}")
    return {"sentiment": response.content.strip().lower()}

# Define an evaluator: a callable that receives the run and the
# reference example and returns a score dict
def exact_match(run, example) -> dict:
    predicted = run.outputs["sentiment"]
    expected = example.outputs["sentiment"]
    return {"key": "exact_match", "score": int(predicted == expected)}

# Run evaluation
results = evaluate(
    target_function,
    data="sentiment-eval-v1",
    evaluators=[exact_match],
)
# Per-example scores and aggregate stats appear in the LangSmith web UI
# under the experiment that evaluate() creates.
```
Automated evals don't replace human review — they scale it. Run automated evals on every prompt change to catch regressions, then do focused human review on the cases that fail.
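The regression-catching workflow described above reduces to a small gate in CI: compare the current run's pass rate against a stored baseline and fail the build on a meaningful drop. A sketch, where the baseline value and tolerance are illustrative:

```python
def regression_gate(
    current_pass_rate: float,
    baseline_pass_rate: float,
    tolerance: float = 0.02,
) -> bool:
    """Return True if the current eval run is acceptable.

    A small tolerance absorbs run-to-run noise from nondeterministic
    judges; any drop beyond it is treated as a real regression.
    """
    return current_pass_rate >= baseline_pass_rate - tolerance

# Example: baseline pass rate 92%
assert regression_gate(0.95, 0.92) is True    # improvement passes
assert regression_gate(0.91, 0.92) is True    # within noise tolerance
assert regression_gate(0.88, 0.92) is False   # real regression, fail CI
```

The tolerance is a judgment call: too tight and flaky judges block every merge, too loose and slow regressions slip through; it is worth tuning against the observed variance of repeated runs on an unchanged prompt.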