Testing AI Agents
Testing AI agents is harder than testing regular software. LLM outputs are non-deterministic, the reasoning loop has complex intermediate state, and failure modes include subtle behavioural issues that exact-match assertions will never catch. This lesson builds a layered testing strategy that gives you genuine confidence — not just green checkmarks — in your agent's behaviour.
Why Agent Testing Is Different
In conventional software, a function with the same inputs always produces the same output. In an agent:
- The LLM may phrase the same answer differently on every run
- A multi-step tool call sequence may reach the correct answer via a different path each time
- Failures are often soft — the agent produces an answer, just the wrong one
- Some bugs only emerge after 8+ reasoning iterations
This means your testing strategy must be layered:
| Layer | What It Tests | Determinism |
|---|---|---|
| Unit tests | Individual tools, parsers, validators | Fully deterministic |
| Component tests | State transitions, dispatch logic | Deterministic |
| Integration tests | Full agent loop with mocked LLM | Deterministic |
| Trajectory tests | Full agent loop with real LLM | Non-deterministic — evaluate on rubric |
| Regression tests | Known inputs that previously caused bugs | Deterministic with mocks |
Start with layers 1–3, which are fast and cheap. Add trajectory tests for critical user-facing flows.
Setting Up the Test Environment
pip install pytest pytest-asyncio pytest-mock respx freezegun
tests/
├── conftest.py # Shared fixtures
├── unit/
│ ├── test_tools.py # Tool function tests
│ ├── test_state.py # AgentState and transitions
│ └── test_dispatch.py # ToolDispatcher validation logic
├── integration/
│ ├── test_agent_loop.py # Full loop with mocked LLM
│ └── test_tool_registry.py # Registry behaviour
└── trajectory/
└── test_research_task.py # End-to-end with real LLM
Unit Testing Individual Tools
Tool functions should be pure Python — no LLM involvement, no global state. This makes them easy to unit test:
# tests/unit/test_tools.py
import pytest
from agent.tools.calculator import calculate
from agent.tools.file_reader import read_file
from pathlib import Path
import tempfile
class TestCalculateTool:
"""Unit tests for the calculate tool."""
def test_basic_addition(self):
assert calculate("2 + 2") == "4.0"
def test_multiplication(self):
assert calculate("12 * 7") == "84.0"
def test_division(self):
assert calculate("100 / 4") == "25.0"
def test_exponentiation(self):
assert calculate("2 ** 10") == "1024.0"
def test_nested_expression(self):
result = calculate("(10 + 5) * 3")
assert result == "45.0"
def test_invalid_expression_returns_error(self):
result = calculate("import os; os.system('rm -rf /')")
assert "Error" in result
def test_empty_expression_returns_error(self):
result = calculate("")
assert "Error" in result
def test_division_by_zero(self):
result = calculate("1 / 0")
assert "Error" in result or "inf" in result.lower()
class TestReadFileTool:
"""Unit tests for the read_file tool."""
def test_reads_existing_file(self, tmp_path):
target = tmp_path / "sample.txt"
target.write_text("Hello, agent!")
result = read_file(str(target))
assert result == "Hello, agent!"
def test_missing_file_returns_error(self):
result = read_file("/nonexistent/path/file.txt")
assert "Error" in result
assert "not found" in result
def test_large_file_returns_error(self, tmp_path):
large_file = tmp_path / "large.bin"
large_file.write_bytes(b"x" * 2_000_000) # 2 MB
result = read_file(str(large_file))
assert "Error" in result
assert "too large" in result
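These tests pin down the tool's contract: results come back as strings, and failures come back as "Error: ..." strings rather than exceptions. A minimal sketch of a `calculate` that satisfies them, parsing with `ast` so arbitrary code like the `os.system` payload can never execute (an illustration, not necessarily the lesson's canonical implementation):

```python
# Sketch of a calculate tool matching the test contract above.
# Uses ast instead of eval(), so only arithmetic nodes are ever evaluated.
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def _eval_node(node: ast.AST) -> float:
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return float(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("unsupported expression")

def calculate(expression: str) -> str:
    """Evaluate an arithmetic expression; return the result (or an error) as a string."""
    try:
        tree = ast.parse(expression, mode="eval")  # rejects statements like `import os`
        return str(_eval_node(tree.body))
    except (SyntaxError, ValueError, ZeroDivisionError) as exc:
        return f"Error: {exc}"
```

Because every result is `float()`-ed before stringifying, `calculate("2 + 2")` yields `"4.0"`, exactly what the unit tests assert.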
Testing State Transitions
State machine logic is deterministic and should have 100% test coverage:
# tests/unit/test_state.py
import pytest
from agent.state import AgentState, AgentPhase, Role, Message
class TestAgentPhaseTransitions:
"""Verify that phase transitions enforce the correct sequence."""
def test_initial_state_is_planning(self):
state = AgentState.from_user_input("hello")
assert state.phase == AgentPhase.PLANNING
def test_planning_transitions_to_executing(self):
state = AgentState.from_user_input("task")
new_state = state.to_executing()
assert new_state.phase == AgentPhase.EXECUTING
def test_executing_transitions_to_reflecting(self):
state = AgentState.from_user_input("task").to_executing()
new_state = state.to_reflecting()
assert new_state.phase == AgentPhase.REFLECTING
def test_done_sets_final_answer(self):
state = AgentState.from_user_input("task")
done_state = state.to_done("The answer is 42.")
assert done_state.phase == AgentPhase.DONE
assert done_state.final_answer == "The answer is 42."
def test_invalid_transition_raises_assertion(self):
state = AgentState.from_user_input("task")
# Cannot go from PLANNING → REFLECTING (must go through EXECUTING first)
with pytest.raises(AssertionError):
state.to_reflecting()
def test_transitions_increment_iteration(self):
state = AgentState.from_user_input("task")
assert state.iteration == 0
state = state.to_executing()
assert state.iteration == 1
state = state.to_reflecting()
assert state.iteration == 2
def test_transitions_do_not_mutate_original(self):
original = AgentState.from_user_input("task")
new_state = original.to_executing()
# Original must not be modified
assert original.phase == AgentPhase.PLANNING
assert new_state.phase == AgentPhase.EXECUTING
def test_user_message_recorded_in_history(self):
state = AgentState.from_user_input("What is the capital of France?")
assert len(state.history) == 1
assert state.history[0].role == Role.USER
assert "France" in state.history[0].content
Mocking the LLM for Integration Tests
The key to fast, reliable integration tests is a mock LLM that returns predefined responses in sequence. You can script exactly what the LLM "says" on each call:
# tests/conftest.py
import pytest
from unittest.mock import MagicMock, patch
from typing import Iterator
class MockLLM:
"""
A scriptable mock LLM for use in integration tests.
Responses are consumed from the queue in order. If the queue is
exhausted, an AssertionError is raised so that tests fail loudly
if the agent makes more LLM calls than expected.
"""
def __init__(self, responses: list[dict]) -> None:
self._responses: Iterator[dict] = iter(responses)
def complete(self, messages: list[dict], tools: list | None = None) -> dict:
"""Return the next scripted response."""
try:
return next(self._responses)
except StopIteration:
raise AssertionError(
"MockLLM exhausted: the agent made more LLM calls than expected. "
"Add more responses to the mock queue."
)
def count_tokens(self, text: str) -> int:
return len(text.split())
@pytest.fixture
def mock_llm_tool_then_answer():
"""A mock LLM that first calls a tool, then gives a final answer."""
return MockLLM(responses=[
{
"finish_reason": "tool_calls",
"tool_calls": [
{"name": "calculate", "args": {"expression": "100 * 0.15"}}
],
},
{
"finish_reason": "stop",
"content": "15% of 100 is 15.",
},
])
@pytest.fixture
def mock_llm_direct_answer():
"""A mock LLM that answers immediately without using any tools."""
return MockLLM(responses=[
{
"finish_reason": "stop",
"content": "The capital of France is Paris.",
}
])
Integration Test: Full Agent Loop
# tests/integration/test_agent_loop.py
import pytest
from agent.orchestrator import AgentOrchestrator
from agent.state import AgentPhase
from unittest.mock import MagicMock
from tests.conftest import MockLLM
class TestAgentLoop:
"""Integration tests for the AgentOrchestrator reasoning loop."""
def test_direct_answer_requires_one_llm_call(
self, mock_llm_direct_answer
):
"""Agent should stop immediately when the LLM gives a direct answer."""
agent = AgentOrchestrator(
llm=mock_llm_direct_answer,
memory=MagicMock(),
tools=[],
max_iterations=5,
)
from agent.state import AgentRequest
response = agent.handle(AgentRequest(
user_id="u1",
session_id="s1",
message="What is the capital of France?",
))
assert response.success
assert "Paris" in response.content
assert response.iterations == 1
assert response.tool_calls_made == []
def test_tool_call_recorded_in_response(
self, mock_llm_tool_then_answer
):
"""Agent should record which tools were used."""
from agent.tools.calculator import calculate
from tests.conftest import MockTool  # test helper pairing a tool name with a callable
mock_tool = MockTool(name="calculate", fn=calculate)
agent = AgentOrchestrator(
llm=mock_llm_tool_then_answer,
memory=MagicMock(),
tools=[mock_tool],
max_iterations=5,
)
from agent.state import AgentRequest
response = agent.handle(AgentRequest(
user_id="u1",
session_id="s1",
message="What is 15% of 100?",
))
assert response.success
assert "calculate" in response.tool_calls_made
def test_max_iterations_returns_failure(self):
"""Agent should return a failure response when it hits the iteration cap."""
# Mock LLM that always calls a tool — never terminates
infinite_llm = MockLLM(responses=[
{
"finish_reason": "tool_calls",
"tool_calls": [{"name": "search_web", "args": {"query": "test"}}],
}
] * 20)
agent = AgentOrchestrator(
llm=infinite_llm,
memory=MagicMock(),
tools=[],
max_iterations=3,
)
from agent.state import AgentRequest
response = agent.handle(AgentRequest(
user_id="u1",
session_id="s1",
message="Loop forever.",
))
assert not response.success
assert response.error == "Max iterations exceeded"
assert response.iterations == 3
Deterministic Test Fixtures with Freezegun
Time-sensitive state can make tests flaky. Freeze time to make timestamps deterministic:
# tests/unit/test_state_timestamps.py
from freezegun import freeze_time
from datetime import datetime, timezone
from agent.state import AgentState
@freeze_time("2025-01-15 10:00:00")
def test_state_created_at_is_frozen_time():
state = AgentState.from_user_input("test")
expected = datetime(2025, 1, 15, 10, 0, 0, tzinfo=timezone.utc)
assert state.created_at == expected
@freeze_time("2025-01-15 10:00:00")
def test_checkpoint_roundtrip_preserves_timestamps(tmp_path):
"""Saved and loaded state should have identical timestamps."""
from agent.state import save_checkpoint, load_checkpoint
state = AgentState.from_user_input("roundtrip test")
save_checkpoint(state, tmp_path)
loaded = load_checkpoint(state.session_id, tmp_path)
assert loaded.created_at == state.created_at
assert loaded.updated_at == state.updated_at
Trajectory Testing with a Real LLM
For critical user flows, trajectory tests use a real LLM and evaluate the output on a rubric rather than checking exact strings:
# tests/trajectory/test_research_task.py
import pytest
import os
# Skip trajectory tests unless the OPENAI_API_KEY env var is set
pytestmark = pytest.mark.skipif(
not os.getenv("OPENAI_API_KEY"),
reason="Trajectory tests require OPENAI_API_KEY",
)
def evaluate_answer(answer: str, criteria: list[str]) -> dict[str, bool]:
"""
Simple rubric evaluator — checks whether each criterion phrase
appears in the answer. In production, use an LLM-as-judge approach
for more nuanced evaluation.
"""
return {
criterion: criterion.lower() in answer.lower()
for criterion in criteria
}
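The evaluator's output is easy to reason about. For an answer that names both cities but never explains the compromise, two of three criteria pass (`evaluate_answer` reproduced so the snippet runs standalone):

```python
# (evaluate_answer as defined above, repeated for a standalone run)
def evaluate_answer(answer: str, criteria: list[str]) -> dict[str, bool]:
    return {c: c.lower() in answer.lower() for c in criteria}

scores = evaluate_answer(
    "The capital is Canberra; Sydney lost out.",
    ["Canberra", "Sydney", "compromise"],
)
print(scores)  # → {'Canberra': True, 'Sydney': True, 'compromise': False}
print(sum(scores.values()))  # → 2
```

Summing the boolean values gives the rubric score, which is exactly what the trajectory test below thresholds against.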
class TestResearchTrajectory:
"""End-to-end trajectory tests. Marked slow — run with pytest -m slow."""
@pytest.mark.slow
def test_capital_city_research(self, production_agent):
"""Agent should correctly identify and explain a capital city."""
response = production_agent.handle_message(
"What is the capital of Australia, and why is it not Sydney?"
)
scores = evaluate_answer(response, [
"Canberra",
"Sydney",
"compromise",
])
# At least 2 of 3 criteria must be present
assert sum(scores.values()) >= 2, (
f"Answer failed rubric evaluation.\n"
f"Answer: {response}\n"
f"Criteria scores: {scores}"
)
@pytest.mark.slow
def test_calculation_chain(self, production_agent):
"""Agent should use the calculator tool for multi-step arithmetic."""
response = production_agent.handle_message(
"I have $1000. I invest it at 5% annual interest for 3 years. "
"How much do I have? Show your calculation."
)
# The answer should be in the range $1150–$1160 (compound interest)
import re
numbers = [float(n.replace(",", "")) for n in re.findall(r"\d[\d,]*\.?\d*", response)]
assert any(1150 <= n <= 1160 for n in numbers), (
f"Expected an answer near $1157.63. Got: {response}"
)
Test Isolation: Preventing State Leakage
Agent tests that share state can produce mysterious, order-dependent failures. Use pytest fixtures to guarantee a clean slate:
# tests/conftest.py (additions)
import pytest
from pathlib import Path
import shutil
@pytest.fixture(autouse=True)
def clean_checkpoint_dir(tmp_path, monkeypatch):
"""
Redirect checkpoint directory to a fresh temp directory for each test.
Prevents checkpoint files from one test bleeding into another.
"""
from agent import state as state_module
monkeypatch.setattr(state_module, "CHECKPOINT_DIR", tmp_path)
yield
# tmp_path is automatically cleaned up by pytest after each test
@pytest.fixture
def isolated_registry():
"""Return a fresh ToolRegistry instance for each test."""
from agent.tools.registry import ToolRegistry
return ToolRegistry()
Pytest Configuration
# pytest.ini
[pytest]
testpaths = tests
addopts = -v --tb=short
markers =
slow: marks tests as slow (use -m "not slow" to skip in CI)
trajectory: marks end-to-end trajectory tests requiring a real LLM
integration: marks integration tests
unit: marks unit tests
asyncio_mode = auto
CI Tip: Run unit and integration tests on every commit (`pytest -m "not slow"`). Run trajectory tests nightly or on pull requests to `main`. This keeps your feedback loop fast while still catching LLM-level regressions.
Key Takeaways
- Use a layered testing strategy: unit → component → integration (mocked LLM) → trajectory (real LLM).
- Mock the LLM for integration tests using a scripted response queue — this makes loops deterministic and fast.
- Test state machine transitions exhaustively — they are the backbone of your agent's control flow.
- Use rubric evaluation for trajectory tests rather than exact string matching.
- Isolate test state with pytest fixtures to prevent order-dependent failures.
- Run slow tests in CI nightly, not on every commit.
Further Reading
- pytest documentation — comprehensive testing framework reference
- LLM-as-judge evaluation — the paper behind using LLMs to evaluate LLM outputs
- LangSmith tracing — production tool for capturing and evaluating agent trajectories