Testing AI Agents
Testing AI agents is harder than testing regular software. LLM outputs are non-deterministic, the reasoning loop has complex intermediate state, and failure modes include subtle behavioural issues that exact-match assertions will never catch. This lesson builds a layered testing strategy that gives you genuine confidence — not just green checkmarks — in your agent's behaviour.
Why Agent Testing Is Different
In conventional software, a function with the same inputs always produces the same output. In an agent:
- The LLM may phrase the same answer differently on every run
- A multi-step tool call sequence may reach the correct answer via a different path each time
- Failures are often soft — the agent produces an answer, just the wrong one
- Some bugs only emerge after 8+ reasoning iterations
This means your testing strategy must be layered:
| Layer | What It Tests | Determinism |
|---|---|---|
| Unit tests | Individual tools, parsers, validators | Fully deterministic |
| Component tests | State transitions, dispatch logic | Deterministic |
| Integration tests | Full agent loop with mocked LLM | Deterministic |
| Trajectory tests | Full agent loop with real LLM | Non-deterministic — evaluate on rubric |
| Regression tests | Known inputs that previously caused bugs | Deterministic with mocks |
Start with layers 1–3, which are fast and cheap. Add trajectory tests for critical user-facing flows.
Setting Up the Test Environment
pip install pytest pytest-asyncio pytest-mock respx freezegun
tests/
├── conftest.py # Shared fixtures
├── unit/
│ ├── test_tools.py # Tool function tests
│ ├── test_state.py # AgentState and transitions
│ └── test_dispatch.py # ToolDispatcher validation logic
├── integration/
│ ├── test_agent_loop.py # Full loop with mocked LLM
│ └── test_tool_registry.py # Registry behaviour
└── trajectory/
└── test_research_task.py # End-to-end with real LLM
Unit Testing Individual Tools
Tool functions should be pure Python — no LLM involvement, no global state. This makes them easy to unit test:
# tests/unit/test_tools.py
import pytest
from agent.tools.calculator import calculate
from agent.tools.file_reader import read_file
from pathlib import Path
import tempfile
class TestCalculateTool:
"""Unit tests for the calculate tool."""
def test_basic_addition(self):
assert calculate("2 + 2") == "4.0"
def test_multiplication(self):
assert calculate("12 * 7") == "84.0"
def test_division(self):
assert calculate("100 / 4") == "25.0"
def test_exponentiation(self):
assert calculate("2 ** 10") == "1024.0"
def test_nested_expression(self):
result = calculate("(10 + 5) * 3")
assert result == "45.0"
def test_invalid_expression_returns_error(self):
result = calculate("import os; os.system('rm -rf /')")
assert "Error" in result
def test_empty_expression_returns_error(self):
result = calculate("")
assert "Error" in result
def test_division_by_zero(self):
result = calculate("1 / 0")
assert "Error" in result or "inf" in result.lower()
class TestReadFileTool:
"""Unit tests for the read_file tool."""
def test_reads_existing_file(self, tmp_path):
target = tmp_path / "sample.txt"
target.write_text("Hello, agent!")
result = read_file(str(target))
assert result == "Hello, agent!"
def test_missing_file_returns_error(self):
result = read_file("/nonexistent/path/file.txt")
assert "Error" in result
assert "not found" in result
def test_large_file_returns_error(self, tmp_path):
large_file = tmp_path / "large.bin"
large_file.write_bytes(b"x" * 2_000_000) # 2 MB
result = read_file(str(large_file))
assert "Error" in result
assert "too large" in result
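These tests pin down the tool's contract: results come back as strings, and failures come back as "Error: ..." strings rather than exceptions. A minimal sketch of a `calculate` that satisfies them, parsing with `ast` so arbitrary code like the `os.system` payload can never execute (an illustration, not necessarily the lesson's canonical implementation):

```python
# Sketch of a calculate tool matching the test contract above.
# Uses ast instead of eval(), so only arithmetic nodes are ever evaluated.
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def _eval_node(node: ast.AST) -> float:
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return float(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("unsupported expression")

def calculate(expression: str) -> str:
    """Evaluate an arithmetic expression; return the result (or an error) as a string."""
    try:
        tree = ast.parse(expression, mode="eval")  # rejects statements like `import os`
        return str(_eval_node(tree.body))
    except (SyntaxError, ValueError, ZeroDivisionError) as exc:
        return f"Error: {exc}"
```

Because every result is `float()`-ed before stringifying, `calculate("2 + 2")` yields `"4.0"`, exactly what the unit tests assert.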
Testing State Transitions
State machine logic is deterministic and should have 100% test coverage:
# tests/unit/test_state.py
import pytest
from agent.state import AgentState, AgentPhase, Role, Message
class TestAgentPhaseTransitions:
"""Verify that phase transitions enforce the correct sequence."""
def test_initial_state_is_planning(self):
state = AgentState.from_user_input("hello")
assert state.phase == AgentPhase.PLANNING
def test_planning_transitions_to_executing(self):
state = AgentState.from_user_input("task")
new_state = state.to_executing()
assert new_state.phase == AgentPhase.EXECUTING
def test_executing_transitions_to_reflecting(self):
state = AgentState.from_user_input("task").to_executing()
new_state = state.to_reflecting()
assert new_state.phase == AgentPhase.REFLECTING
def test_done_sets_final_answer(self):
state = AgentState.from_user_input("task")
done_state = state.to_done("The answer is 42.")
assert done_state.phase == AgentPhase.DONE
assert done_state.final_answer == "The answer is 42."
def test_invalid_transition_raises_assertion(self):
state = AgentState.from_user_input("task")
# Cannot go from PLANNING → REFLECTING (must go through EXECUTING first)
with pytest.raises(AssertionError):
state.to_reflecting()
def test_transitions_increment_iteration(self):
state = AgentState.from_user_input("task")
assert state.iteration == 0
state = state.to_executing()
assert state.iteration == 1
state = state.to_reflecting()
assert state.iteration == 2
def test_transitions_do_not_mutate_original(self):
original = AgentState.from_user_input("task")
new_state = original.to_executing()
# Original must not be modified
assert original.phase == AgentPhase.PLANNING
assert new_state.phase == AgentPhase.EXECUTING
def test_user_message_recorded_in_history(self):
state = AgentState.from_user_input("What is the capital of France?")
assert len(state.history) == 1
assert state.history[0].role == Role.USER
assert "France" in state.history[0].content
Mocking the LLM for Integration Tests
The key to fast, reliable integration tests is a mock LLM that returns predefined responses in sequence. You can script exactly what the LLM "says" on each call:
# tests/conftest.py
import pytest
from unittest.mock import MagicMock, patch
from typing import Iterator
class MockLLM:
"""
A scriptable mock LLM for use in integration tests.
Responses are consumed from the queue in order. If the queue is
exhausted, an AssertionError is raised so that tests fail loudly
if the agent makes more LLM calls than expected.
"""
def __init__(self, responses: list[dict]) -> None:
self._responses: Iterator[dict] = iter(responses)
def complete(self, messages: list[dict], tools: list | None = None) -> dict:
"""Return the next scripted response."""
try:
return next(self._responses)
except StopIteration:
raise AssertionError(
"MockLLM exhausted: the agent made more LLM calls than expected. "
"Add more responses to the mock queue."
)
def count_tokens(self, text: str) -> int:
return len(text.split())
@pytest.fixture
def mock_llm_tool_then_answer():
"""A mock LLM that first calls a tool, then gives a final answer."""
return MockLLM(responses=[
{
"finish_reason": "tool_calls",
"tool_calls": [
{"name": "calculate", "args": {"expression": "100 * 0.15"}}
],
},
{
"finish_reason": "stop",
"content": "15% of 100 is 15.",
},
])
@pytest.fixture
def mock_llm_direct_answer():
"""A mock LLM that answers immediately without using any tools."""
return MockLLM(responses=[
{
"finish_reason": "stop",
"content": "The capital of France is Paris.",
}
])
Integration Test: Full Agent Loop
# tests/integration/test_agent_loop.py
import pytest
from agent.orchestrator import AgentOrchestrator
from agent.state import AgentPhase
from unittest.mock import MagicMock
from tests.conftest import MockLLM
class TestAgentLoop:
"""Integration tests for the AgentOrchestrator reasoning loop."""
def test_direct_answer_requires_one_llm_call(
self, mock_llm_direct_answer
):
"""Agent should stop immediately when the LLM gives a direct answer."""
agent = AgentOrchestrator(
llm=mock_llm_direct_answer,
memory=MagicMock(),
tools=[],
max_iterations=5,
)
from agent.state import AgentRequest
response = agent.handle(AgentRequest(
user_id="u1",
session_id="s1",
message="What is the capital of France?",
))
assert response.success
assert "Paris" in response.content
assert response.iterations == 1
assert response.tool_calls_made == []
def test_tool_call_recorded_in_response(
self, mock_llm_tool_then_answer
):
"""Agent should record which tools were used."""
from agent.tools.calculator import calculate
from tests.conftest import MockTool  # test helper pairing a tool name with a callable
mock_tool = MockTool(name="calculate", fn=calculate)
agent = AgentOrchestrator(
llm=mock_llm_tool_then_answer,
memory=MagicMock(),
tools=[mock_tool],
max_iterations=5,
)
from agent.state import AgentRequest
response = agent.handle(AgentRequest(
user_id="u1",
session_id="s1",
message="What is 15% of 100?",
))
assert response.success
assert "calculate" in response.tool_calls_made
def test_max_iterations_returns_failure(self):
"""Agent should return a failure response when it hits the iteration cap."""
# Mock LLM that always calls a tool — never terminates
infinite_llm = MockLLM(responses=[
{
"finish_reason": "tool_calls",
"tool_calls": [{"name": "search_web", "args": {"query": "test"}}],
}
] * 20)
agent = AgentOrchestrator(
llm=infinite_llm,
memory=MagicMock(),
tools=[],
max_iterations=3,
)
from agent.state import AgentRequest
response = agent.handle(AgentRequest(
user_id="u1",
session_id="s1",
message="Loop forever.",
))
assert not response.success
assert response.error == "Max iterations exceeded"
assert response.iterations == 3
Deterministic Test Fixtures with Freezegun
Time-sensitive state can make tests flaky. Freeze time to make timestamps deterministic:
# tests/unit/test_state_timestamps.py
from freezegun import freeze_time
from datetime import datetime, timezone
from agent.state import AgentState
@freeze_time("2025-01-15 10:00:00")
def test_state_created_at_is_frozen_time():
state = AgentState.from_user_input("test")
expected = datetime(2025, 1, 15, 10, 0, 0, tzinfo=timezone.utc)
assert state.created_at == expected
@freeze_time("2025-01-15 10:00:00")
def test_checkpoint_roundtrip_preserves_timestamps(tmp_path):
"""Saved and loaded state should have identical timestamps."""
from agent.state import save_checkpoint, load_checkpoint
state = AgentState.from_user_input("roundtrip test")
save_checkpoint(state, tmp_path)
loaded = load_checkpoint(state.session_id, tmp_path)
assert loaded.created_at == state.created_at
assert loaded.updated_at == state.updated_at
Trajectory Testing with a Real LLM
For critical user flows, trajectory tests use a real LLM and evaluate the output on a rubric rather than checking exact strings:
# tests/trajectory/test_research_task.py
import pytest
import os
# Skip trajectory tests unless the OPENAI_API_KEY env var is set
pytestmark = pytest.mark.skipif(
not os.getenv("OPENAI_API_KEY"),
reason="Trajectory tests require OPENAI_API_KEY",
)
def evaluate_answer(answer: str, criteria: list[str]) -> dict[str, bool]:
"""
Simple rubric evaluator — checks whether each criterion phrase
appears in the answer. In production, use an LLM-as-judge approach
for more nuanced evaluation.
"""
return {
criterion: criterion.lower() in answer.lower()
for criterion in criteria
}
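The evaluator's output is easy to reason about. For an answer that names both cities but never explains the compromise, two of three criteria pass (`evaluate_answer` reproduced so the snippet runs standalone):

```python
# (evaluate_answer as defined above, repeated for a standalone run)
def evaluate_answer(answer: str, criteria: list[str]) -> dict[str, bool]:
    return {c: c.lower() in answer.lower() for c in criteria}

scores = evaluate_answer(
    "The capital is Canberra; Sydney lost out.",
    ["Canberra", "Sydney", "compromise"],
)
print(scores)  # → {'Canberra': True, 'Sydney': True, 'compromise': False}
print(sum(scores.values()))  # → 2
```

Summing the boolean values gives the rubric score, which is exactly what the trajectory test below thresholds against.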
class TestResearchTrajectory:
"""End-to-end trajectory tests. Marked slow — run with pytest -m slow."""
@pytest.mark.slow
def test_capital_city_research(self, production_agent):
"""Agent should correctly identify and explain a capital city."""
response = production_agent.handle_message(
"What is the capital of Australia, and why is it not Sydney?"
)
scores = evaluate_answer(response, [
"Canberra",
"Sydney",
"compromise",
])
# At least 2 of 3 criteria must be present
assert sum(scores.values()) >= 2, (
f"Answer failed rubric evaluation.\n"
f"Answer: {response}\n"
f"Criteria scores: {scores}"
)
@pytest.mark.slow
def test_calculation_chain(self, production_agent):
"""Agent should use the calculator tool for multi-step arithmetic."""
response = production_agent.handle_message(
"I have $1000. I invest it at 5% annual interest for 3 years. "
"How much do I have? Show your calculation."
)
# The answer should be in the range $1150–$1160 (compound interest)
import re
numbers = [float(n.replace(",", "")) for n in re.findall(r"\d[\d,]*\.?\d*", response)]
assert any(1150 <= n <= 1160 for n in numbers), (
f"Expected an answer near $1157.63. Got: {response}"
)
Test Isolation: Preventing State Leakage
Agent tests that share state can produce mysterious, order-dependent failures. Use pytest fixtures to guarantee a clean slate:
# tests/conftest.py (additions)
import pytest
from pathlib import Path
import shutil
@pytest.fixture(autouse=True)
def clean_checkpoint_dir(tmp_path, monkeypatch):
"""
Redirect checkpoint directory to a fresh temp directory for each test.
Prevents checkpoint files from one test bleeding into another.
"""
from agent import state as state_module
monkeypatch.setattr(state_module, "CHECKPOINT_DIR", tmp_path)
yield
# tmp_path is automatically cleaned up by pytest after each test
@pytest.fixture
def isolated_registry():
"""Return a fresh ToolRegistry instance for each test."""
from agent.tools.registry import ToolRegistry
return ToolRegistry()
Pytest Configuration
# pytest.ini
[pytest]
testpaths = tests
addopts = -v --tb=short
markers =
slow: marks tests as slow (use -m "not slow" to skip in CI)
trajectory: marks end-to-end trajectory tests requiring a real LLM
integration: marks integration tests
unit: marks unit tests
asyncio_mode = auto
CI Tip: Run unit and integration tests on every commit (`pytest -m "not slow"`). Run trajectory tests nightly or on pull requests to `main`. This keeps your feedback loop fast while still catching LLM-level regressions.
Key Takeaways
- Use a layered testing strategy: unit → component → integration (mocked LLM) → trajectory (real LLM).
- Mock the LLM for integration tests using a scripted response queue — this makes loops deterministic and fast.
- Test state machine transitions exhaustively — they are the backbone of your agent's control flow.
- Use rubric evaluation for trajectory tests rather than exact string matching.
- Isolate test state with pytest fixtures to prevent order-dependent failures.
- Run slow tests in CI nightly, not on every commit.
Further Reading
- pytest documentation — comprehensive testing framework reference
- LLM-as-judge evaluation — the paper behind using LLMs to evaluate LLM outputs
- LangSmith tracing — production tool for capturing and evaluating agent trajectories