Deployment and Testing

Testing AI Agents

Testing AI agents is harder than testing regular software. LLM outputs are non-deterministic, the reasoning loop has complex intermediate state, and failure modes include subtle behavioural issues that exact-match assertions will never catch. This lesson builds a layered testing strategy that gives you genuine confidence — not just green checkmarks — in your agent's behaviour.


Why Agent Testing Is Different

In conventional software, a function with the same inputs always produces the same output. In an agent:

  • The LLM may phrase the same answer differently on every run
  • A multi-step tool call sequence may reach the correct answer via a different path each time
  • Failures are often soft — the agent produces an answer, just the wrong one
  • Some bugs only emerge after 8+ reasoning iterations

This means your testing strategy must be layered:

| Layer | What It Tests | Determinism |
|---|---|---|
| Unit tests | Individual tools, parsers, validators | Fully deterministic |
| Component tests | State transitions, dispatch logic | Deterministic |
| Integration tests | Full agent loop with mocked LLM | Deterministic |
| Trajectory tests | Full agent loop with real LLM | Non-deterministic — evaluate on rubric |
| Regression tests | Known inputs that previously caused bugs | Deterministic with mocks |

Start with the first three layers, which are fast and cheap. Add trajectory tests for critical user-facing flows.


Setting Up the Test Environment

Install the test dependencies, then lay out the suite like this:

pip install pytest pytest-asyncio pytest-mock respx freezegun

tests/
├── conftest.py           # Shared fixtures
├── unit/
│   ├── test_tools.py     # Tool function tests
│   ├── test_state.py     # AgentState and transitions
│   └── test_dispatch.py  # ToolDispatcher validation logic
├── integration/
│   ├── test_agent_loop.py    # Full loop with mocked LLM
│   └── test_tool_registry.py # Registry behaviour
└── trajectory/
    └── test_research_task.py  # End-to-end with real LLM

Unit Testing Individual Tools

Tool functions should be pure Python — no LLM involvement, no global state. This makes them easy to unit test:
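As a concrete target for these tests, here is a minimal sketch of what a safe `calculate` tool might look like. This is a hypothetical implementation (your actual tool may differ) that walks the `ast` tree against an operator whitelist instead of calling `eval`:

```python
# agent/tools/calculator.py (hypothetical sketch)
import ast
import operator

# Whitelist of arithmetic operators the tool is allowed to evaluate
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}


def calculate(expression: str) -> str:
    """Safely evaluate an arithmetic expression; return the result as a string."""
    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    try:
        result = _eval(ast.parse(expression, mode="eval"))
        return str(float(result))
    except Exception as exc:  # SyntaxError, ZeroDivisionError, ValueError, ...
        return f"Error: {exc}"
```

Because everything goes through the whitelist, an input like `import os; ...` fails at parse time and comes back as an error string, which is exactly the behaviour the tests below assert.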

# tests/unit/test_tools.py
import pytest
from agent.tools.calculator import calculate
from agent.tools.file_reader import read_file
from pathlib import Path
import tempfile


class TestCalculateTool:
    """Unit tests for the calculate tool."""

    def test_basic_addition(self):
        assert calculate("2 + 2") == "4.0"

    def test_multiplication(self):
        assert calculate("12 * 7") == "84.0"

    def test_division(self):
        assert calculate("100 / 4") == "25.0"

    def test_exponentiation(self):
        assert calculate("2 ** 10") == "1024.0"

    def test_nested_expression(self):
        result = calculate("(10 + 5) * 3")
        assert result == "45.0"

    def test_invalid_expression_returns_error(self):
        result = calculate("import os; os.system('rm -rf /')")
        assert "Error" in result

    def test_empty_expression_returns_error(self):
        result = calculate("")
        assert "Error" in result

    def test_division_by_zero(self):
        result = calculate("1 / 0")
        assert "Error" in result or "inf" in result.lower()


class TestReadFileTool:
    """Unit tests for the read_file tool."""

    def test_reads_existing_file(self, tmp_path):
        target = tmp_path / "sample.txt"
        target.write_text("Hello, agent!")
        result = read_file(str(target))
        assert result == "Hello, agent!"

    def test_missing_file_returns_error(self):
        result = read_file("/nonexistent/path/file.txt")
        assert "Error" in result
        assert "not found" in result

    def test_large_file_returns_error(self, tmp_path):
        large_file = tmp_path / "large.bin"
        large_file.write_bytes(b"x" * 2_000_000)  # 2 MB
        result = read_file(str(large_file))
        assert "Error" in result
        assert "too large" in result

Testing State Transitions

State machine logic is deterministic and should have 100% test coverage:
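For reference, the transition tests below assume an immutable `AgentState` shaped roughly like this condensed sketch (hypothetical; history, messages, and timestamps are omitted, and the real class has more fields):

```python
# agent/state.py (condensed hypothetical sketch)
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class AgentPhase(Enum):
    PLANNING = "planning"
    EXECUTING = "executing"
    REFLECTING = "reflecting"
    DONE = "done"


@dataclass(frozen=True)  # frozen=True makes mutation raise at runtime
class AgentState:
    phase: AgentPhase
    iteration: int = 0
    final_answer: Optional[str] = None

    @classmethod
    def from_user_input(cls, text: str) -> "AgentState":
        return cls(phase=AgentPhase.PLANNING)

    def to_executing(self) -> "AgentState":
        assert self.phase in (AgentPhase.PLANNING, AgentPhase.REFLECTING)
        return replace(self, phase=AgentPhase.EXECUTING,
                       iteration=self.iteration + 1)

    def to_reflecting(self) -> "AgentState":
        assert self.phase == AgentPhase.EXECUTING
        return replace(self, phase=AgentPhase.REFLECTING,
                       iteration=self.iteration + 1)

    def to_done(self, answer: str) -> "AgentState":
        return replace(self, phase=AgentPhase.DONE, final_answer=answer)
```

Returning a fresh object from `dataclasses.replace` is what makes the no-mutation test in this section pass for free.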

# tests/unit/test_state.py
import pytest
from agent.state import AgentState, AgentPhase, Role, Message


class TestAgentPhaseTransitions:
    """Verify that phase transitions enforce the correct sequence."""

    def test_initial_state_is_planning(self):
        state = AgentState.from_user_input("hello")
        assert state.phase == AgentPhase.PLANNING

    def test_planning_transitions_to_executing(self):
        state = AgentState.from_user_input("task")
        new_state = state.to_executing()
        assert new_state.phase == AgentPhase.EXECUTING

    def test_executing_transitions_to_reflecting(self):
        state = AgentState.from_user_input("task").to_executing()
        new_state = state.to_reflecting()
        assert new_state.phase == AgentPhase.REFLECTING

    def test_done_sets_final_answer(self):
        state = AgentState.from_user_input("task")
        done_state = state.to_done("The answer is 42.")
        assert done_state.phase == AgentPhase.DONE
        assert done_state.final_answer == "The answer is 42."

    def test_invalid_transition_raises_assertion(self):
        state = AgentState.from_user_input("task")
        # Cannot go from PLANNING → REFLECTING (must go through EXECUTING first)
        with pytest.raises(AssertionError):
            state.to_reflecting()

    def test_transitions_increment_iteration(self):
        state = AgentState.from_user_input("task")
        assert state.iteration == 0
        state = state.to_executing()
        assert state.iteration == 1
        state = state.to_reflecting()
        assert state.iteration == 2

    def test_transitions_do_not_mutate_original(self):
        original = AgentState.from_user_input("task")
        new_state = original.to_executing()
        # Original must not be modified
        assert original.phase == AgentPhase.PLANNING
        assert new_state.phase == AgentPhase.EXECUTING

    def test_user_message_recorded_in_history(self):
        state = AgentState.from_user_input("What is the capital of France?")
        assert len(state.history) == 1
        assert state.history[0].role == Role.USER
        assert "France" in state.history[0].content

Mocking the LLM for Integration Tests

The key to fast, reliable integration tests is a mock LLM that returns predefined responses in sequence. You can script exactly what the LLM "says" on each call:

# tests/conftest.py
import pytest
from typing import Iterator


class MockLLM:
    """
    A scriptable mock LLM for use in integration tests.

    Responses are consumed from the queue in order. If the queue is
    exhausted, `complete` raises an AssertionError — this is intentional
    so tests fail loudly if the agent makes more LLM calls than expected.
    """

    def __init__(self, responses: list[dict]) -> None:
        self._responses: Iterator[dict] = iter(responses)

    def complete(self, messages: list[dict], tools: list | None = None) -> dict:
        """Return the next scripted response."""
        try:
            return next(self._responses)
        except StopIteration:
            raise AssertionError(
                "MockLLM exhausted: the agent made more LLM calls than expected. "
                "Add more responses to the mock queue."
            )

    def count_tokens(self, text: str) -> int:
        return len(text.split())


@pytest.fixture
def mock_llm_tool_then_answer():
    """A mock LLM that first calls a tool, then gives a final answer."""
    return MockLLM(responses=[
        {
            "finish_reason": "tool_calls",
            "tool_calls": [
                {"name": "calculate", "args": {"expression": "100 * 0.15"}}
            ],
        },
        {
            "finish_reason": "stop",
            "content": "15% of 100 is 15.",
        },
    ])


@pytest.fixture
def mock_llm_direct_answer():
    """A mock LLM that answers immediately without using any tools."""
    return MockLLM(responses=[
        {
            "finish_reason": "stop",
            "content": "The capital of France is Paris.",
        }
    ])

Integration Test: Full Agent Loop

# tests/integration/test_agent_loop.py
import pytest
from unittest.mock import MagicMock

from agent.orchestrator import AgentOrchestrator
from agent.state import AgentRequest
from agent.tools.calculator import calculate
from tests.conftest import MockLLM, MockTool


class TestAgentLoop:
    """Integration tests for the AgentOrchestrator reasoning loop."""

    def test_direct_answer_requires_one_llm_call(
        self, mock_llm_direct_answer
    ):
        """Agent should stop immediately when the LLM gives a direct answer."""
        agent = AgentOrchestrator(
            llm=mock_llm_direct_answer,
            memory=MagicMock(),
            tools=[],
            max_iterations=5,
        )
        response = agent.handle(AgentRequest(
            user_id="u1",
            session_id="s1",
            message="What is the capital of France?",
        ))
        assert response.success
        assert "Paris" in response.content
        assert response.iterations == 1
        assert response.tool_calls_made == []

    def test_tool_call_recorded_in_response(
        self, mock_llm_tool_then_answer
    ):
        """Agent should record which tools were used."""
        # MockTool is a small conftest helper that wraps a plain function
        # with the tool interface (name + callable) the orchestrator expects.
        mock_tool = MockTool(name="calculate", fn=calculate)
        agent = AgentOrchestrator(
            llm=mock_llm_tool_then_answer,
            memory=MagicMock(),
            tools=[mock_tool],
            max_iterations=5,
        )
        response = agent.handle(AgentRequest(
            user_id="u1",
            session_id="s1",
            message="What is 15% of 100?",
        ))
        assert response.success
        assert "calculate" in response.tool_calls_made

    def test_max_iterations_returns_failure(self):
        """Agent should return a failure response when it hits the iteration cap."""
        # Mock LLM that always calls a tool — never terminates
        infinite_llm = MockLLM(responses=[
            {
                "finish_reason": "tool_calls",
                "tool_calls": [{"name": "search_web", "args": {"query": "test"}}],
            }
        ] * 20)

        agent = AgentOrchestrator(
            llm=infinite_llm,
            memory=MagicMock(),
            tools=[],
            max_iterations=3,
        )
        response = agent.handle(AgentRequest(
            user_id="u1",
            session_id="s1",
            message="Loop forever.",
        ))
        assert not response.success
        assert response.error == "Max iterations exceeded"
        assert response.iterations == 3
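
These tests exercise a loop whose core logic can be condensed to a sketch like the following (hypothetical; the real AgentOrchestrator also manages memory, state transitions, and error handling, and `run_loop` is an illustrative name, not the lesson's API):

```python
from typing import Any, Callable


def run_loop(llm, tools: dict[str, Callable[..., str]], message: str,
             max_iterations: int = 5) -> dict[str, Any]:
    """Condensed sketch of the reasoning loop the integration tests exercise."""
    messages = [{"role": "user", "content": message}]
    tool_calls_made: list[str] = []
    for iteration in range(1, max_iterations + 1):
        reply = llm.complete(messages, tools=list(tools))
        if reply["finish_reason"] == "stop":
            # Direct answer: terminate and report how much work was done
            return {"success": True, "content": reply["content"],
                    "iterations": iteration, "tool_calls_made": tool_calls_made}
        for call in reply["tool_calls"]:
            fn = tools.get(call["name"])
            result = fn(**call["args"]) if fn else f"Error: unknown tool {call['name']}"
            tool_calls_made.append(call["name"])
            messages.append({"role": "tool", "content": result})
    # Iteration cap reached without a final answer
    return {"success": False, "error": "Max iterations exceeded",
            "iterations": max_iterations, "tool_calls_made": tool_calls_made}
```

Seen this way, the three tests above map directly onto the loop's three exits: the `stop` branch, the tool-call branch, and the iteration cap.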

Deterministic Test Fixtures with Freezegun

Time-sensitive state can make tests flaky. Freeze time to make timestamps deterministic:

# tests/unit/test_state_timestamps.py
from freezegun import freeze_time
from datetime import datetime, timezone
from agent.state import AgentState


@freeze_time("2025-01-15 10:00:00")
def test_state_created_at_is_frozen_time():
    state = AgentState.from_user_input("test")
    expected = datetime(2025, 1, 15, 10, 0, 0, tzinfo=timezone.utc)
    assert state.created_at == expected


@freeze_time("2025-01-15 10:00:00")
def test_checkpoint_roundtrip_preserves_timestamps(tmp_path):
    """Saved and loaded state should have identical timestamps."""
    from agent.state import save_checkpoint, load_checkpoint

    state = AgentState.from_user_input("roundtrip test")
    save_checkpoint(state, tmp_path)
    loaded = load_checkpoint(state.session_id, tmp_path)
    assert loaded.created_at == state.created_at
    assert loaded.updated_at == state.updated_at

Trajectory Testing with a Real LLM

For critical user flows, trajectory tests use a real LLM and evaluate the output on a rubric rather than checking exact strings:

# tests/trajectory/test_research_task.py
import pytest
import os

# Skip trajectory tests unless the OPENAI_API_KEY env var is set
pytestmark = pytest.mark.skipif(
    not os.getenv("OPENAI_API_KEY"),
    reason="Trajectory tests require OPENAI_API_KEY",
)


def evaluate_answer(answer: str, criteria: list[str]) -> dict[str, bool]:
    """
    Simple rubric evaluator — checks whether each criterion phrase
    appears in the answer. In production, use an LLM-as-judge approach
    for more nuanced evaluation.
    """
    return {
        criterion: criterion.lower() in answer.lower()
        for criterion in criteria
    }


class TestResearchTrajectory:
    """End-to-end trajectory tests. Marked slow — run with pytest -m slow."""

    @pytest.mark.slow
    def test_capital_city_research(self, production_agent):
        """Agent should correctly identify and explain a capital city."""
        response = production_agent.handle_message(
            "What is the capital of Australia, and why is it not Sydney?"
        )
        scores = evaluate_answer(response, [
            "Canberra",
            "Sydney",
            "compromise",
        ])
        # At least 2 of 3 criteria must be present
        assert sum(scores.values()) >= 2, (
            f"Answer failed rubric evaluation.\n"
            f"Answer: {response}\n"
            f"Criteria scores: {scores}"
        )

    @pytest.mark.slow
    def test_calculation_chain(self, production_agent):
        """Agent should use the calculator tool for multi-step arithmetic."""
        response = production_agent.handle_message(
            "I have $1000. I invest it at 5% annual interest for 3 years. "
            "How much do I have? Show your calculation."
        )
        # The answer should be in the range $1150–$1160 (compound interest)
        import re
        numbers = [float(n.replace(",", "")) for n in re.findall(r"\d[\d,]*\.?\d*", response)]
        assert any(1150 <= n <= 1160 for n in numbers), (
            f"Expected an answer near $1157.63. Got: {response}"
        )
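
The keyword-matching evaluator above is deliberately crude. A step towards LLM-as-judge can be sketched by injecting the judge call, which keeps the evaluator unit-testable; `judge_fn` and its prompt format are assumptions for illustration, not part of this lesson's codebase:

```python
from typing import Callable


def judge_answer(answer: str, criterion: str,
                 judge_fn: Callable[[str], str]) -> bool:
    """Ask a judge model whether `answer` satisfies `criterion`.

    `judge_fn` takes a prompt string and returns the model's raw reply;
    in tests it can be a stub, in production a real LLM call.
    """
    prompt = (
        "Does the following answer satisfy the criterion? Reply YES or NO.\n"
        f"Criterion: {criterion}\n"
        f"Answer: {answer}"
    )
    # Tolerate whitespace and casing in the judge's reply
    return judge_fn(prompt).strip().upper().startswith("YES")
```

In trajectory tests, a rubric then becomes a list of criteria scored by `judge_answer`, with the same "at least N of M" assertion pattern used above.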

Test Isolation: Preventing State Leakage

Agent tests that share state can produce mysterious, order-dependent failures. Use pytest fixtures to guarantee a clean slate:

# tests/conftest.py (additions)
import pytest


@pytest.fixture(autouse=True)
def clean_checkpoint_dir(tmp_path, monkeypatch):
    """
    Redirect checkpoint directory to a fresh temp directory for each test.
    Prevents checkpoint files from one test bleeding into another.
    """
    from agent import state as state_module
    monkeypatch.setattr(state_module, "CHECKPOINT_DIR", tmp_path)
    yield
    # tmp_path is automatically cleaned up by pytest after each test


@pytest.fixture
def isolated_registry():
    """Return a fresh ToolRegistry instance for each test."""
    from agent.tools.registry import ToolRegistry
    return ToolRegistry()

Pytest Configuration

# pytest.ini
[pytest]
testpaths = tests
addopts = -v --tb=short
markers =
    slow: marks tests as slow (use -m "not slow" to skip in CI)
    trajectory: marks end-to-end trajectory tests requiring a real LLM
    integration: marks integration tests
    unit: marks unit tests

asyncio_mode = auto

CI Tip: Run unit and integration tests on every commit (pytest -m "not slow"). Run trajectory tests nightly or on pull requests to main. This keeps your feedback loop fast while still catching LLM-level regressions.


Key Takeaways

  • Use a layered testing strategy: unit → component → integration (mocked LLM) → trajectory (real LLM).
  • Mock the LLM for integration tests using a scripted response queue — this makes loops deterministic and fast.
  • Test state machine transitions exhaustively — they are the backbone of your agent's control flow.
  • Use rubric evaluation for trajectory tests rather than exact string matching.
  • Isolate test state with pytest fixtures to prevent order-dependent failures.
  • Run slow tests in CI nightly, not on every commit.

Further Reading