Beyond ReAct

Language Agent Tree Search (LATS)

From Linear to Arboreal Reasoning

Every agent pattern we've studied so far — ReAct, Plan-and-Execute, Reflexion — follows a fundamentally linear trajectory. The agent makes decisions, and if those decisions turn out to be wrong, it backtracks via replanning or reflection. But by the time it backtracks, it has already committed to a path and paid the cost of traversing it.

Language Agent Tree Search (LATS), introduced by Zhou et al. (2023), borrows the most powerful idea from game-playing AI — Monte Carlo Tree Search (MCTS) — and applies it to language agent decision-making. Rather than a single linear trajectory, LATS builds a tree of possible action sequences, intelligently exploring the most promising branches while pruning dead ends.

This gives agents something they've never had before: the ability to look ahead before committing, and to backtrack at any depth without losing information about other branches.


Background: Monte Carlo Tree Search

MCTS was famously used in AlphaGo. Its four phases apply directly to language agents:

                        ┌─────────────────┐
                        │   ROOT NODE     │
                        │  (initial state)│
                        └────────┬────────┘
                                 │
              ┌──────────────────┼──────────────────┐
              ▼                  ▼                  ▼
         [Action A]         [Action B]         [Action C]
              │                  │                  │
         [A → A1]          [B → B1]            ✗ (pruned)
         [A → A2]          [B → B2]
              │
         [A1 → A1a]  ◀── MCTS selects this branch for expansion

The Four MCTS Phases

  • Selection: traverse the tree using UCB1 to balance exploration against exploitation. In a language agent, this chooses which node (partial trajectory) to extend next.
  • Expansion: generate child nodes from the selected node. In a language agent, this samples K candidate next actions from the LLM.
  • Simulation: roll out a complete trajectory from the new node. In a language agent, this runs the agent to completion from that state.
  • Backpropagation: update node statistics up the tree. In a language agent, this propagates reward/value scores back to the node's ancestors.

The UCB1 selection formula used in LATS:

UCB1(node) = Q(node) / N(node) + C × √(ln(N(parent)) / N(node))

Where:

  • Q(node) = cumulative reward for this node
  • N(node) = number of times this node has been visited
  • C = exploration constant (typically √2)

The first term, Q/N, favours paths with a high average reward (exploitation); the second term, √(ln(N(parent)) / N(node)), grows for rarely visited nodes (exploration).
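
As a quick numeric check of the formula (the values are illustrative), the exploration term can outweigh a better mean reward when a sibling has barely been visited:

```python
import math

def ucb1(q: float, n: int, n_parent: int, c: float = math.sqrt(2)) -> float:
    """UCB1 score for a node with cumulative reward q and visit count n."""
    return q / n + c * math.sqrt(math.log(n_parent) / n)

# Parent visited 10 times. Child A is well explored with a good mean reward;
# child B was visited once with low reward, but its exploration term dominates.
score_a = ucb1(q=4.0, n=8, n_parent=10)   # ≈ 1.26  (Q/N = 0.5)
score_b = ucb1(q=0.3, n=1, n_parent=10)   # ≈ 2.45  (exploration term ≈ 2.15)
```

So selection would extend child B next, even though child A has the better average so far.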


LATS Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         LATS Agent                              │
│                                                                 │
│   ┌──────────┐    ┌───────────┐    ┌──────────┐   ┌─────────┐ │
│   │ Tree     │    │ Node      │    │ Reward   │   │ Memory  │ │
│   │ Manager  │◀──▶│ Expander  │◀──▶│ Model    │   │ Buffer  │ │
│   │ (MCTS)   │    │ (LLM)     │    │ (LLM)    │   │         │ │
│   └──────────┘    └───────────┘    └──────────┘   └─────────┘ │
│        │                                                        │
│   ┌────▼──────────────────────────────────────────────────┐    │
│   │                   Search Tree                          │    │
│   │  root ──▶ [step1_A] ──▶ [step2_A1] ──▶ [step3_A1a]   │    │
│   │       │                                               │    │
│   │       └──▶ [step1_B] ──▶ [step2_B1]  (pruned: -0.2)  │    │
│   │                      └──▶ [step2_B2] ──▶ ??? (expand) │   │
│   └────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Each node in the LATS tree represents a partial trajectory — the state of the world after a sequence of (thought, action, observation) triples. Each node stores:

  • The trajectory up to this point
  • The node's value estimate (Q)
  • The visit count (N)
  • A list of child nodes
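
The pseudocode below assumes a Node class roughly like this (a minimal sketch; the "FINISH" sentinel in is_terminal is an illustrative convention, not part of LATS itself):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    state: list                        # trajectory: list of (action, observation) pairs
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)
    cumulative_reward: float = 0.0     # Q: sum of backpropagated scores
    visit_count: int = 0               # N: times this node was on a selected path

    def add_child(self, child: "Node") -> None:
        self.children.append(child)

    def is_terminal(self) -> bool:
        # Task-specific; here: the last action emitted a final answer.
        return bool(self.state) and self.state[-1][0] == "FINISH"
```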

LATS Pseudocode

# High-level LATS algorithm pseudocode

def lats(task: str, tools: list, llm, max_iterations: int = 50) -> str:
    """
    Language Agent Tree Search.
    Returns the best trajectory found within the iteration budget.
    """
    root = Node(state=initial_state(task), parent=None)
    best_trajectory = None
    best_score = float('-inf')

    for _ in range(max_iterations):

        # ── SELECTION ────────────────────────────────────────────
        # Traverse from root, picking child with highest UCB1 score
        node = root
        while node.children and not node.is_terminal():
            node = select_by_ucb1(node)

        # ── EXPANSION ────────────────────────────────────────────
        if not node.is_terminal():
            # Sample K candidate next actions from the LLM
            candidate_actions = llm.sample_actions(
                trajectory=node.state,      # the (action, observation) history at this node
                task=task,
                tools=tools,
                n=5,                        # branching factor
                temperature=0.8,            # diversity in actions
            )
            for action in candidate_actions:
                observation = execute_tool(action, tools)
                child = Node(
                    state=node.state + [(action, observation)],
                    parent=node,
                )
                node.add_child(child)
            # Pick one child to simulate
            node = node.children[0]

        # ── SIMULATION ────────────────────────────────────────────
        # Roll out to completion from this node
        terminal_trajectory = rollout(node, task, tools, llm)

        # ── EVALUATION ────────────────────────────────────────────
        score = reward_model(task, terminal_trajectory, llm)
        if score > best_score:
            best_score = score
            best_trajectory = terminal_trajectory

        # ── BACKPROPAGATION ───────────────────────────────────────
        # Walk back up to root, updating Q and N
        propagate_reward(node, score)

    return extract_answer(best_trajectory)


def select_by_ucb1(node: Node, C: float = 1.414) -> Node:
    """Select child node with highest Upper Confidence Bound."""
    import math
    best_child = None
    best_score = float('-inf')
    for child in node.children:
        if child.visit_count == 0:
            return child   # always explore unvisited nodes first
        ucb = (child.cumulative_reward / child.visit_count +
               C * math.sqrt(math.log(node.visit_count) / child.visit_count))
        if ucb > best_score:
            best_score = ucb
            best_child = child
    return best_child


def propagate_reward(node: Node, score: float) -> None:
    """Backpropagate reward from leaf to root."""
    current = node
    while current is not None:
        current.visit_count += 1
        current.cumulative_reward += score
        current = current.parent

The Reward Model

The reward model is the most critical and most difficult component. For LATS to work, the reward model must distinguish between good and bad partial trajectories before the task is complete.

REWARD_PROMPT = """You are an expert evaluator assessing how well an agent
is progressing on a task.

Task: {task}

Agent's trajectory so far:
{trajectory}

Rate the quality of this trajectory on a scale from -1.0 to 1.0:
  -1.0: Clearly wrong direction, wasted steps, or made the task harder
   0.0: Neither helpful nor harmful; neutral progress
  +0.5: Making reasonable progress with good tool choices
  +1.0: Excellent progress; likely to reach the correct answer

Reply with ONLY a float between -1.0 and 1.0, nothing else.
"""

def reward_model(task: str, trajectory: list, llm) -> float:
    """Evaluate trajectory quality using LLM-as-judge."""
    trajectory_text = format_trajectory(trajectory)
    response = llm.invoke(
        REWARD_PROMPT.format(task=task, trajectory=trajectory_text)
    )
    try:
        score = float(response.content.strip())
    except ValueError:
        return 0.0   # default if model output is malformed
    return max(-1.0, min(1.0, score))   # clamp to the prompt's stated range
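
The format_trajectory helper is also left implicit; a minimal sketch, assuming trajectories are lists of (action, observation) pairs as stored in the tree nodes:

```python
def format_trajectory(trajectory: list) -> str:
    """Render (action, observation) pairs as numbered steps for the judge prompt."""
    lines = []
    for i, (action, observation) in enumerate(trajectory, start=1):
        lines.append(f"Step {i}:")
        lines.append(f"  Action: {action}")
        lines.append(f"  Observation: {observation}")
    return "\n".join(lines)
```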

Important: The reward model is a bottleneck. It is called once per node expansion. In a tree with branching factor 5 and 50 iterations, that's potentially 250 LLM calls just for scoring. Use the smallest model that achieves acceptable reward quality.


LATS vs. Reflexion vs. Plan-and-Execute

Dimension            Reflexion                     Plan-and-Execute          LATS
Search strategy      Linear, retry                 Linear, replan            Tree search
Backtracking depth   Task-level                    Step-level                Node-level (any depth)
Parallelism          Sequential                    Sequential                Branches explorable in parallel
Compute cost         Medium                        Medium                    High
Best for             Single-attempt improvement    Long sequential tasks     Complex tasks with many decision points
Handles dead ends    Poorly                        Moderately                Well (prunes via backprop)
Explainability       High (explicit reflection)    High (explicit plan)      Moderate (tree structure visible)

Exploration vs. Exploitation Tuning

The exploration constant C in UCB1 is your primary tuning knob:

  • High C (e.g., 2.0): Explore broadly; good when you don't know which paths are promising. Risk: wastes budget on clearly bad branches.
  • Low C (e.g., 0.5): Exploit known-good paths; good when your reward model is reliable. Risk: gets stuck in local optima.

A common heuristic: start with C = √2 ≈ 1.414 and tune based on whether your agent is exploring too broadly or too narrowly in practice.

# Adaptive exploration: reduce C as iterations increase
def adaptive_C(iteration: int, max_iterations: int) -> float:
    """Start exploratory, become exploitative as budget runs low."""
    progress = iteration / max_iterations
    C_start = 2.0
    C_end   = 0.5
    return C_start + (C_end - C_start) * progress

Practical Considerations

Memory: Trees can grow large. At branching factor 5 and depth 10, you have up to 5^10 ≈ 10M nodes (though MCTS typically explores far fewer). After each round, prune unvisited nodes (N == 0) and visited nodes whose mean reward Q/N falls below a threshold.
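
One way to implement that pruning pass (a sketch; the threshold value is illustrative, and the fields match the Node attributes used in the pseudocode above):

```python
def prune(node, q_threshold: float = -0.1) -> None:
    """Recursively drop children that are unvisited or score below threshold."""
    node.children = [
        c for c in node.children
        if c.visit_count > 0 and c.cumulative_reward / c.visit_count >= q_threshold
    ]
    for child in node.children:
        prune(child, q_threshold)
```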

Parallelism: MCTS is naturally parallelisable. You can expand multiple branches simultaneously using async tool calls, dramatically reducing wall-clock time.
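
A minimal asyncio sketch of that idea: the K candidate tool calls from one expansion run concurrently, and each resulting (action, observation) pair is attached as a child exactly as in the sequential loop. The execute_tool_async callable is an assumed async variant of execute_tool:

```python
import asyncio

async def gather_observations(candidate_actions: list, tools: list,
                              execute_tool_async) -> list:
    """Run all candidate tool calls concurrently; observations return in order."""
    return await asyncio.gather(
        *(execute_tool_async(action, tools) for action in candidate_actions)
    )
```

Wall-clock time for one expansion then scales with the slowest tool call rather than the sum of all K.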

Termination: Define explicit terminal conditions: (a) agent outputs a final answer, (b) maximum trajectory length reached, (c) reward exceeds acceptance threshold.
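
These can be folded into a single predicate (a sketch; the "FINISH" sentinel and the numeric limits are illustrative defaults, not values fixed by the algorithm):

```python
def is_terminal(trajectory: list, reward: float,
                max_steps: int = 15, accept_threshold: float = 0.9) -> bool:
    """True when the search should stop extending this trajectory."""
    if trajectory and trajectory[-1][0] == "FINISH":   # (a) explicit final answer
        return True
    if len(trajectory) >= max_steps:                   # (b) trajectory length cap
        return True
    return reward >= accept_threshold                  # (c) reward acceptance
```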

Tip: LATS is most valuable for tasks where there are multiple valid solution paths but some are significantly more efficient than others — for example, multi-step code debugging, scientific hypothesis testing, or complex web navigation. For tasks with a single obvious solution path, the overhead rarely pays off.


Summary

LATS is the most computationally intensive pattern in this module, but also the most powerful for tasks with a large, branching solution space. By applying MCTS to agent trajectories, it gains the ability to explore many possible action sequences in parallel, prune dead ends early, and converge on high-quality solutions without committing to a single path upfront. The key implementation challenges are (1) designing a reliable reward model and (2) managing the exponential growth of the search tree within a reasonable compute budget.