LATS: Language Agent Tree Search
From Linear to Arboreal Reasoning
Every agent pattern we've studied so far — ReAct, Plan-and-Execute, Reflexion — follows a fundamentally linear trajectory. The agent makes decisions, and if those decisions turn out to be wrong, it backtracks via replanning or reflection. But by the time it backtracks, it has already committed to a path and paid the cost of traversing it.
Language Agent Tree Search (LATS), introduced by Zhou et al. (2023), borrows the most powerful idea from game-playing AI — Monte Carlo Tree Search (MCTS) — and applies it to language agent decision-making. Rather than a single linear trajectory, LATS builds a tree of possible action sequences, intelligently exploring the most promising branches while pruning dead ends.
This gives agents something they've never had before: the ability to look ahead before committing, and to backtrack at any depth without losing information about other branches.
Background: Monte Carlo Tree Search
MCTS was famously used in AlphaGo. Its four phases apply directly to language agents:
┌─────────────────┐
│ ROOT NODE │
│ (initial state)│
└────────┬────────┘
│
┌──────────────────┼──────────────────┐
▼ ▼ ▼
[Action A] [Action B] [Action C]
│ │ │
[A → A1] [B → B1] ✗ (pruned)
[A → A2] [B → B2]
│
[A1 → A1a] ◀── MCTS selects this branch for expansion
The Four MCTS Phases
| Phase | What Happens | In Language Agents |
|---|---|---|
| Selection | Traverse the tree using UCB1 to balance exploration vs. exploitation | Choose which node (partial trajectory) to extend next |
| Expansion | Generate child nodes from selected node | Sample K candidate next actions from the LLM |
| Simulation | Roll out a complete trajectory from the new node | Run the agent to completion from this state |
| Backpropagation | Update node statistics up the tree | Propagate reward/value scores back to ancestors |
The UCB1 selection formula used in LATS:
UCB1(node) = Q(node) / N(node) + C × √(ln(N(parent)) / N(node))
Where:
- Q(node) = cumulative reward for this node
- N(node) = number of times this node has been visited
- C = exploration constant (typically √2)
High Q/N exploits known good paths. High √(ln(N_parent)/N) explores rarely-visited nodes.
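As a concrete illustration (not from the paper), the formula can be computed directly. Here a well-explored, high-value child loses to a barely-visited sibling because the exploration bonus dominates when N is small:

```python
import math

def ucb1(q: float, n: int, n_parent: int, c: float = math.sqrt(2)) -> float:
    """UCB1 score: average reward plus an exploration bonus."""
    return q / n + c * math.sqrt(math.log(n_parent) / n)

# Two children of a parent visited 10 times:
score_a = ucb1(q=4.0, n=8, n_parent=10)  # exploitation-dominated
score_b = ucb1(q=0.2, n=1, n_parent=10)  # exploration-dominated
# score_b > score_a: the rarely-visited node is selected next
```

With hypothetical numbers like these, the low-value but unexplored child B wins selection; once it accumulates visits, its bonus shrinks and the exploit term takes over.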
LATS Architecture
┌─────────────────────────────────────────────────────────────────┐
│ LATS Agent │
│ │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌─────────┐ │
│ │ Tree │ │ Node │ │ Reward │ │ Memory │ │
│ │ Manager │◀──▶│ Expander │◀──▶│ Model │ │ Buffer │ │
│ │ (MCTS) │ │ (LLM) │ │ (LLM) │ │ │ │
│ └──────────┘ └───────────┘ └──────────┘ └─────────┘ │
│ │ │
│ ┌────▼──────────────────────────────────────────────────┐ │
│ │ Search Tree │ │
│ │ root ──▶ [step1_A] ──▶ [step2_A1] ──▶ [step3_A1a] │ │
│ │ │ │ │
│ │ └──▶ [step1_B] ──▶ [step2_B1] (pruned: -0.2) │ │
│ │ └──▶ [step2_B2] ──▶ ??? (expand) │ │
│ └────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Each node in the LATS tree represents a partial trajectory — the state of the world after a sequence of (thought, action, observation) triples. Each node stores:
- The trajectory up to this point
- The node's value estimate (Q)
- The visit count (N)
- A list of child nodes
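A minimal `Node` sketch consistent with these four fields (the `final_answer` terminal marker is an assumption for illustration, not part of the paper):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """One node in the LATS search tree (minimal sketch)."""
    state: list                          # (action, observation) pairs so far
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)
    visit_count: int = 0                 # N in the UCB1 formula
    cumulative_reward: float = 0.0       # Q in the UCB1 formula

    def add_child(self, child: "Node") -> None:
        self.children.append(child)

    def is_terminal(self) -> bool:
        # Assumed convention: a trajectory ends with a final_answer action
        return bool(self.state) and self.state[-1][0] == "final_answer"
```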
LATS Pseudocode
# High-level LATS algorithm pseudocode
def lats(task: str, tools: list, llm, max_iterations: int = 50) -> str:
"""
Language Agent Tree Search.
Returns the best trajectory found within the iteration budget.
"""
root = Node(state=initial_state(task), parent=None)
best_trajectory = None
best_score = float('-inf')
for _ in range(max_iterations):
# ── SELECTION ────────────────────────────────────────────
# Traverse from root, picking child with highest UCB1 score
node = root
while node.children and not node.is_terminal():
node = select_by_ucb1(node)
# ── EXPANSION ────────────────────────────────────────────
if not node.is_terminal():
# Sample K candidate next actions from the LLM
candidate_actions = llm.sample_actions(
            trajectory=node.state,
task=task,
tools=tools,
n=5, # branching factor
temperature=0.8, # diversity in actions
)
for action in candidate_actions:
observation = execute_tool(action, tools)
child = Node(
state=node.state + [(action, observation)],
parent=node,
)
node.add_child(child)
        # Pick one child to simulate (here simply the first; a random
        # choice among the new children also works)
        node = node.children[0]
# ── SIMULATION ────────────────────────────────────────────
# Roll out to completion from this node
terminal_trajectory = rollout(node, task, tools, llm)
# ── EVALUATION ────────────────────────────────────────────
        score = reward_model(task, terminal_trajectory, llm)
if score > best_score:
best_score = score
best_trajectory = terminal_trajectory
# ── BACKPROPAGATION ───────────────────────────────────────
# Walk back up to root, updating Q and N
propagate_reward(node, score)
return extract_answer(best_trajectory)
import math

def select_by_ucb1(node: Node, C: float = 1.414) -> Node:
    """Select the child node with the highest Upper Confidence Bound."""
best_child = None
best_score = float('-inf')
for child in node.children:
if child.visit_count == 0:
return child # always explore unvisited nodes first
ucb = (child.cumulative_reward / child.visit_count +
C * math.sqrt(math.log(node.visit_count) / child.visit_count))
if ucb > best_score:
best_score = ucb
best_child = child
return best_child
def propagate_reward(node: Node, score: float) -> None:
"""Backpropagate reward from leaf to root."""
current = node
while current is not None:
current.visit_count += 1
current.cumulative_reward += score
current = current.parent
The Reward Model
The reward model is the most critical and most difficult component. For LATS to work, the reward model must distinguish between good and bad partial trajectories before the task is complete.
REWARD_PROMPT = """You are an expert evaluator assessing how well an agent
is progressing on a task.
Task: {task}
Agent's trajectory so far:
{trajectory}
Rate the quality of this trajectory on a scale from -1.0 to 1.0:
-1.0: Clearly wrong direction, wasted steps, or made the task harder
0.0: Neither helpful nor harmful; neutral progress
+0.5: Making reasonable progress with good tool choices
+1.0: Excellent progress; likely to reach the correct answer
Reply with ONLY a float between -1.0 and 1.0, nothing else.
"""
def reward_model(task: str, trajectory: list, llm) -> float:
"""Evaluate trajectory quality using LLM-as-judge."""
trajectory_text = format_trajectory(trajectory)
response = llm.invoke(
REWARD_PROMPT.format(task=task, trajectory=trajectory_text)
)
try:
return float(response.content.strip())
except ValueError:
return 0.0 # default if model output is malformed
Important: The reward model is a bottleneck. It is called once per node expansion. In a tree with branching factor 5 and 50 iterations, that's potentially 250 LLM calls just for scoring. Use the smallest model that achieves acceptable reward quality.
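One cheap mitigation is to memoize scores so re-visited trajectories are never re-scored. A hypothetical wrapper (the `reward_fn` parameter stands in for the `reward_model` above):

```python
def make_cached_reward(llm, reward_fn):
    """Wrap a reward function with an in-memory cache (sketch).

    Trajectories are keyed by their string form; identical partial
    trajectories reached via different tree paths are scored once.
    """
    cache: dict = {}

    def cached(task: str, trajectory: list) -> float:
        key = (task, tuple(map(str, trajectory)))
        if key not in cache:
            cache[key] = reward_fn(task, trajectory, llm)
        return cache[key]

    return cached
```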
LATS vs. Reflexion vs. Plan-and-Execute
| Dimension | Reflexion | Plan-and-Execute | LATS |
|---|---|---|---|
| Search strategy | Linear, retry | Linear, replan | Tree search |
| Backtracking depth | Task-level | Step-level | Node-level (any depth) |
| Parallelism | Sequential | Sequential | Branches explorable in parallel |
| Compute cost | Medium | Medium | High |
| Best for | Single-attempt improvement | Long sequential tasks | Complex tasks with many decision points |
| Handles dead ends | Poorly | Moderately | Well (prunes via backprop) |
| Explainability | High (explicit reflection) | High (explicit plan) | Moderate (tree structure visible) |
Exploration vs. Exploitation Tuning
The exploration constant C in UCB1 is your primary tuning knob:
- High C (e.g., 2.0): Explore broadly; good when you don't know which paths are promising. Risk: wastes budget on clearly bad branches.
- Low C (e.g., 0.5): Exploit known-good paths; good when your reward model is reliable. Risk: gets stuck in local optima.
A common heuristic: start with C = √2 ≈ 1.414 and tune based on whether your agent is exploring too broadly or too narrowly in practice.
# Adaptive exploration: reduce C as iterations increase
def adaptive_C(iteration: int, max_iterations: int) -> float:
    """Start exploratory, become exploitative as the budget runs low."""
    progress = iteration / max_iterations
    C_start = 2.0
    C_end = 0.5
    return C_start + (C_end - C_start) * progress
Practical Considerations
Memory: Trees can grow large. At branching factor 5 and depth 10, you have up to 5^10 ≈ 10M nodes (though MCTS typically explores far fewer). Prune nodes with N == 0 and Q < threshold after each round.
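A recursive pruning pass along those lines might look like this (sketch; the threshold value is illustrative):

```python
def prune(node, q_threshold: float = -0.5) -> None:
    """Drop unvisited subtrees whose value estimate is clearly bad (sketch).

    Run after each MCTS round; assumes nodes expose visit_count,
    cumulative_reward, and children as in the pseudocode above.
    """
    node.children = [
        c for c in node.children
        if not (c.visit_count == 0 and c.cumulative_reward < q_threshold)
    ]
    for child in node.children:
        prune(child, q_threshold)
```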
Parallelism: MCTS is naturally parallelisable. You can expand multiple branches simultaneously using async tool calls, dramatically reducing wall-clock time.
Termination: Define explicit terminal conditions: (a) agent outputs a final answer, (b) maximum trajectory length reached, (c) reward exceeds acceptance threshold.
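The three conditions fold into a single check (sketch; the `final_answer` marker, step cap, and acceptance threshold values are illustrative assumptions):

```python
def should_terminate(trajectory: list, reward: float,
                     max_steps: int = 15, accept: float = 0.9) -> bool:
    """True if any of the three terminal conditions above holds (sketch)."""
    has_answer = bool(trajectory) and trajectory[-1][0] == "final_answer"
    return has_answer or len(trajectory) >= max_steps or reward >= accept
```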
Tip: LATS is most valuable for tasks where there are multiple valid solution paths but some are significantly more efficient than others — for example, multi-step code debugging, scientific hypothesis testing, or complex web navigation. For tasks with a single obvious solution path, the overhead rarely pays off.
Summary
LATS is the most computationally intensive pattern in this module, but also the most powerful for tasks with a large, branching solution space. By applying MCTS to agent trajectories, it gains the ability to explore many possible action sequences in parallel, prune dead ends early, and converge on high-quality solutions without committing to a single path upfront. The key implementation challenges are (1) designing a reliable reward model and (2) managing the exponential growth of the search tree within a reasonable compute budget.