The Reflexion Pattern: Agents That Learn From Their Own Mistakes
The Core Insight
Humans rarely solve hard problems on the first attempt. We draft, review, critique, and revise. Until recently, LLM agents lacked this ability — they produced a single output and stopped. The Reflexion pattern, introduced by Shinn et al. (2023), gives agents a structured mechanism for self-improvement through verbal reinforcement learning.
Rather than updating model weights (which requires training infrastructure and days of compute), Reflexion stores self-generated feedback in an episodic memory buffer that informs subsequent attempts. The agent effectively teaches itself within the context window — no fine-tuning required.
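Stripped of the LLM calls, the control flow is simple enough to sketch in plain Python. The toy `actor`, `evaluator`, and `reflector` callables below are illustrative stand-ins (not from the Reflexion paper); the point is the episodic memory buffer that threads critiques into the next attempt:

```python
from typing import Callable

def reflexion_loop(
    actor: Callable[[list[str]], str],             # memory -> candidate output
    evaluator: Callable[[str], tuple[bool, str]],  # candidate -> (passed, verdict)
    reflector: Callable[[str, str], str],          # (candidate, verdict) -> critique
    max_iterations: int = 3,
) -> tuple[str, list[str]]:
    """Run the actor/evaluator/reflector cycle until success or the cap."""
    memory: list[str] = []  # episodic buffer of verbal critiques
    candidate = ""
    for _ in range(max_iterations):
        candidate = actor(memory)
        passed, verdict = evaluator(candidate)
        if passed:
            break
        # "Learning" happens here: a critique, not a gradient update
        memory.append(reflector(candidate, verdict))
    return candidate, memory

# Toy instantiation: the first attempt is wrong; memory steers the retry
target = "hello world"
result, mem = reflexion_loop(
    actor=lambda m: target if m else "hello",
    evaluator=lambda c: (c == target, f"expected {target!r}, got {c!r}"),
    reflector=lambda c, v: f"Output too short: {v}",
)
print(result)  # → hello world
```

The real pattern replaces the three lambdas with LLM calls and a test harness, but the loop shape is exactly this.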
The Three-Actor Architecture
The Reflexion paper defines three distinct components that work in a closed feedback loop:
```
Task ──► ┌─────────┐  trajectory   ┌───────────┐
         │  Actor  │ ────────────► │ Evaluator │
         └─────────┘               └─────┬─────┘
              ▲                          │ score
              │                          ▼
         ┌────┴──────────────────────────────────┐
         │         Self-Reflection LLM           │
         │  "What went wrong? How to fix it?"    │
         └───────────────────────────────────────┘
           reflection stored in episodic memory,
           injected into the next Actor attempt
```
The Actor
The Actor is a standard agent — ReAct, tool-calling, or code-generating — that produces a trajectory: a sequence of thoughts, actions, and observations culminating in an output. Its only job is to attempt the task. It has no awareness of whether it succeeded.
The Evaluator
The Evaluator scores the Actor's trajectory. Critically, the evaluator can be:
- Deterministic — unit tests that pass or fail (ideal for coding tasks)
- Heuristic — string matching, format validation, JSON schema checks
- LLM-based — a separate judge model scoring output quality
Binary pass/fail signals (such as from unit tests) make Reflexion most reliable. Fuzzy LLM scoring introduces variance that can mislead the reflection step.
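Between those two extremes, a heuristic evaluator can be as small as a schema check. The sketch below is illustrative (not from the paper): it scores an output by JSON validity plus the presence of required keys, returning the same normalized-score shape a deterministic evaluator would.

```python
import json

def heuristic_evaluator(output: str, required_keys: list[str]) -> dict:
    """Score an output by JSON validity and presence of required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return {"score": 0.0, "passed": False, "feedback": f"Invalid JSON: {exc}"}
    if not required_keys:  # nothing to check beyond parseability
        return {"score": 1.0, "passed": True, "feedback": "OK"}
    missing = [k for k in required_keys if k not in data]
    score = 1.0 - len(missing) / len(required_keys)
    return {
        "score": score,
        "passed": not missing,
        "feedback": f"Missing keys: {missing}" if missing else "OK",
    }

print(heuristic_evaluator('{"name": "a"}', ["name", "age"]))  # scores 0.5, 'age' missing
```

The `feedback` string matters as much as the score: it is what the Self-Reflection LLM will reason over.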
The Self-Reflection LLM
This is the heart of the pattern. Given the failed trajectory and the evaluator's verdict, the Self-Reflection LLM generates a verbal critique: a natural language diagnosis of what went wrong and what to try differently next. This critique is stored in the agent's memory and prepended to the next Actor invocation.
Implementation with LangGraph
LangGraph's graph-based approach maps naturally onto Reflexion's looping structure:
```python
from typing import Annotated, TypedDict
import operator

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import END, StateGraph

# ─── State Schema ─────────────────────────────────────────────────────────────

class ReflexionState(TypedDict):
    task: str
    task_metadata: dict                              # Holds test cases
    attempts: Annotated[list[dict], operator.add]    # Accumulated over iterations
    reflections: Annotated[list[str], operator.add]  # Verbal feedback history
    current_output: str
    eval_results: list[dict]
    score: float
    passed: bool
    iteration: int
    max_iterations: int

# ─── Actor Node ───────────────────────────────────────────────────────────────

ACTOR_SYSTEM = """You are an expert programmer. Solve the given task.
Output ONLY executable Python code wrapped in ```python ... ``` fences.
Include all necessary imports inside the code block.
"""

def actor_node(state: ReflexionState) -> dict:
    """
    The Actor attempts the task, informed by all prior reflections.
    Reflections are injected as persistent context — the agent's working memory.
    """
    llm = ChatOpenAI(model="gpt-4o", temperature=0.2)
    messages = [SystemMessage(content=ACTOR_SYSTEM)]
    if state["reflections"]:
        memory = "\n\n".join([
            f"Attempt {i + 1} — what went wrong:\n{r}"
            for i, r in enumerate(state["reflections"])
        ])
        messages.append(HumanMessage(
            content=(
                f"Previous attempts and lessons learned:\n{memory}\n\n"
                f"Now solve the task:\n{state['task']}"
            )
        ))
    else:
        messages.append(HumanMessage(content=f"Task:\n{state['task']}"))

    response = llm.invoke(messages)

    # Strip markdown fences if present, tolerating a missing language tag
    raw = response.content.strip()
    if "```python" in raw:
        code = raw.split("```python", 1)[1].split("```", 1)[0].strip()
    elif "```" in raw:
        code = raw.split("```", 1)[1].split("```", 1)[0].strip()
    else:
        code = raw

    return {
        "current_output": code,
        "attempts": [{"iteration": state["iteration"], "output": code}],
        "iteration": state["iteration"] + 1,
    }

# ─── Evaluator Node ───────────────────────────────────────────────────────────

def evaluator_node(state: ReflexionState) -> dict:
    """
    Run the Actor's generated code against the test suite.
    Returns a normalized score and a detailed result list.

    NOTE: `exec` on model-generated code is unsafe outside a sandbox;
    use a subprocess, container, or restricted runtime in production.
    """
    code = state["current_output"]
    tests = state["task_metadata"]["tests"]
    passed_count = 0
    results = []
    for test in tests:
        try:
            exec_globals: dict = {}
            exec(code, exec_globals)
            func = exec_globals[test["function"]]
            actual = func(*test["args"])
            success = actual == test["expected"]
        except Exception as exc:
            success = False
            actual = f"ERROR: {exc}"
        results.append({
            "test": test,
            "passed": success,
            "actual": actual,
        })
        if success:
            passed_count += 1
    score = passed_count / len(tests) if tests else 0.0
    return {
        "score": score,
        "passed": score == 1.0,
        "eval_results": results,
    }

# ─── Self-Reflection Node ─────────────────────────────────────────────────────

REFLECTION_SYSTEM = """You are performing structured self-reflection on a failed coding attempt.
Write a concise, actionable critique (3-5 sentences) that:
- Identifies the ROOT CAUSE of each failure (not just the symptom)
- Is concrete enough to prevent the exact same mistake next time
- Proposes a specific fix or alternative approach

Bad: "The code failed. Try again more carefully."
Good: "The function fails on negative inputs because the guard clause on line 2 uses
strict greater-than instead of greater-than-or-equal. The boundary case where
n == 0 must be handled explicitly as a base case returning 1, before the
recursive call."
"""

def reflection_node(state: ReflexionState) -> dict:
    """
    Generate verbal self-reflection from a failed attempt.
    The critique becomes persistent memory injected into the next Actor call.
    """
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    eval_summary = "\n".join([
        f"  Test {i + 1}: {'✓ PASS' if r['passed'] else '✗ FAIL'} | "
        f"Input: {r['test']['args']} | "
        f"Expected: {r['test']['expected']} | "
        f"Got: {r['actual']}"
        for i, r in enumerate(state["eval_results"])
    ])
    prompt = (
        f"Task:\n{state['task']}\n\n"
        f"Generated Code:\n```python\n{state['current_output']}\n```\n\n"
        f"Test Results (Score: {state['score']:.0%}):\n{eval_summary}\n\n"
        "Write your self-reflection:"
    )
    response = llm.invoke([
        SystemMessage(content=REFLECTION_SYSTEM),
        HumanMessage(content=prompt),
    ])
    return {"reflections": [response.content]}

# ─── Routing ──────────────────────────────────────────────────────────────────

def should_continue(state: ReflexionState) -> str:
    """Decide whether to reflect, stop on success, or give up after max iterations."""
    if state["passed"]:
        return "success"
    if state["iteration"] >= state["max_iterations"]:
        return "give_up"
    return "reflect"

# ─── Graph Assembly ───────────────────────────────────────────────────────────

def build_reflexion_graph():
    graph = StateGraph(ReflexionState)
    graph.add_node("actor", actor_node)
    graph.add_node("evaluator", evaluator_node)
    graph.add_node("reflect", reflection_node)
    graph.set_entry_point("actor")
    graph.add_edge("actor", "evaluator")
    graph.add_conditional_edges(
        "evaluator",
        should_continue,
        {"reflect": "reflect", "success": END, "give_up": END},
    )
    graph.add_edge("reflect", "actor")
    return graph.compile()
```
Running the Reflexion Agent
```python
initial_state: ReflexionState = {
    "task": (
        "Write a Python function `count_islands(grid)` that takes a 2D list of "
        "'1' and '0' strings and returns the number of distinct islands. "
        "An island is formed by adjacent '1's connected horizontally or vertically."
    ),
    "task_metadata": {
        "tests": [
            {
                "function": "count_islands",
                "args": [[["1", "1", "0"], ["0", "1", "0"], ["0", "0", "1"]]],
                "expected": 2,
            },
            {
                "function": "count_islands",
                "args": [[["1", "1", "1"], ["0", "1", "0"], ["1", "1", "1"]]],
                "expected": 1,
            },
            {
                "function": "count_islands",
                "args": [[["0", "0", "0"], ["0", "0", "0"]]],
                "expected": 0,
            },
        ]
    },
    "attempts": [],
    "reflections": [],
    "current_output": "",
    "eval_results": [],
    "score": 0.0,
    "passed": False,
    "iteration": 0,
    "max_iterations": 4,
}

agent = build_reflexion_graph()
final = agent.invoke(initial_state)

print(f"Solved: {final['passed']} | Attempts: {final['iteration']} | Score: {final['score']:.0%}")
for i, r in enumerate(final["reflections"]):
    print(f"\n--- Reflection after attempt {i + 1} ---\n{r}")
```
Memory Window Strategies
As the agent iterates, reflections accumulate. Managing this context is critical for both cost and quality:
| Strategy | Description | Tradeoff |
|---|---|---|
| Full history | All reflections prepended | High token cost, best recall |
| Sliding window | Last N reflections only | Budget-friendly, may lose early insights |
| Summarized | Reflections summarized into one | Compressed, risks losing specifics |
| External store | Stored in vector DB, retrieved | Cross-session learning, added complexity |
```python
def format_reflection_memory(
    reflections: list[str],
    max_chars: int = 2000,
) -> str:
    """
    Truncate reflection history to fit a character budget.
    Prioritizes the most recent reflections — they are most relevant.
    """
    lines = []
    remaining = max_chars
    # Walk newest-to-oldest, keeping whole entries until the budget runs out
    for i, reflection in enumerate(reversed(reflections)):
        header = f"Attempt {len(reflections) - i} reflection:"
        entry = f"{header}\n{reflection}"
        if len(entry) > remaining:
            break
        lines.insert(0, entry)
        remaining -= len(entry)
    return "\n\n".join(lines)
```
When Reflexion Excels — and When It Fails
Ideal conditions
- The evaluator is deterministic: test pass/fail, schema validation, exact match
- The task has a verifiable ground truth: coding challenges, structured extraction, math proofs
- The model is capable enough to correctly diagnose its own mistakes from the error signal
Limitations
- Fuzzy LLM-scored evaluation introduces noise; the agent may optimize for the judge rather than the actual task
- Knowledge gaps cannot be fixed by reflection — if the model lacks a fact, it will keep hallucinating it
- Compounding errors: if early reflections are wrong, they can mislead later attempts
- Each iteration multiplies LLM cost: 3 iterations = roughly 3× the inference spend
Practical Rule: Cap Reflexion at 3 iterations for production systems. Empirical results from the original paper show diminishing returns beyond this threshold for most task categories.
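In fact, because each retry re-sends the task plus all accumulated reflections, spend grows slightly faster than linearly in the iteration count. A back-of-envelope estimator makes this concrete (the token counts below are illustrative assumptions, not measurements, and the formula assumes the worst case where every attempt fails until the cap):

```python
def estimate_tokens(base_prompt: int, output: int, reflection: int, iterations: int) -> int:
    """Worst-case total tokens: attempt k carries k prior reflections as context."""
    total = 0
    for k in range(iterations):
        # Actor call: base prompt + accumulated reflections + generated output
        total += base_prompt + k * reflection + output
        # Reflection call after every failure except the final give-up
        if k < iterations - 1:
            total += base_prompt + output + reflection
    return total

# Illustrative budget: 500-token prompt, 400-token outputs, 150-token reflections
print(estimate_tokens(500, 400, 150, 1))  # 900
print(estimate_tokens(500, 400, 150, 3))  # 5250 — more than 3 × 900
```

The gap between "roughly 3×" and the true figure widens as reflections get longer, which is another argument for the memory-window strategies above.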
Observing Improvement Across Attempts
Here is a realistic trajectory on a coding task:
| Attempt | Score | Failure Mode | Reflection Key Insight |
|---|---|---|---|
| 1 | 33% | Missing edge case: empty grid | "Guard against len(grid) == 0 before accessing grid[0]" |
| 2 | 67% | DFS wrongly treats diagonals as connected | "Islands are 4-directional only; remove diagonal moves from DFS" |
| 3 | 100% | — | — |
Note: The original Reflexion paper demonstrated significant gains on HotpotQA (+14%), AlfWorld (+22%), and HumanEval (+17%) over standard ReAct baselines. The coding improvements were the most pronounced because test suites provide unambiguous binary evaluation signals.
Key Takeaways
- Reflexion replaces gradient descent with verbal reinforcement — the agent's self-critique is its training signal, operating entirely within the inference context.
- The evaluator quality is the single most important design decision. Binary, deterministic signals produce the best feedback loops.
- The pattern is most powerful for iterative correctness tasks: code generation, structured data extraction, mathematical reasoning.
- Improvements are ephemeral by default — they exist only within the current session. Persist reflections to a vector store to enable cross-session learning.
- Always pair Reflexion with cost controls: unlimited reflection loops can exhaust token budgets on hard tasks that the model fundamentally cannot solve.
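As a minimal sketch of that cross-session persistence, a flat JSON file can stand in for a vector store (the `ReflectionStore` class and its file path are illustrative, not part of any library):

```python
import json
from pathlib import Path

class ReflectionStore:
    """Persist reflections keyed by task so later sessions can reuse them."""

    def __init__(self, path: str = "reflections.json"):
        self.path = Path(path)

    def _load(self) -> dict:
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def add(self, task_id: str, reflection: str) -> None:
        data = self._load()
        data.setdefault(task_id, []).append(reflection)
        self.path.write_text(json.dumps(data, indent=2))

    def recall(self, task_id: str) -> list[str]:
        """Seed a new session's `reflections` state with past lessons."""
        return self._load().get(task_id, [])

# Demo: start from a clean file, store one lesson, recall it later
Path("/tmp/reflections_demo.json").unlink(missing_ok=True)
store = ReflectionStore("/tmp/reflections_demo.json")
store.add("count_islands", "Handle the empty-grid edge case before indexing grid[0].")
print(store.recall("count_islands"))
```

A production version would embed each reflection and retrieve by task similarity rather than exact key, but the lifecycle — write after each failure, read before the first attempt — is the same.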