Beyond ReAct

Reflexion and Self-Critique

The Reflexion Pattern: Agents That Learn From Their Own Mistakes

The Core Insight

Humans rarely solve hard problems on the first attempt. We draft, review, critique, and revise. Until recently, LLM agents lacked this ability — they produced a single output and stopped. The Reflexion pattern, introduced by Shinn et al. (2023), gives agents a structured mechanism for self-improvement through verbal reinforcement learning.

Rather than updating model weights (which requires training infrastructure and days of compute), Reflexion stores self-generated feedback in an episodic memory buffer that informs subsequent attempts. The agent effectively teaches itself within the context window — no fine-tuning required.
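Stripped of any framework, the loop can be sketched in a few lines. The `actor`, `evaluator`, and `reflect` callables here are placeholders for the three components described below, not part of any library:

```python
def reflexion_loop(task, actor, evaluator, reflect, max_iters=3):
    """Verbal reinforcement in miniature: failed attempts leave behind
    critiques, which the actor reads on every subsequent try."""
    memory: list[str] = []                    # episodic memory buffer
    output = None
    for _ in range(max_iters):
        output = actor(task, memory)          # attempt, informed by critiques
        score, feedback = evaluator(output)   # e.g. fraction of tests passed
        if score == 1.0:
            break
        memory.append(reflect(task, output, feedback))  # the "training signal"
    return output, memory
```

The only state that survives between attempts is `memory`, a list of natural-language critiques; that list is the entire learning mechanism.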


The Three-Actor Architecture

The Reflexion paper defines three distinct components that work in a closed feedback loop:

                  ┌──────────────────────────────────────────────┐
                  │             REFLEXION AGENT LOOP             │
                  │                                              │
  Task ──────────►│  ┌─────────┐  Trajectory  ┌─────────────┐    │
                  │  │  Actor  │ ────────────►│  Evaluator  │    │
                  │  └─────────┘              └──────┬──────┘    │
                  │       ▲                          │ Score     │
                  │       │                          ▼           │
                  │  ┌────┴───────────────────────────────────┐  │
                  │  │          Self-Reflection LLM           │  │
                  │  │   "What went wrong? How to fix it?"    │  │
                  │  └────────────────────────────────────────┘  │
                  │       │  Reflection stored in memory         │
                  │       └─────────────────────────────────────►│
                  │                  (next attempt)              │
                  └──────────────────────────────────────────────┘

The Actor

The Actor is a standard agent — ReAct, tool-calling, or code-generating — that produces a trajectory: a sequence of thoughts, actions, and observations culminating in an output. Its only job is to attempt the task. It has no awareness of whether it succeeded.
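As a rough sketch, a trajectory can be modeled as a plain data structure. The class names here are illustrative, not part of the Reflexion paper or LangGraph:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    thought: str       # the model's reasoning at this step
    action: str        # tool call or code it emitted
    observation: str   # what the environment returned

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    final_output: str = ""   # the answer the Evaluator will score
```

The Evaluator scores `final_output`; the Self-Reflection LLM reads the full `steps` sequence to diagnose where the attempt went wrong.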

The Evaluator

The Evaluator scores the Actor's trajectory. Critically, the evaluator can be:

  • Deterministic — unit tests that pass or fail (ideal for coding tasks)
  • Heuristic — string matching, format validation, JSON schema checks
  • LLM-based — a separate judge model scoring output quality

Binary pass/fail signals (such as from unit tests) make Reflexion most reliable. Fuzzy LLM scoring introduces variance that can mislead the reflection step.
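For instance, a heuristic evaluator for structured-extraction tasks can be nothing more than a JSON parse plus a key check. This is a sketch; the function name and the 0 / 0.5 / 1.0 scoring scale are made up for illustration:

```python
import json

def heuristic_evaluator(output: str, required_keys: set[str]) -> tuple[float, str]:
    """Score an agent's output on format validity alone: does it parse as
    JSON, and does it contain every required key? Returns (score, feedback);
    the feedback string feeds the self-reflection step."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return 0.0, f"Output is not valid JSON: {exc}"
    missing = required_keys - data.keys()
    if missing:
        return 0.5, f"Valid JSON but missing keys: {sorted(missing)}"
    return 1.0, "All required keys present."
```

Deterministic checks like this produce the crisp failure descriptions that make the reflection step effective.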

The Self-Reflection LLM

This is the heart of the pattern. Given the failed trajectory and the evaluator's verdict, the Self-Reflection LLM generates a verbal critique: a natural language diagnosis of what went wrong and what to try differently next. This critique is stored in the agent's memory and prepended to the next Actor invocation.


Implementation with LangGraph

LangGraph's graph-based approach maps naturally onto Reflexion's looping structure:

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from typing import TypedDict, Annotated
import operator

# ─── State Schema ─────────────────────────────────────────────────────────────

class ReflexionState(TypedDict):
    task: str
    task_metadata: dict                              # Holds test cases
    attempts: Annotated[list[dict], operator.add]   # Accumulated over iterations
    reflections: Annotated[list[str], operator.add] # Verbal feedback history
    current_output: str
    eval_results: list[dict]
    score: float
    passed: bool
    iteration: int
    max_iterations: int

# ─── Actor Node ───────────────────────────────────────────────────────────────

ACTOR_SYSTEM = """You are an expert programmer. Solve the given task.
Output ONLY executable Python code wrapped in ```python ... ``` fences.
Include all necessary imports inside the code block.
"""

def actor_node(state: ReflexionState) -> dict:
    """
    The Actor attempts the task, informed by all prior reflections.
    Reflections are injected as persistent context — the agent's working memory.
    """
    llm = ChatOpenAI(model="gpt-4o", temperature=0.2)
    messages = [SystemMessage(content=ACTOR_SYSTEM)]

    if state["reflections"]:
        memory = "\n\n".join([
            f"Attempt {i+1} — what went wrong:\n{r}"
            for i, r in enumerate(state["reflections"])
        ])
        messages.append(HumanMessage(
            content=(
                f"Previous attempts and lessons learned:\n{memory}\n\n"
                f"Now solve the task:\n{state['task']}"
            )
        ))
    else:
        messages.append(HumanMessage(content=f"Task:\n{state['task']}"))

    response = llm.invoke(messages)
    # Strip markdown fences if present (handle both ```python and bare ```)
    raw = response.content.strip()
    if "```" in raw:
        body = raw.split("```python", 1)[-1] if "```python" in raw else raw.split("```", 1)[-1]
        code = body.split("```", 1)[0].strip()
    else:
        code = raw

    return {
        "current_output": code,
        "attempts": [{"iteration": state["iteration"], "output": code}],
        "iteration": state["iteration"] + 1,
    }

# ─── Evaluator Node ───────────────────────────────────────────────────────────

def evaluator_node(state: ReflexionState) -> dict:
    """
    Run the Actor's generated code against the test suite.
    Returns a normalized score and a detailed result list.

    Caution: exec() runs model-generated code in-process with no isolation;
    sandbox it (subprocess, container) before any production use.
    """
    code = state["current_output"]
    tests = state["task_metadata"]["tests"]
    passed_count = 0
    results = []

    for test in tests:
        try:
            exec_globals: dict = {}
            exec(code, exec_globals)
            func = exec_globals[test["function"]]
            actual = func(*test["args"])
            success = actual == test["expected"]
        except Exception as exc:
            success = False
            actual = f"ERROR: {exc}"

        results.append({
            "test": test,
            "passed": success,
            "actual": actual,
        })
        if success:
            passed_count += 1

    score = passed_count / len(tests)
    return {
        "score": score,
        "passed": score == 1.0,
        "eval_results": results,
    }
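In-process `exec()` of model-generated code is the weak point of the evaluator above. A minimal hardening step, sketched here under the assumption that spawning a child interpreter is acceptable, is to run the code in a subprocess with a timeout (a real sandbox would add filesystem and network isolation on top):

```python
import subprocess
import sys

def run_sandboxed(code: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Run untrusted code in a child Python interpreter with a timeout.
    This contains crashes and infinite loops, but NOT malicious I/O;
    containers or seccomp are needed for genuine isolation."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode == 0, (proc.stdout + proc.stderr).strip()
    except subprocess.TimeoutExpired:
        return False, "TIMEOUT"
```

To evaluate a test case this way, append a `print(...)` of the function call to the code string and compare the captured stdout against the expected value.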

# ─── Self-Reflection Node ─────────────────────────────────────────────────────

REFLECTION_SYSTEM = """You are performing structured self-reflection on a failed coding attempt.

Write a concise, actionable critique (3-5 sentences) that:
- Identifies the ROOT CAUSE of each failure (not just the symptom)
- Is concrete enough to prevent the exact same mistake next time
- Proposes a specific fix or alternative approach

Bad: "The code failed. Try again more carefully."
Good: "The function fails on negative inputs because the guard clause on line 2 uses
  strict greater-than instead of greater-than-or-equal. The boundary case where
  n == 0 must be handled explicitly as a base case returning 1, before the
  recursive call."
"""

def reflection_node(state: ReflexionState) -> dict:
    """
    Generate verbal self-reflection from a failed attempt.
    The critique becomes persistent memory injected into the next Actor call.
    """
    llm = ChatOpenAI(model="gpt-4o", temperature=0)

    eval_summary = "\n".join([
        f"  Test {i+1}: {'✓ PASS' if r['passed'] else '✗ FAIL'} | "
        f"Input: {r['test']['args']} | "
        f"Expected: {r['test']['expected']} | "
        f"Got: {r['actual']}"
        for i, r in enumerate(state["eval_results"])
    ])

    prompt = (
        f"Task:\n{state['task']}\n\n"
        f"Generated Code:\n```python\n{state['current_output']}\n```\n\n"
        f"Test Results (Score: {state['score']:.0%}):\n{eval_summary}\n\n"
        "Write your self-reflection:"
    )

    response = llm.invoke([
        SystemMessage(content=REFLECTION_SYSTEM),
        HumanMessage(content=prompt),
    ])
    return {"reflections": [response.content]}

# ─── Routing ──────────────────────────────────────────────────────────────────

def should_continue(state: ReflexionState) -> str:
    """Decide whether to reflect, stop on success, or give up after max iterations."""
    if state["passed"]:
        return "success"
    if state["iteration"] >= state["max_iterations"]:
        return "give_up"
    return "reflect"

# ─── Graph Assembly ───────────────────────────────────────────────────────────

def build_reflexion_graph():
    graph = StateGraph(ReflexionState)

    graph.add_node("actor", actor_node)
    graph.add_node("evaluator", evaluator_node)
    graph.add_node("reflect", reflection_node)

    graph.set_entry_point("actor")
    graph.add_edge("actor", "evaluator")
    graph.add_conditional_edges(
        "evaluator",
        should_continue,
        {"reflect": "reflect", "success": END, "give_up": END},
    )
    graph.add_edge("reflect", "actor")

    return graph.compile()

Running the Reflexion Agent

initial_state: ReflexionState = {
    "task": (
        "Write a Python function `count_islands(grid)` that takes a 2D list of "
        "'1' and '0' strings and returns the number of distinct islands. "
        "An island is formed by adjacent '1's connected horizontally or vertically."
    ),
    "task_metadata": {
        "tests": [
            {
                "function": "count_islands",
                "args": [[["1","1","0"],["0","1","0"],["0","0","1"]]],
                "expected": 2,
            },
            {
                "function": "count_islands",
                "args": [[["1","1","1"],["0","1","0"],["1","1","1"]]],
                "expected": 1,
            },
            {
                "function": "count_islands",
                "args": [[["0","0","0"],["0","0","0"]]],
                "expected": 0,
            },
        ]
    },
    "attempts": [],
    "reflections": [],
    "current_output": "",
    "eval_results": [],
    "score": 0.0,
    "passed": False,
    "iteration": 0,
    "max_iterations": 4,
}

agent = build_reflexion_graph()
final = agent.invoke(initial_state)

print(f"Solved: {final['passed']} | Attempts: {final['iteration']} | Score: {final['score']:.0%}")
for i, r in enumerate(final["reflections"]):
    print(f"\n--- Reflection after attempt {i+1} ---\n{r}")

Memory Window Strategies

As the agent iterates, reflections accumulate. Managing this context is critical for both cost and quality:

  Strategy          Description                       Tradeoff
  ────────────────  ────────────────────────────────  ────────────────────────────────────────
  Full history      All reflections prepended         High token cost, best recall
  Sliding window    Last N reflections only           Budget-friendly, may lose early insights
  Summarized        Reflections summarized into one   Compressed, risks losing specifics
  External store    Stored in vector DB, retrieved    Cross-session learning, added complexity

A character-budget helper for the sliding-window strategy:

def format_reflection_memory(
    reflections: list[str],
    max_chars: int = 2000,
) -> str:
    """
    Truncate reflection history to fit a character budget.
    Prioritizes the most recent reflections — they are most relevant.
    """
    lines = []
    remaining = max_chars
    for i, reflection in enumerate(reversed(reflections)):
        header = f"Attempt {len(reflections) - i} reflection:"
        entry = f"{header}\n{reflection}"
        if len(entry) > remaining:
            break
        lines.insert(0, entry)
        remaining -= len(entry)
    return "\n\n".join(lines)
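The summarized strategy can be sketched the same way. Here `summarize` is any text-to-text callable, such as a bound LLM invocation; the helper name and prompt wording are assumptions, not a library API:

```python
from typing import Callable

def summarize_reflections(
    reflections: list[str],
    summarize: Callable[[str], str],
) -> list[str]:
    """Collapse the full reflection history into a single condensed entry.
    Keeps memory cost roughly constant, at the risk of losing specifics."""
    if len(reflections) <= 1:
        return reflections           # nothing to compress yet
    joined = "\n\n".join(reflections)
    prompt = (
        "Condense these lessons from failed attempts into one paragraph, "
        f"preserving every concrete fix:\n\n{joined}"
    )
    return [summarize(prompt)]
```

Calling this after each reflection step keeps the `reflections` list at length one, trading recall for a fixed token budget.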

When Reflexion Excels — and When It Fails

Ideal conditions

  • The evaluator is deterministic: test pass/fail, schema validation, exact match
  • The task has a verifiable ground truth: coding challenges, structured extraction, math proofs
  • The model is capable enough to correctly diagnose its own mistakes from the error signal

Limitations

  • Fuzzy LLM-scored evaluation introduces noise; the agent may optimize for the judge rather than the actual task
  • Knowledge gaps cannot be fixed by reflection — if the model lacks a fact, it will keep hallucinating it
  • Compounding errors: if early reflections are wrong, they can mislead later attempts
  • Each iteration multiplies LLM cost: 3 iterations cost at least 3× the inference spend, and somewhat more in practice because later prompts also carry the accumulated reflections

Practical Rule: Cap Reflexion at 3 iterations for production systems. Empirical results from the original paper show diminishing returns beyond this threshold for most task categories.
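A back-of-the-envelope cost model makes the tradeoff concrete. Every number below is an assumption to be replaced with your model's actual pricing and prompt sizes:

```python
def estimate_reflexion_cost(
    attempts: int,
    attempt_tokens: int = 3000,       # assumption: base prompt + completion
    reflection_tokens: int = 500,     # assumption: size of one critique
    usd_per_1k_tokens: float = 0.01,  # assumption: blended model price
) -> float:
    """Rough upper bound on spend: attempt i re-reads i prior reflections,
    and each failed attempt also pays for generating a new reflection."""
    total_tokens = 0
    for i in range(attempts):
        total_tokens += attempt_tokens + i * reflection_tokens  # actor call
        total_tokens += reflection_tokens                       # reflection call
    return total_tokens * usd_per_1k_tokens / 1000
```

Note that three attempts cost more than three times one attempt, because the second and third prompts carry the earlier critiques.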


Observing Improvement Across Attempts

Here is a realistic trajectory on a coding task:

  Attempt   Score   Failure Mode                        Reflection Key Insight
  ───────   ─────   ─────────────────────────────────   ──────────────────────────────────────────────────────────────
  1         33%     Missing edge case: empty grid       "Guard against len(grid) == 0 before accessing grid[0]"
  2         67%     DFS treats diagonals as connected   "Islands are 4-directional only; remove diagonal moves from DFS"
  3         100%    (all tests pass)                    (none needed)

Note: The original Reflexion paper demonstrated significant gains on HotpotQA (+14%), AlfWorld (+22%), and HumanEval (+17%) over standard ReAct baselines. The coding improvements were the most pronounced because test suites provide unambiguous binary evaluation signals.


Key Takeaways

  • Reflexion replaces gradient descent with verbal reinforcement — the agent's self-critique is its training signal, operating entirely within the inference context.
  • The evaluator quality is the single most important design decision. Binary, deterministic signals produce the best feedback loops.
  • The pattern is most powerful for iterative correctness tasks: code generation, structured data extraction, mathematical reasoning.
  • Improvements are ephemeral by default — they exist only within the current session. Persist reflections to a vector store to enable cross-session learning.
  • Always pair Reflexion with cost controls: unlimited reflection loops can exhaust token budgets on hard tasks that the model fundamentally cannot solve.