Prompt Engineering Basics

Chain-of-Thought Reasoning

Chain-of-Thought (CoT) prompting, introduced by Wei et al. (2022), is one of the most impactful prompting techniques discovered in recent years. Instead of asking the model to jump directly to an answer, CoT encourages the model to reason step by step — producing intermediate reasoning traces before the final answer. This dramatically improves performance on tasks requiring multi-step reasoning, math, and logic.

Why CoT Works

In standard prompting, the model's context right before it generates the answer contains only the question. In CoT prompting, the context contains both the question and a chain of reasoning — the model generates each reasoning step conditioned on the accumulated context of all previous steps. This gives the model "more thinking room" and reduces the probability of reasoning errors that would otherwise cascade into a wrong final answer.

A useful mental model: CoT transforms an implicit, single-step reasoning problem into an explicit, multi-step generation problem where each step can be verified.

Zero-Shot CoT: "Let's think step by step"

The simplest CoT technique requires no examples. Just append "Let's think step by step." to your prompt:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Without CoT — often produces wrong answers for multi-step math
without_cot = llm.invoke([
    HumanMessage(content="A train travels 120 miles in 2 hours. A car travels the same 120 miles but takes 30 minutes longer. How much faster (mph) is the train?")
])

# With zero-shot CoT
with_cot = llm.invoke([
    HumanMessage(content="""A train travels 120 miles in 2 hours. A car travels the same 120 miles but takes 30 minutes longer. How much faster (mph) is the train?

Let's think step by step.""")
])

print("Without CoT:", without_cot.content)
print("\nWith CoT:", with_cot.content)

Few-Shot CoT: Providing Reasoning Examples

For reliable results, provide examples that include the full reasoning chain:

from langchain_core.prompts import ChatPromptTemplate

cot_prompt = ChatPromptTemplate.from_messages([
    ("system", "Solve math problems step by step. Show your work before giving the final answer."),
    
    # Example 1 with full reasoning
    ("human", "If a rectangle has width 8cm and length 12cm, what is its area?"),
    ("ai", """Step 1: Recall the formula for rectangle area.
Area = width × length

Step 2: Substitute the values.
Area = 8cm × 12cm

Step 3: Calculate.
Area = 96cm²

Final answer: 96 square centimeters."""),
    
    # Example 2 with full reasoning
    ("human", "A shirt costs $45 and is on sale for 20% off. What is the sale price?"),
    ("ai", """Step 1: Calculate the discount amount.
Discount = 20% of $45 = 0.20 × $45 = $9

Step 2: Subtract the discount from the original price.
Sale price = $45 - $9 = $36

Final answer: $36"""),
    
    # Actual query
    ("human", "{problem}")
])

chain = cot_prompt | llm
result = chain.invoke({"problem": "A tank holds 500 liters. It fills at 25 liters/minute and drains at 10 liters/minute. How long to fill from empty?"})
print(result.content)
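As the example set grows, it can be easier to keep the (question, reasoning) pairs in a plain list and assemble the message tuples programmatically. `build_cot_messages` below is a hypothetical helper, not a LangChain API; it just produces the same (role, content) shape that `ChatPromptTemplate.from_messages` accepts:

```python
def build_cot_messages(
    examples: list[tuple[str, str]],
    problem: str,
    system: str = "Solve math problems step by step. Show your work before giving the final answer.",
) -> list[tuple[str, str]]:
    """Assemble a few-shot CoT message list from (question, reasoning) pairs."""
    messages = [("system", system)]
    for question, reasoning in examples:
        messages.append(("human", question))  # worked example question
        messages.append(("ai", reasoning))    # full reasoning chain
    messages.append(("human", problem))       # the actual query slot
    return messages
```

Passing `"{problem}"` as the final query keeps the template placeholder, so the returned list can be handed to `ChatPromptTemplate.from_messages(...)` in place of the hand-written tuples above.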

Self-Consistency: Sampling Multiple Reasoning Paths

A powerful extension of CoT is self-consistency (Wang et al., 2022): generate multiple reasoning chains and take the majority vote among final answers. This reduces the impact of any single flawed reasoning path.

import asyncio
from collections import Counter

async def self_consistent_cot(problem: str, num_samples: int = 5) -> str:
    """
    Generate multiple CoT reasoning paths and take majority vote.
    Uses temperature > 0 to get diverse reasoning paths.
    """
    llm_diverse = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
    
    prompt = f"{problem}\n\nLet's think step by step. End with 'Final answer: [answer]'"
    
    # Generate diverse reasoning paths concurrently
    tasks = [
        llm_diverse.ainvoke([HumanMessage(content=prompt)])
        for _ in range(num_samples)
    ]
    responses = await asyncio.gather(*tasks)
    
    # Extract final answers
    answers = []
    for r in responses:
        content = r.content
        if "Final answer:" in content:
            # Answers are compared as raw strings below, so formatting
            # differences ("$36" vs "36") can split the vote
            answer = content.split("Final answer:")[-1].strip()
            answers.append(answer)
    
    # Majority vote
    if answers:
        most_common = Counter(answers).most_common(1)[0][0]
        return most_common
    return "Could not determine answer"

# Usage
answer = asyncio.run(self_consistent_cot(
    "If 5 cats eat 5 mice in 5 minutes, how long for 100 cats to eat 100 mice?"
))
print(f"Consensus answer: {answer}")
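One practical caveat with the majority vote above: answers are compared as raw strings, so "5 minutes" and "5 minutes." count as different candidates. A light normalization step before voting makes the consensus more robust. `normalize_answer` is an illustrative sketch, and the exact rules would depend on your answer format:

```python
import re
from collections import Counter

def normalize_answer(raw: str) -> str:
    """Canonicalize an extracted answer so equivalent phrasings vote together.

    Lowercases, trims punctuation and currency symbols, and keeps just the
    leading number when one is present ('5 minutes.' and '5 min' -> '5').
    The rules here are task-specific; this is only a sketch.
    """
    cleaned = raw.strip().lower().rstrip(".!").lstrip("$")
    match = re.match(r"-?\d+(?:\.\d+)?", cleaned)
    return match.group(0) if match else cleaned

votes = Counter(normalize_answer(a) for a in ["5 minutes", "5 minutes.", "5 min", "10 minutes"])
print(votes.most_common(1)[0][0])  # prints: 5
```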

When to Use CoT

CoT significantly helps for:

  • Multi-step math and word problems
  • Logical reasoning (deduction, constraint satisfaction)
  • Code debugging (tracing through execution)
  • Decision making with multiple factors

CoT doesn't help much for:

  • Simple factual recall ("What is the capital of France?")
  • Single-step classification
  • Creative writing (reasoning traces can constrain creativity)

Cost Consideration

CoT produces much longer outputs — the reasoning chain can be 3-10x longer than a direct answer. For high-volume applications, weigh the accuracy improvement against the increased token cost. For production systems, CoT is often reserved for complex requests while simpler queries use direct prompting.
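The routing idea above can be sketched as a cheap pre-check that decides whether to append the CoT suffix at all. `needs_cot` is a hypothetical keyword heuristic — production systems more often use a small classifier model for this — but it illustrates the pattern:

```python
def needs_cot(query: str) -> bool:
    """Crude router: send multi-step-looking queries down the CoT path.

    Keyword matching is an illustrative stand-in; a real system might
    use a cheap classifier model to make this decision.
    """
    reasoning_markers = ("how many", "how long", "calculate", "if ", "%", "compare")
    q = query.lower()
    # Queries containing several digits usually involve arithmetic
    return any(m in q for m in reasoning_markers) or sum(c.isdigit() for c in q) >= 2

# Only complex queries pay the extra token cost of a reasoning chain
query = "A tank fills at 25 L/min and drains at 10 L/min. How long to fill 500 L?"
suffix = "\n\nLet's think step by step." if needs_cot(query) else ""
prompt = query + suffix
```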