Integrating the Anthropic API

Anthropic's Claude models (Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku) are known for their strong reasoning, long context windows (200K tokens), and adherence to Constitutional AI safety principles. The Anthropic API has a clean design with a few important differences from OpenAI's that are worth understanding.

Installation and Setup

pip install anthropic python-dotenv
# Client setup
from dotenv import load_dotenv
import anthropic

load_dotenv()

# Reads ANTHROPIC_API_KEY from environment automatically
client = anthropic.Anthropic()

Key Difference: System Prompt is a Top-Level Parameter

In OpenAI's API, the system message is just another item in the messages array with role: "system". In Anthropic's API, the system prompt is a separate top-level parameter — not part of the messages array:

# OpenAI style (system as message)
openai_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# Anthropic style (system as separate parameter)
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system="You are a helpful assistant.",   # <-- top-level parameter
    messages=[
        {"role": "user", "content": "Hello!"}  # No system role here
    ]
)

print(response.content[0].text)
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
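If you are porting code from OpenAI, a small adapter can split the system message out of an OpenAI-style list. A minimal sketch (the `to_anthropic` helper is a hypothetical name, not part of either SDK):

```python
# Hypothetical adapter (not part of either SDK): split an OpenAI-style
# message list into the (system, messages) pair Anthropic expects.
def to_anthropic(openai_messages):
    system_parts = []
    messages = []
    for msg in openai_messages:
        if msg["role"] == "system":
            system_parts.append(msg["content"])
        else:
            messages.append(msg)
    return "\n\n".join(system_parts), messages

system, messages = to_anthropic([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
# system == "You are a helpful assistant."
# messages == [{"role": "user", "content": "Hello!"}]
```

The returned pair maps directly onto the `system=` and `messages=` arguments of `client.messages.create`.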

Response Structure

Anthropic responses differ structurally from OpenAI's:

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    messages=[{"role": "user", "content": "What is a neural network?"}]
)

# Anthropic response structure
print(response.id)                    # msg_01...
print(response.model)                  # claude-3-5-sonnet-20241022
print(response.stop_reason)            # "end_turn" | "max_tokens" | "stop_sequence"
print(response.content[0].type)        # "text"
print(response.content[0].text)        # The actual response text
print(response.usage.input_tokens)
print(response.usage.output_tokens)
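Note that `response.content` is a list of blocks, and with tool use or extended thinking it can contain more than one. A small hypothetical helper that joins only the text blocks is more robust than indexing `content[0]` directly:

```python
# Hypothetical helper: response.content is a list of blocks, so joining
# every text block is safer than blindly reading content[0].
def response_text(content_blocks):
    return "".join(
        block.text for block in content_blocks if block.type == "text"
    )
```

Call it as `response_text(response.content)` anywhere the examples above read `response.content[0].text`.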

Streaming

# Synchronous streaming
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain the concept of entropy in information theory."}],
) as stream:
    for text_chunk in stream.text_stream:
        print(text_chunk, end="", flush=True)

    # Get the final message while the stream is still open
    final_message = stream.get_final_message()
    print(f"\nTotal input tokens: {final_message.usage.input_tokens}")

Extended Thinking (Claude 3.7+)

Claude 3.7 Sonnet supports extended thinking, in which the model works through its reasoning in a separate content block before producing its final answer:

# Extended thinking for complex reasoning tasks
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Max tokens for thinking
    },
    messages=[{
        "role": "user",
        "content": "Prove that there are infinitely many prime numbers."
    }]
)

# Response has two content blocks: thinking + text
for block in response.content:
    if block.type == "thinking":
        print("=== Claude's Reasoning ===")
        print(block.thinking)
    elif block.type == "text":
        print("=== Final Answer ===")
        print(block.text)
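The thinking budget must be smaller than `max_tokens`, and the minimum budget is 1024 tokens. A small request guard (a sketch, not part of the SDK) can catch misconfigurations before they hit the API:

```python
# Sketch of a request guard (not part of the SDK). The API requires
# budget_tokens < max_tokens, and the minimum thinking budget is 1024.
def validate_thinking(max_tokens: int, budget_tokens: int) -> None:
    if budget_tokens < 1024:
        raise ValueError("budget_tokens must be at least 1024")
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be less than max_tokens")

validate_thinking(max_tokens=16000, budget_tokens=10000)  # OK
```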

Prompt Caching (Cost Reduction)

Anthropic's prompt caching dramatically reduces costs when reusing large system prompts:

# Cache a large system prompt (minimum ~1024 tokens for Sonnet-class models)
large_system_prompt = """You are an expert legal analyst specializing in contract law...
[...thousands of tokens of legal context...]"""

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": large_system_prompt,
            "cache_control": {"type": "ephemeral"}  # Cache this block
        }
    ],
    messages=[{"role": "user", "content": "Review this contract clause..."}]
)

# First call: full input token cost
# Subsequent calls with same system: cache read at ~10% cost
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
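To see why caching pays off, a back-of-the-envelope estimate helps. The sketch below uses illustrative Sonnet-class multipliers (a cache write at ~1.25x the base input rate, a cache read at ~0.1x); verify against current pricing before relying on these numbers:

```python
# Illustrative multipliers only -- verify against current pricing:
# cache write ~1.25x the base input rate, cache read ~0.1x.
def caching_cost(prompt_tokens, calls, base_rate=3.00 / 1_000_000):
    uncached = calls * prompt_tokens * base_rate
    cached = (
        prompt_tokens * base_rate * 1.25                   # first call: cache write
        + (calls - 1) * prompt_tokens * base_rate * 0.10   # later calls: cache reads
    )
    return uncached, cached

uncached, cached = caching_cost(prompt_tokens=50_000, calls=100)
# With these assumed multipliers, caching cuts input cost by roughly 9x.
```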

Model Selection Guide

Model                        Best For                            Speed     Cost
claude-3-5-haiku-20241022    High-volume, latency-sensitive      Fastest   Cheapest
claude-3-5-sonnet-20241022   Balance of capability and cost      Fast      Medium
claude-3-opus-20240229       Maximum capability (complex tasks)  Slower    Most expensive
claude-3-7-sonnet-20250219   Extended thinking, hard reasoning   Variable  Medium-High

For most agent use cases, claude-3-5-sonnet-20241022 offers the best capability-to-cost ratio. Reserve Opus for tasks requiring the highest reasoning quality where cost is not the primary constraint.
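The guide above can be encoded as a simple routing helper. `pick_model` and its task labels are this sketch's own conventions, not SDK features; the model IDs are the dated ones from the table:

```python
# Hypothetical routing helper; the task labels are this sketch's own
# convention, and the model IDs come from the table above.
def pick_model(task: str) -> str:
    routes = {
        "high_volume": "claude-3-5-haiku-20241022",
        "general": "claude-3-5-sonnet-20241022",
        "max_quality": "claude-3-opus-20240229",
        "hard_reasoning": "claude-3-7-sonnet-20250219",
    }
    # Default to Sonnet, the best capability-to-cost ratio for most agents
    return routes.get(task, "claude-3-5-sonnet-20241022")
```

Defaulting to Sonnet mirrors the recommendation above: escalate to Opus only when reasoning quality outweighs cost.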