Advanced Prompting Techniques

Structured Output and JSON Mode

Structured output coercion is the practice of forcing LLM responses into a specific, machine-parseable schema. In production AI systems, unstructured text responses are the exception — most pipelines require the model to return typed data that can be directly consumed by downstream code without fragile string parsing.

The Problem with Free-Form Output

When you ask an LLM a question in production and need to parse the answer, free-form text is a liability:

# Fragile: relies on the model consistently using "Answer:" prefix
response_text = "Based on my analysis, the sentiment is clearly positive. Answer: Positive"
# Breaks when model says "The answer is: Positive" or "Positive." or "Sentiment: Positive"
answer = response_text.split("Answer:")[1].strip()  # Fails on variations

Structured output eliminates this class of bugs by making the schema a contract between you and the model.

Three Levels of Structured Output

Level 1: Format Instructions (Soft Constraint)

Instruct the model in the prompt:

system = """Always respond with valid JSON matching this schema:
{"answer": string, "confidence": float, "sources": string[]}
Do not include any text outside the JSON object."""

Reliability: ~85%. The model occasionally adds explanatory text or uses slightly different keys.

Level 2: JSON Mode (Hard JSON Guarantee)

API-level enforcement of syntactic JSON validity:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[...]
)
# Guaranteed: json.loads() will always succeed
# Not guaranteed: the JSON matches your specific schema

Reliability: 100% for JSON syntax, ~90% for schema compliance with good prompting. Note that OpenAI rejects a JSON-mode request unless the word "JSON" appears somewhere in the messages, so keep an explicit JSON instruction in the system prompt.
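Since JSON mode guarantees syntax but not shape, a post-parse schema check is still needed. A minimal sketch, with a hypothetical key/type map matching the Level 1 schema above:

```python
import json

# Expected shape: {"answer": str, "confidence": float, "sources": [...]}
REQUIRED_KEYS = {"answer": str, "confidence": float, "sources": list}

def check_schema(raw: str) -> dict:
    """json.loads is safe under JSON mode; key presence and types are not."""
    data = json.loads(raw)
    for key, typ in REQUIRED_KEYS.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"missing or mistyped key: {key!r}")
    return data
```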

Level 3: Structured Outputs (Schema-Level Guarantee)

API-level enforcement of a specific JSON schema using constrained decoding:

from pydantic import BaseModel
from typing import Literal

class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral", "mixed"]
    confidence: float  # 0.0 to 1.0
    key_phrases: list[str]  # Phrases that drove the classification
    recommended_action: Literal["escalate", "respond", "monitor", "close"]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Analyze customer feedback sentiment."},
        {"role": "user", "content": feedback_text}
    ],
    response_format=SentimentResult,
)

result: SentimentResult = completion.choices[0].message.parsed
# result is a fully typed Pydantic model — guaranteed schema compliance

Reliability: 100% schema compliance. The model's token generation is constrained by the schema.

How Constrained Decoding Works

Structured outputs work by applying a mask to the model's vocabulary at each token generation step. At any given position in the output, only tokens that could validly appear in a JSON structure matching the schema are allowed. This means the model literally cannot generate an invalid response — it's filtered at the token level, not the output level.

The tradeoff: constrained decoding is slightly slower and requires the model to "fit" its answer into the schema, which can occasionally compress or slightly alter information to fit the types.
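The masking step can be illustrated with a toy model: hand-rolled "tokens" and an enum-only schema stand in for a real tokenizer and JSON grammar. Real implementations walk a grammar or finite-state machine rather than scanning an explicit list of completions, but the effect is the same:

```python
# Every legal completion for a toy schema whose only field is an enum:
# {"sentiment": "positive" | "negative" | "neutral"}
VALID_OUTPUTS = [
    '{"sentiment": "positive"}',
    '{"sentiment": "negative"}',
    '{"sentiment": "neutral"}',
]

def allowed_tokens(prefix: str, vocab: list[str]) -> list[str]:
    """Mask the vocabulary: keep only tokens that extend `prefix`
    toward some valid completion."""
    return [t for t in vocab if any(v.startswith(prefix + t) for v in VALID_OUTPUTS)]

vocab = ['{"sentiment": "', 'positive', 'negative', 'awesome', '"}']
print(allowed_tokens('{"sentiment": "', vocab))  # ['positive', 'negative']
```

At this position, `awesome` is masked out even though it is a perfectly fluent token — the schema's enum simply does not allow it.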

Complex Schema Patterns

from pydantic import BaseModel, Field, field_validator
from typing import Literal, Optional
from datetime import date

class ExtractedEvent(BaseModel):
    name: str = Field(description="Event name or title")
    date: Optional[str] = Field(None, description="ISO 8601 date string YYYY-MM-DD or null if not specified")
    location: Optional[str] = None
    attendee_count: Optional[int] = Field(None, ge=1, description="Number of attendees, must be positive")
    event_type: Literal["conference", "webinar", "workshop", "meeting", "other"]
    
    @field_validator("date")
    @classmethod
    def validate_date_format(cls, v: str | None) -> str | None:
        if v is None:
            return None
        try:
            date.fromisoformat(v)
            return v
        except ValueError:
            raise ValueError(f"Date must be ISO 8601 format (YYYY-MM-DD), got: {v}")

class EventList(BaseModel):
    events: list[ExtractedEvent]
    extraction_notes: Optional[str] = Field(
        None,
        description="Notes about ambiguous cases or missing information"
    )

# Extract multiple events from text
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract all events mentioned in the text."},
        {"role": "user", "content": "The AI summit on March 15, 2025 in San Francisco had 3000 attendees. The team meeting next Tuesday is online."}
    ],
    response_format=EventList,
)

event_list: EventList = completion.choices[0].message.parsed
for event in event_list.events:
    print(f"{event.name}: {event.date} @ {event.location or 'online'}")
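The `validate_date_format` hook above delegates to the standard library. A quick stdlib-only look at what `date.fromisoformat` accepts (`is_iso_date` is just an illustrative helper, not part of the extraction code):

```python
from datetime import date

def is_iso_date(v: str) -> bool:
    # The same check the validator performs: parseable as an ISO calendar date
    try:
        date.fromisoformat(v)
        return True
    except ValueError:
        return False

print(is_iso_date("2025-03-15"))      # True
print(is_iso_date("March 15, 2025"))  # False
```

This is why the field description tells the model to emit YYYY-MM-DD or null: phrases like "next Tuesday" cannot pass the validator and should land in `extraction_notes` instead.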

When to Use Each Level

Use Case                                        Recommended Level
Internal tooling, low volume                    Level 1 (format instructions)
Production API, parseable output needed         Level 2 (JSON mode)
Mission-critical, typed data in database        Level 3 (structured outputs)
Anthropic models (no structured output API)     Level 1 + client-side validation

For Anthropic's Claude, which doesn't support structured outputs at API level, combine strong format instructions with Pydantic validation and retry-on-parse-failure logic to approximate the same reliability.
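That retry loop can be sketched provider-agnostically. Here `generate` and `call_with_retries` are illustrative names, not library APIs: `generate` stands in for whatever client call produces raw model text, and the validation error is fed back to the model as a correction prompt:

```python
import json

def call_with_retries(generate, validate, max_attempts=3):
    """generate(correction: str | None) -> raw model text.
    validate(raw: str) -> parsed object, raising on bad output.
    Each failure's error message becomes the next correction prompt."""
    correction = None
    for _ in range(max_attempts):
        raw = generate(correction)
        try:
            return validate(raw)
        except (json.JSONDecodeError, ValueError) as exc:
            correction = f"Previous output was invalid ({exc}). Return ONLY the JSON object."
    raise RuntimeError(f"no valid output after {max_attempts} attempts")
```

In practice `validate` could be a Pydantic model's `model_validate_json`; Pydantic's `ValidationError` subclasses `ValueError`, so the same except clause covers it.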