# Integrating the OpenAI API
OpenAI's API is the most widely used LLM interface in production systems. This lesson covers everything you need to integrate it correctly: authentication, the chat completions endpoint, streaming, structured output, and the official Python SDK.
## Installation and Authentication

```bash
pip install openai python-dotenv
```
```python
# config.py
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# The SDK automatically reads OPENAI_API_KEY from the environment
client = OpenAI()

# Or pass the key explicitly (useful for multi-tenant systems)
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
```
## The Chat Completions Endpoint

The primary endpoint for modern OpenAI models is `/v1/chat/completions`:
```python
# Basic chat completion
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how backpropagation works in 3 sentences."},
    ],
    temperature=0.3,  # 0 = near-deterministic, higher = more random (range 0-2)
    max_tokens=500,   # Limit response length
    top_p=1.0,        # Nucleus sampling (usually leave at 1.0)
)

# Access the response
text = response.choices[0].message.content
print(text)

print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
```
## Key Parameters Explained
| Parameter | Range | Effect |
|---|---|---|
| `temperature` | 0.0 – 2.0 | Controls randomness. 0 = near-deterministic, 1+ = creative |
| `max_tokens` | 1 – context limit | Maximum response length |
| `top_p` | 0.0 – 1.0 | Nucleus sampling threshold. 0.9 = sample from the top 90% of probability mass |
| `n` | 1+ | Number of completions to generate (increases output cost proportionally) |
| `stop` | string or list | Stop generation when this string appears |
| `seed` | integer | Attempt reproducible outputs (best effort, not guaranteed) |
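To make the less common parameters concrete, here is an illustrative request combining `n`, `stop`, and `seed`. The values are arbitrary examples, and the call itself is shown commented out since it assumes the `client` from the setup above:

```python
# Illustrative parameter combination (values are arbitrary examples)
request = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "List three sorting algorithms."}],
    "n": 2,            # two independent completions; you pay output tokens for both
    "temperature": 0.7,
    "stop": ["\n\n"],  # truncate each completion at the first blank line
    "seed": 1234,      # best-effort determinism; compare response.system_fingerprint
}
# response = client.chat.completions.create(**request)
# Each completion is a separate entry in response.choices.
```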
## Streaming Responses

For interactive applications, stream tokens as they are generated:
```python
# Stream with real-time output
import sys

with client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a haiku about machine learning."}],
    stream=True,
) as stream:
    for chunk in stream:
        if not chunk.choices:  # some chunks (e.g. a final usage chunk) carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            sys.stdout.write(delta)
            sys.stdout.flush()
print()  # Final newline
```
```python
# Async streaming (better for web servers) -- requires the async client,
# and the create() call itself must be awaited
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def stream_response(prompt: str):
    stream = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```
## Structured Output

For machine-readable responses, use Structured Outputs with a Pydantic schema. (This is distinct from the older JSON mode, which only guarantees syntactically valid JSON, not conformance to your schema.)
```python
from typing import List

from pydantic import BaseModel

class BookRecommendation(BaseModel):
    title: str
    author: str
    year: int
    why_relevant: str

class RecommendationList(BaseModel):
    recommendations: List[BookRecommendation]
    total_count: int

# Parse into a Pydantic model directly (OpenAI SDK v1.40+)
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Recommend books based on the user's interests."},
        {"role": "user", "content": "I love AI, distributed systems, and science fiction."},
    ],
    response_format=RecommendationList,
)

result: RecommendationList = completion.choices[0].message.parsed
for book in result.recommendations:
    print(f"- {book.title} by {book.author} ({book.year}): {book.why_relevant}")
```
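If you use plain JSON mode instead (`response_format={"type": "json_object"}`), the model returns a JSON string that you must parse and validate yourself. A minimal sketch of that client-side validation, reusing the same schema and a canned reply in place of a live response:

```python
import json
from typing import List, Optional

from pydantic import BaseModel, ValidationError

class BookRecommendation(BaseModel):
    title: str
    author: str
    year: int
    why_relevant: str

class RecommendationList(BaseModel):
    recommendations: List[BookRecommendation]
    total_count: int

def validate_json_reply(raw: str) -> Optional[RecommendationList]:
    """Validate a JSON-mode reply against the schema; return None if malformed."""
    try:
        return RecommendationList.model_validate(json.loads(raw))
    except (json.JSONDecodeError, ValidationError):
        return None

# Canned reply for illustration; in production, `raw` comes from
# response.choices[0].message.content
raw = (
    '{"recommendations": [{"title": "Snow Crash", "author": "Neal Stephenson",'
    ' "year": 1992, "why_relevant": "Classic AI-adjacent SF"}], "total_count": 1}'
)
result = validate_json_reply(raw)
```

Returning `None` on failure lets the caller decide whether to retry the request or surface an error.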
## Async for Production

In FastAPI or other async web frameworks, use the async client:
```python
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def get_ai_response(user_message: str) -> str:
    response = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_message}],
        temperature=0,
    )
    return response.choices[0].message.content
```
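The async client pays off most when fanning out many requests at once. A minimal sketch using `asyncio.gather`; `fake_ask` is a stand-in coroutine so the example runs offline — in production you would pass `get_ai_response` instead:

```python
import asyncio
from typing import Awaitable, Callable

async def answer_many(
    ask: Callable[[str], Awaitable[str]], questions: list[str]
) -> list[str]:
    # All requests run concurrently: wall time is roughly the slowest
    # single request, not the sum of all of them.
    return await asyncio.gather(*(ask(q) for q in questions))

# Stand-in coroutine for offline demonstration
async def fake_ask(q: str) -> str:
    await asyncio.sleep(0.01)
    return f"answer to: {q}"

answers = asyncio.run(answer_many(fake_ask, ["What is a mutex?", "What is a monad?"]))
```

`asyncio.gather` preserves input order, so `answers[i]` corresponds to `questions[i]` even though the requests complete in any order.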
## Cost Tracking

Always track token usage to monitor costs:
```python
from dataclasses import dataclass, field
from threading import Lock

@dataclass
class TokenTracker:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    _lock: Lock = field(default_factory=Lock)

    def record(self, usage) -> None:
        with self._lock:
            self.prompt_tokens += usage.prompt_tokens
            self.completion_tokens += usage.completion_tokens

    def estimate_cost(
        self,
        input_price_per_m: float = 0.15,
        output_price_per_m: float = 0.60,
    ) -> float:
        """Prices for gpt-4o-mini as of 2024."""
        return (
            (self.prompt_tokens / 1_000_000) * input_price_per_m
            + (self.completion_tokens / 1_000_000) * output_price_per_m
        )

tracker = TokenTracker()
response = client.chat.completions.create(model="gpt-4o-mini", messages=[...])
tracker.record(response.usage)
print(f"Estimated cost: ${tracker.estimate_cost():.6f}")
```
With this foundation, you're ready to build production-grade integrations. The next lesson covers the Anthropic API, which has important differences in its message format and extended context handling.