Integrating the OpenAI API

OpenAI's API is one of the most widely used LLM interfaces in production systems. This lesson covers what you need to integrate it correctly: authentication, the chat completions endpoint, streaming, structured output, and the official Python SDK.

Installation and Authentication

pip install openai python-dotenv

# config.py
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# The SDK automatically reads OPENAI_API_KEY from the environment
client = OpenAI()

# Or explicitly (useful for multi-tenant systems)
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
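load_dotenv() reads a .env file from the current working directory. A typical layout, with a placeholder value (keep this file out of version control):

```
# .env  (add to .gitignore; never commit real keys)
OPENAI_API_KEY=sk-your-key-here
```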

The Chat Completions Endpoint

The primary endpoint for modern OpenAI models is /v1/chat/completions:

# Basic chat completion
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how backpropagation works in 3 sentences."},
    ],
    temperature=0.3,        # 0 = near-deterministic, up to 2.0 = most random
    max_tokens=500,         # Limit response length
    top_p=1.0,              # Nucleus sampling (usually leave at 1.0)
)

# Access the response
text = response.choices[0].message.content
print(text)
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
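Under the hood, the SDK serializes this call into a JSON POST body for that endpoint. A sketch of the wire format (build_chat_request is a hypothetical helper for illustration; nothing below contacts the API):

```python
import json

def build_chat_request(model: str, messages: list, **params) -> dict:
    """Assemble the JSON body the SDK POSTs to /v1/chat/completions."""
    body = {"model": model, "messages": messages}
    body.update(params)  # temperature, max_tokens, top_p, ...
    return body

body = build_chat_request(
    "gpt-4o-mini",
    [{"role": "user", "content": "Hello"}],
    temperature=0.3,
    max_tokens=500,
)
# The API key travels in an Authorization header, not in the body:
headers = {"Authorization": "Bearer <OPENAI_API_KEY>", "Content-Type": "application/json"}
print(json.dumps(body, indent=2))
```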

Key Parameters Explained

Parameter    Range              Effect
temperature  0.0 – 2.0          Controls randomness. 0 = near-deterministic, 1+ = creative
max_tokens   1 – context limit  Maximum response length in tokens
top_p        0.0 – 1.0          Nucleus sampling threshold. 0.9 = sample from top 90% probability mass
n            1+                 Number of completions to generate (increases cost proportionally)
stop         string or list     Stop generation when this string appears
seed         integer            Attempt reproducible outputs (not guaranteed)
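Some of these combine usefully: for best-effort reproducibility, pin seed and set temperature to 0, and use stop to bound the output at a sentinel string. A sketch of the keyword arguments (values are illustrative; the API documents seed as best-effort only):

```python
# Keyword arguments for a bounded, best-effort-reproducible completion.
reproducible_params = {
    "temperature": 0.0,  # minimize sampling randomness
    "seed": 42,          # ask the server to reuse the same sampling path
    "stop": ["\n\n"],    # cut generation at the first blank line
    "n": 1,              # n > 1 returns multiple choices and multiplies output cost
}
# Passed through as:
# client.chat.completions.create(model=..., messages=..., **reproducible_params)
print(sorted(reproducible_params))
```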

Streaming Responses

For interactive applications, stream tokens as they generate:

# Stream with real-time output
import sys

with client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a haiku about machine learning."}],
    stream=True,
) as stream:
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            sys.stdout.write(delta)
            sys.stdout.flush()
    print()  # Final newline

# Async streaming (better for web servers) requires the async client
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def stream_response(prompt: str):
    # The async create() call must be awaited; it returns an AsyncStream
    stream = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
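Whichever variant you use, the client reassembles the full message from the deltas. A minimal accumulator, exercised here against stand-in chunk objects rather than a live stream (fake_chunk is a test stub, not part of the SDK):

```python
from types import SimpleNamespace

def accumulate(chunks) -> str:
    """Concatenate delta.content fields, skipping the None deltas
    that appear in role-only and final chunks."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)

# Stand-in chunks shaped like the SDK's ChatCompletionChunk objects
def fake_chunk(content):
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=content))])

chunks = [fake_chunk(None), fake_chunk("Hello"), fake_chunk(", world"), fake_chunk(None)]
print(accumulate(chunks))  # Hello, world
```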

Structured Output

For machine-readable responses, use Structured Outputs, which constrain generation to a schema you define (stricter and more reliable than the older JSON mode):

from pydantic import BaseModel
from typing import List

class BookRecommendation(BaseModel):
    title: str
    author: str
    year: int
    why_relevant: str

class RecommendationList(BaseModel):
    recommendations: List[BookRecommendation]
    total_count: int

# Parse into a Pydantic model directly (OpenAI SDK v1.40+)
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Recommend books based on the user's interests."},
        {"role": "user", "content": "I love AI, distributed systems, and science fiction."}
    ],
    response_format=RecommendationList,
)

result: RecommendationList = completion.choices[0].message.parsed
for book in result.recommendations:
    print(f"- {book.title} by {book.author} ({book.year}): {book.why_relevant}")
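On older SDK versions without .parse, the fallback is the original JSON mode: pass response_format={"type": "json_object"}, mention JSON in the prompt (the API requires this), and validate the returned text yourself. A sketch of that client-side validation, run against a canned payload (the book data is invented for illustration):

```python
import json

def parse_recommendations(raw: str) -> list:
    """Validate a JSON-mode payload by hand: json.loads, then check
    that each entry carries the fields the prompt asked for."""
    data = json.loads(raw)
    books = data["recommendations"]
    for book in books:
        assert {"title", "author", "year"} <= book.keys(), f"missing fields: {book}"
    return books

# Canned example of what the model might return in JSON mode
raw = '{"recommendations": [{"title": "Snow Crash", "author": "Neal Stephenson", "year": 1992}]}'
print(parse_recommendations(raw)[0]["title"])  # Snow Crash
```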

Async for Production

In FastAPI or async web frameworks, use the async client:

from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def get_ai_response(user_message: str) -> str:
    response = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_message}],
        temperature=0,
    )
    return response.choices[0].message.content
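The async client pays off when fanning out many requests concurrently; a common pattern is asyncio.gather behind a semaphore so in-flight requests stay under your rate limit. Sketched here with a stub in place of the real API call:

```python
import asyncio

MAX_CONCURRENT = 5  # keep in-flight requests under the account's rate limit

async def fetch_with_limit(sem: asyncio.Semaphore, prompt: str) -> str:
    async with sem:
        # In production this would be: await async_client.chat.completions.create(...)
        await asyncio.sleep(0)  # stand-in for the network call
        return f"response to: {prompt}"

async def fan_out(prompts: list) -> list:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(fetch_with_limit(sem, p) for p in prompts))

results = asyncio.run(fan_out(["a", "b", "c"]))
print(results)  # ['response to: a', 'response to: b', 'response to: c']
```

gather preserves input order in its results, so responses line up with their prompts even though the requests complete out of order.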

Cost Tracking

Always track token usage to monitor costs:

from dataclasses import dataclass, field
from threading import Lock

@dataclass
class TokenTracker:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    _lock: Lock = field(default_factory=Lock)

    def record(self, usage) -> None:
        with self._lock:
            self.prompt_tokens += usage.prompt_tokens
            self.completion_tokens += usage.completion_tokens

    def estimate_cost(self, input_price_per_m: float = 0.15, output_price_per_m: float = 0.60) -> float:
        """Prices for gpt-4o-mini as of 2024."""
        return (
            (self.prompt_tokens / 1_000_000) * input_price_per_m
            + (self.completion_tokens / 1_000_000) * output_price_per_m
        )

tracker = TokenTracker()
response = client.chat.completions.create(model="gpt-4o-mini", messages=[...])
tracker.record(response.usage)
print(f"Estimated cost: ${tracker.estimate_cost():.6f}")

With this foundation, you're ready to build production-grade integrations. The next lesson covers the Anthropic API, which has important differences in its message format and extended context handling.