# Integrating the OpenAI API
OpenAI's API is the most widely used LLM interface in production systems. This lesson covers everything you need to integrate it correctly: authentication, the chat completions endpoint, streaming, structured output, and the official Python SDK.
## Installation and Authentication

```bash
pip install openai python-dotenv
```
```python
# config.py
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# The SDK automatically reads OPENAI_API_KEY from the environment
client = OpenAI()

# Or pass the key explicitly (useful for multi-tenant systems)
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
```
## The Chat Completions Endpoint

The primary endpoint for modern OpenAI models is `/v1/chat/completions`:
```python
# Basic chat completion
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how backpropagation works in 3 sentences."},
    ],
    temperature=0.3,  # 0 = near-deterministic, higher = more random (range 0-2)
    max_tokens=500,   # Limit response length
    top_p=1.0,        # Nucleus sampling (usually leave at 1.0)
)

# Access the response
text = response.choices[0].message.content
print(text)

print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
```
## Key Parameters Explained
| Parameter | Range | Effect |
|---|---|---|
| `temperature` | 0.0 – 2.0 | Controls randomness. 0 = near-deterministic, 1+ = creative |
| `max_tokens` | 1 – context limit | Maximum response length |
| `top_p` | 0.0 – 1.0 | Nucleus sampling threshold. 0.9 = sample from the top 90% of probability mass |
| `n` | 1+ | Number of completions to generate (increases output cost proportionally) |
| `stop` | string or list | Stop generation when this string appears |
| `seed` | integer | Attempt reproducible outputs (best effort, not guaranteed) |
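To make the less common parameters concrete, here is an illustrative request combining `n`, `stop`, and `seed`. The values are arbitrary examples, and the call itself is shown commented out since it assumes the `client` from the setup above:

```python
# Illustrative parameter combination (values are arbitrary examples)
request = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "List three sorting algorithms."}],
    "n": 2,            # two independent completions; you pay output tokens for both
    "temperature": 0.7,
    "stop": ["\n\n"],  # truncate each completion at the first blank line
    "seed": 1234,      # best-effort determinism; compare response.system_fingerprint
}
# response = client.chat.completions.create(**request)
# Each completion is a separate entry in response.choices.
```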
## Streaming Responses

For interactive applications, stream tokens as they are generated:
```python
# Stream with real-time output
import sys

with client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a haiku about machine learning."}],
    stream=True,
) as stream:
    for chunk in stream:
        if not chunk.choices:  # some chunks (e.g. a final usage chunk) carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            sys.stdout.write(delta)
            sys.stdout.flush()
print()  # Final newline
```
```python
# Async streaming (better for web servers) -- requires the async client,
# and the create() call itself must be awaited
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def stream_response(prompt: str):
    stream = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```
## Structured Output

For machine-readable responses, use Structured Outputs with a Pydantic schema. (This is distinct from the older JSON mode, which only guarantees syntactically valid JSON, not conformance to your schema.)
```python
from typing import List

from pydantic import BaseModel

class BookRecommendation(BaseModel):
    title: str
    author: str
    year: int
    why_relevant: str

class RecommendationList(BaseModel):
    recommendations: List[BookRecommendation]
    total_count: int

# Parse into a Pydantic model directly (OpenAI SDK v1.40+)
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Recommend books based on the user's interests."},
        {"role": "user", "content": "I love AI, distributed systems, and science fiction."},
    ],
    response_format=RecommendationList,
)

result: RecommendationList = completion.choices[0].message.parsed
for book in result.recommendations:
    print(f"- {book.title} by {book.author} ({book.year}): {book.why_relevant}")
```
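If you use plain JSON mode instead (`response_format={"type": "json_object"}`), the model returns a JSON string that you must parse and validate yourself. A minimal sketch of that client-side validation, reusing the same schema and a canned reply in place of a live response:

```python
import json
from typing import List, Optional

from pydantic import BaseModel, ValidationError

class BookRecommendation(BaseModel):
    title: str
    author: str
    year: int
    why_relevant: str

class RecommendationList(BaseModel):
    recommendations: List[BookRecommendation]
    total_count: int

def validate_json_reply(raw: str) -> Optional[RecommendationList]:
    """Validate a JSON-mode reply against the schema; return None if malformed."""
    try:
        return RecommendationList.model_validate(json.loads(raw))
    except (json.JSONDecodeError, ValidationError):
        return None

# Canned reply for illustration; in production, `raw` comes from
# response.choices[0].message.content
raw = (
    '{"recommendations": [{"title": "Snow Crash", "author": "Neal Stephenson",'
    ' "year": 1992, "why_relevant": "Classic AI-adjacent SF"}], "total_count": 1}'
)
result = validate_json_reply(raw)
```

Returning `None` on failure lets the caller decide whether to retry the request or surface an error.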
## Async for Production

In FastAPI or other async web frameworks, use the async client:
```python
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def get_ai_response(user_message: str) -> str:
    response = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_message}],
        temperature=0,
    )
    return response.choices[0].message.content
```
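The async client pays off most when fanning out many requests at once. A minimal sketch using `asyncio.gather`; `fake_ask` is a stand-in coroutine so the example runs offline — in production you would pass `get_ai_response` instead:

```python
import asyncio
from typing import Awaitable, Callable

async def answer_many(
    ask: Callable[[str], Awaitable[str]], questions: list[str]
) -> list[str]:
    # All requests run concurrently: wall time is roughly the slowest
    # single request, not the sum of all of them.
    return await asyncio.gather(*(ask(q) for q in questions))

# Stand-in coroutine for offline demonstration
async def fake_ask(q: str) -> str:
    await asyncio.sleep(0.01)
    return f"answer to: {q}"

answers = asyncio.run(answer_many(fake_ask, ["What is a mutex?", "What is a monad?"]))
```

`asyncio.gather` preserves input order, so `answers[i]` corresponds to `questions[i]` even though the requests complete in any order.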
## Cost Tracking

Always track token usage to monitor costs:
```python
from dataclasses import dataclass, field
from threading import Lock

@dataclass
class TokenTracker:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    _lock: Lock = field(default_factory=Lock)

    def record(self, usage) -> None:
        with self._lock:
            self.prompt_tokens += usage.prompt_tokens
            self.completion_tokens += usage.completion_tokens

    def estimate_cost(
        self,
        input_price_per_m: float = 0.15,
        output_price_per_m: float = 0.60,
    ) -> float:
        """Prices for gpt-4o-mini as of 2024."""
        return (
            (self.prompt_tokens / 1_000_000) * input_price_per_m
            + (self.completion_tokens / 1_000_000) * output_price_per_m
        )

tracker = TokenTracker()
response = client.chat.completions.create(model="gpt-4o-mini", messages=[...])
tracker.record(response.usage)
print(f"Estimated cost: ${tracker.estimate_cost():.6f}")
```
With this foundation, you're ready to build production-grade integrations. The next lesson covers the Anthropic API, which has important differences in its message format and extended context handling.