Scaling Multi-Agent Systems
Overview
Building a multi-agent system that works for ten requests per day is very different from one that handles ten thousand. Scaling introduces new challenges: coordinating many concurrent agent instances, balancing load without creating bottlenecks, managing costs as token consumption grows linearly with scale, and choosing deployment architectures that don't become brittle under load.
This lesson covers horizontal scaling strategies, agent pooling, cost optimization through model tiering, and production deployment patterns for multi-agent systems.
The Scaling Problem is Different for Agents
Traditional microservices scale by adding more compute. Agent systems are different because:
- LLM calls are expensive — each agent action costs money. Poor design can make costs grow faster than traffic.
- Context windows have limits — you can't just "cache more." Stateful agent conversations have hard token limits.
- Coordination overhead grows non-linearly — doubling agents can more than double coordination costs.
- Model providers have rate limits — concurrent agents may all hit the same API rate limit simultaneously.
Effective scaling requires addressing each of these dimensions specifically.
Horizontal Scaling Strategies
Strategy 1: Stateless Agent Workers
The simplest and most scalable pattern: make each agent invocation completely stateless. All context is passed in the request payload; no state lives on the agent worker itself.
```
                ┌─────────────────────────┐
                │      Load Balancer      │
                │   (round-robin / RPS)   │
                └───────────┬─────────────┘
                            │
            ┌───────────────┼───────────────┐
            │               │               │
    ┌───────▼─────┐ ┌───────▼─────┐ ┌───────▼─────┐
    │  Agent Pod  │ │  Agent Pod  │ │  Agent Pod  │
    │  Worker 1   │ │  Worker 2   │ │  Worker 3   │
    └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
           │               │               │
           └───────────────┼───────────────┘
                           │
                ┌──────────▼──────────────┐
                │   Shared State Store    │
                │  (Redis / PostgreSQL)   │
                └─────────────────────────┘
```
Each pod processes one agent task at a time. The load balancer distributes tasks. Shared state (blackboard, task queue, results) lives in a shared store such as Redis or PostgreSQL.
The rule: If an agent pod dies mid-task, the task can be retried by any other pod without data loss.
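That retry guarantee falls out naturally when the full task context travels in the payload: any worker can replay the identical request. A minimal sketch — the worker function and payload shape are illustrative, not a prescribed API:

```python
def run_agent_task(payload: dict) -> dict:
    """A stateless worker: everything it needs is in the payload;
    nothing is read from worker-local state."""
    task = payload["task"]
    context = payload.get("context", [])
    return {"task": task, "result": f"processed ({len(context)} context items)"}

def retry_on_any_worker(payload: dict, workers: list, max_attempts: int = 3) -> dict:
    """If a worker dies mid-task, replay the identical payload on another one."""
    last_error = None
    for attempt in range(max_attempts):
        worker = workers[attempt % len(workers)]
        try:
            return worker(payload)
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"Task failed after {max_attempts} attempts") from last_error

payload = {"task": "summarize report", "context": ["doc1", "doc2"]}

def flaky(p):
    # Simulates a pod dying mid-task.
    raise ConnectionError("pod died")

# First attempt fails, second worker completes with the same payload.
result = retry_on_any_worker(payload, [flaky, run_agent_task])
```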
Strategy 2: Agent Pools
Maintain pools of pre-initialized agent instances of each specialization. Instead of spawning a new agent for every request, pull from the pool:
```python
import asyncio
import time
from contextlib import asynccontextmanager
from dataclasses import dataclass
from typing import Any, AsyncIterator, Callable

@dataclass
class PooledAgent:
    agent_id: str
    agent_type: str
    agent_instance: Any
    in_use: bool = False
    created_at: float = 0.0
    requests_handled: int = 0
    last_used_at: float = 0.0

    def __post_init__(self):
        self.created_at = time.time()
        self.last_used_at = time.time()

class AgentPool:
    """
    Pool of reusable agent instances of the same type.
    Reduces initialization overhead and enables connection reuse for tools.
    """
    def __init__(
        self,
        agent_type: str,
        factory: Callable[[], Any],
        min_size: int = 2,
        max_size: int = 10,
        max_requests_per_agent: int = 1000,
        idle_timeout_seconds: float = 300.0,
    ):
        self.agent_type = agent_type
        self.factory = factory
        self.min_size = min_size
        self.max_size = max_size
        self.max_requests = max_requests_per_agent
        self.idle_timeout = idle_timeout_seconds
        self._pool: list[PooledAgent] = []
        self._lock = asyncio.Lock()
        self._available = asyncio.Semaphore(max_size)

    async def initialize(self) -> None:
        """Pre-warm the pool with min_size agents."""
        for i in range(self.min_size):
            agent = self._create_agent(i)
            self._pool.append(agent)
        print(f"[AgentPool:{self.agent_type}] Pre-warmed with {self.min_size} agents.")

    def _create_agent(self, index: int) -> PooledAgent:
        instance = self.factory()
        return PooledAgent(
            agent_id=f"{self.agent_type}-{index}-{int(time.time())}",
            agent_type=self.agent_type,
            agent_instance=instance,
        )

    @asynccontextmanager
    async def acquire(self) -> AsyncIterator[PooledAgent]:
        """Context manager: acquire an agent from the pool, return it when done."""
        await self._available.acquire()
        agent = None
        async with self._lock:
            # Find an available, healthy agent
            for candidate in self._pool:
                if not candidate.in_use and candidate.requests_handled < self.max_requests:
                    agent = candidate
                    agent.in_use = True
                    break
            # None available — create a new one if under max_size
            if agent is None and len(self._pool) < self.max_size:
                agent = self._create_agent(len(self._pool))
                agent.in_use = True
                self._pool.append(agent)
                print(f"[AgentPool:{self.agent_type}] Scaled up to {len(self._pool)} agents.")
        if agent is None:
            self._available.release()
            raise RuntimeError(f"Agent pool '{self.agent_type}' exhausted.")
        try:
            yield agent
        finally:
            async with self._lock:
                agent.in_use = False
                agent.requests_handled += 1
                agent.last_used_at = time.time()
            self._available.release()

    async def scale_down_idle(self) -> int:
        """Remove agents idle beyond idle_timeout, never shrinking below
        min_size. Returns the number removed."""
        now = time.time()
        removed = 0
        async with self._lock:
            for agent in list(self._pool):
                if len(self._pool) <= self.min_size:
                    break
                if not agent.in_use and (now - agent.last_used_at) > self.idle_timeout:
                    self._pool.remove(agent)
                    removed += 1
        return removed

    @property
    def stats(self) -> dict:
        in_use = sum(1 for a in self._pool if a.in_use)
        return {
            "agent_type": self.agent_type,
            "pool_size": len(self._pool),
            "in_use": in_use,
            "available": len(self._pool) - in_use,
            "total_requests": sum(a.requests_handled for a in self._pool),
        }

# Usage
researcher_pool = AgentPool(
    agent_type="researcher",
    factory=lambda: ObservableAgent("researcher", "research analyst", "gpt-4o-mini"),
    min_size=2,
    max_size=8,
)

async def handle_research_request(task: str, correlation_id: str) -> str:
    async with researcher_pool.acquire() as pooled:
        return pooled.agent_instance.run(task, correlation_id)
```
Cost Optimization Through Model Tiering
Not every task needs GPT-4o. A carefully designed tiering strategy can reduce LLM costs by 60–80% while maintaining output quality.
Tiering Strategy
| Tier | Model | Cost (approx) | Use For |
|---|---|---|---|
| T1 — Nano | GPT-4o-mini, Claude Haiku | ~$0.10/1M tokens | Routing, classification, simple extraction |
| T2 — Standard | GPT-4o, Claude Sonnet | ~$3–5/1M tokens | Standard agent tasks, analysis, code gen |
| T3 — Premium | o1, Claude Opus | ~$15–75/1M tokens | Complex reasoning, architectural decisions |
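A back-of-envelope calculation shows where savings in that range can come from. The prices below are the rough table figures and the tier shares are assumed for illustration, not measured:

```python
# Illustrative per-1M-token prices, taken roughly from the table above.
PRICE_PER_M = {"nano": 0.10, "standard": 4.0, "premium": 30.0}

def blended_cost(distribution: dict[str, float]) -> float:
    """Average cost per 1M tokens under a given tier mix
    (shares must sum to 1.0)."""
    return sum(PRICE_PER_M[tier] * share for tier, share in distribution.items())

# Baseline: everything on the standard model.
flat = blended_cost({"standard": 1.0})
# Tiered: assumed mix where most tasks are routable to the nano tier.
tiered = blended_cost({"nano": 0.75, "standard": 0.24, "premium": 0.01})
savings = 1 - tiered / flat  # roughly two-thirds under these assumptions
```

The savings figure is entirely driven by the assumed mix, which is exactly why measuring your real tier distribution matters.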
```python
from enum import Enum

class ModelTier(Enum):
    NANO = "nano"
    STANDARD = "standard"
    PREMIUM = "premium"

TIER_MODELS = {
    ModelTier.NANO: "gpt-4o-mini",
    ModelTier.STANDARD: "gpt-4o",
    ModelTier.PREMIUM: "o1-preview",
}

def classify_task_complexity(task: str) -> ModelTier:
    """
    Use a cheap model to classify task complexity before routing to the right model.
    Meta-routing: spend ~$0.001 to save potentially $0.10+ on misdirected calls.
    """
    from langchain_openai import ChatOpenAI
    from langchain_core.messages import SystemMessage, HumanMessage

    classifier = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
    response = classifier.invoke([
        SystemMessage(content="""Classify the complexity of this agent task.
Reply with exactly one word: NANO, STANDARD, or PREMIUM.

NANO: routing, classification, simple Q&A, extraction from structured data
STANDARD: research, analysis, code generation, writing, summarization
PREMIUM: multi-step reasoning, strategic planning, complex debugging, architecture decisions"""),
        HumanMessage(content=task),
    ])
    tier_str = response.content.strip().upper()
    try:
        return ModelTier[tier_str]
    except KeyError:
        return ModelTier.STANDARD  # safe default

class TieredAgentRouter:
    """Routes tasks to the appropriate model tier based on complexity classification."""

    def __init__(self, pools: dict[ModelTier, AgentPool]):
        self.pools = pools

    async def route_and_execute(self, task: str, correlation_id: str) -> str:
        tier = classify_task_complexity(task)
        model = TIER_MODELS[tier]
        print(f"[TieredRouter] Task classified as {tier.value}, using model: {model}")
        async with self.pools[tier].acquire() as pooled:
            return pooled.agent_instance.run(task, correlation_id)
```
Tip: Measure your actual tier distribution in production. Most teams find that the vast majority of agent tasks fall into NANO or STANDARD, with genuinely PREMIUM work being rare. Optimize the routing classifier heavily — every task misclassified as PREMIUM when STANDARD would suffice costs 10–30× more than necessary.
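Measuring that distribution can be as simple as counting classifier decisions. A sketch (the class name and sample data are illustrative):

```python
from collections import Counter

class TierDistributionTracker:
    """Counts how often each tier is chosen so the routing classifier
    can be audited against real traffic."""
    def __init__(self):
        self._counts: Counter[str] = Counter()

    def record(self, tier: str) -> None:
        self._counts[tier] += 1

    def distribution(self) -> dict[str, float]:
        """Fraction of traffic per tier (empty dict before any traffic)."""
        total = sum(self._counts.values())
        return {t: c / total for t, c in self._counts.items()} if total else {}

# Illustrative traffic: 6 nano, 3 standard, 1 premium classification.
tracker = TierDistributionTracker()
for tier in ["nano"] * 6 + ["standard"] * 3 + ["premium"]:
    tracker.record(tier)
dist = tracker.distribution()
```

A hook in `TieredAgentRouter.route_and_execute` that calls `tracker.record(tier.value)` after classification is enough to feed it.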
Deployment Patterns
Pattern 1: Monolith (All Agents in One Process)
```
┌─────────────────────────────────────┐
│        Agent System Process         │
│  ┌──────────┐  ┌──────────────────┐ │
│  │Supervisor│  │  Worker Agents   │ │
│  │  Agent   │──│ Research/Analyst │ │
│  └──────────┘  │ Writer/Reviewer  │ │
│                └──────────────────┘ │
│  ┌───────────────────────────────┐  │
│  │     In-Memory Message Bus     │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘
```
- Pros: Simple deployment, no network overhead between agents, easy debugging
- Cons: Single point of failure, can't scale individual agents independently
- Use when: Prototypes, low traffic, a team of fewer than three agents
Pattern 2: Microservices (Each Agent as a Service)
```
┌────────────┐  HTTP/gRPC   ┌───────────────────┐
│ Supervisor │─────────────►│ Researcher Service│
│  Service   │              └───────────────────┘
│ (port 8000)│  HTTP/gRPC   ┌───────────────────┐
│            │─────────────►│  Analyst Service  │
│            │              └───────────────────┘
│            │  HTTP/gRPC   ┌───────────────────┐
│            │─────────────►│  Writer Service   │
└────────────┘              └───────────────────┘
```
- Pros: Independent scaling per agent type, isolated failures, independent deployment
- Cons: Network latency between agents, more complex operations, distributed tracing required
- Use when: Different agents have very different resource/scaling needs
Pattern 3: Event-Driven (Agents Consume from Queues)
```
User Request
     │
     ▼
┌──────────┐    ┌─────────┐    ┌──────────────┐
│Supervisor│───►│ Queue   │───►│ Worker Pods  │
│ Service  │    │ (Redis/ │    │ (auto-scaled)│
└──────────┘    │  Kafka) │    └──────┬───────┘
                └─────────┘           │
                                      │ results
                                      ▼
                              ┌───────────────┐
                              │ Results Store │
                              │  (Redis/DB)   │
                              └───────────────┘
```
- Pros: Best scalability, agents scale based on queue depth, natural backpressure
- Cons: Async only, more infrastructure complexity, harder to implement request-response patterns
- Use when: High-volume async workflows, or when cost optimization is critical
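The queue-depth scaling rule Pattern 3 relies on — one pod per N queued tasks, clamped between a floor and a ceiling — can be sketched as a pure function (the parameter values are illustrative):

```python
import math

def desired_replicas(queue_depth: int, target_per_pod: int = 10,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """One worker pod per `target_per_pod` queued tasks,
    clamped to [min_replicas, max_replicas]."""
    wanted = math.ceil(queue_depth / target_per_pod) if queue_depth > 0 else min_replicas
    return max(min_replicas, min(max_replicas, wanted))
```

This is the same arithmetic a Kubernetes HPA performs against an external queue-depth metric with an `AverageValue` target of `target_per_pod`.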
Kubernetes Deployment Configuration
```yaml
# researcher-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: researcher-agent
  labels:
    app: researcher-agent
    tier: agent
spec:
  replicas: 2  # start small, HPA will scale up
  selector:
    matchLabels:
      app: researcher-agent
  template:
    metadata:
      labels:
        app: researcher-agent
    spec:
      containers:
        - name: researcher-agent
          image: your-registry/researcher-agent:v1.2.0
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "2Gi"
          env:
            - name: AGENT_TYPE
              value: "researcher"
            - name: MODEL_TIER
              value: "standard"
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: agent-secrets
                  key: redis-url
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: agent-secrets
                  key: openai-api-key
            - name: MAX_CONCURRENT_TASKS
              value: "5"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
---
# Horizontal Pod Autoscaler based on queue depth
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: researcher-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: researcher-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: redis_queue_depth
          selector:
            matchLabels:
              queue: research_tasks
        target:
          type: AverageValue
          averageValue: "10"  # scale up when queue depth exceeds 10 per pod
---
# agent-rate-limiter ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-rate-limits
data:
  config.yaml: |
    rate_limits:
      researcher:
        requests_per_minute: 60
        tokens_per_minute: 100000
        burst: 10
      analyst:
        requests_per_minute: 30
        tokens_per_minute: 200000
        burst: 5
      writer:
        requests_per_minute: 20
        tokens_per_minute: 150000
        burst: 3
    cost_guardrails:
      max_cost_per_workflow_usd: 1.00
      max_cost_per_hour_usd: 50.00
      alert_threshold_usd: 40.00
```
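The `rate_limits` section above maps naturally onto a token bucket: a sustained `requests_per_minute` rate plus a short `burst` allowance. A minimal in-process sketch — the class is illustrative, and production systems usually enforce this in a shared store like Redis so all pods see the same bucket:

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter matching the ConfigMap shape:
    a sustained per-minute rate with a short burst allowance."""
    def __init__(self, requests_per_minute: int, burst: int):
        self.rate = requests_per_minute / 60.0  # tokens added per second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; refill based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Researcher agent: 60 req/min with a burst of 10 (values from the config).
limiter = TokenBucket(requests_per_minute=60, burst=10)
# Fifteen back-to-back requests: only the burst allowance gets through.
allowed = sum(1 for _ in range(15) if limiter.allow())
```

A rejected request would typically be re-queued with backoff rather than dropped, which fits the event-driven pattern's natural backpressure.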
Cost Monitoring and Guardrails
```python
import threading
import time

class CostGuardrail:
    """Enforces per-workflow and system-wide cost limits."""

    def __init__(self, max_per_workflow: float, max_per_hour: float):
        self.max_per_workflow = max_per_workflow
        self.max_per_hour = max_per_hour
        self._workflow_costs: dict[str, float] = {}
        self._hourly_total: float = 0.0
        self._window_start: float = time.time()
        self._lock = threading.Lock()

    def record_cost(self, correlation_id: str, cost_usd: float) -> None:
        with self._lock:
            # Reset the hourly window when it elapses, so the limit applies
            # per hour rather than cumulatively over the process lifetime.
            now = time.time()
            if now - self._window_start >= 3600:
                self._hourly_total = 0.0
                self._window_start = now
            self._workflow_costs[correlation_id] = (
                self._workflow_costs.get(correlation_id, 0.0) + cost_usd
            )
            self._hourly_total += cost_usd
            # Check per-workflow limit
            workflow_total = self._workflow_costs[correlation_id]
            if workflow_total > self.max_per_workflow:
                raise RuntimeError(
                    f"Workflow '{correlation_id}' exceeded cost limit: "
                    f"${workflow_total:.4f} > ${self.max_per_workflow:.2f}"
                )
            # Check hourly limit
            if self._hourly_total > self.max_per_hour:
                raise RuntimeError(
                    f"Hourly cost limit exceeded: ${self._hourly_total:.2f}"
                )

    def get_workflow_cost(self, correlation_id: str) -> float:
        return self._workflow_costs.get(correlation_id, 0.0)
```
Scaling Decision Matrix
| Traffic | Agent Count | Recommended Pattern | Key Infra |
|---|---|---|---|
| <100 req/day | 2–5 agents | Monolith | Single VM or container |
| 100–1K req/day | 3–8 agents | Microservices | Docker Compose or small K8s |
| 1K–10K req/day | 5–15 agents | Event-driven | Redis Streams + K8s HPA |
| >10K req/day | Any count | Event-driven + model tiering | Kafka + K8s + CDN caching |
Key Takeaways
- Make agent workers stateless from day one — it is dramatically cheaper to add statelessness early than to retrofit it later
- Agent pools eliminate the cold-start overhead of initializing LLM clients and tools on every request
- Model tiering (routing easy tasks to cheaper models) typically cuts costs by 60–80% with minimal quality impact
- The monolith → microservices → event-driven progression mirrors traditional web services, but model costs make the event-driven pattern more economically compelling earlier
- Use Kubernetes HPA scaled on queue depth, not CPU — agent pods are mostly idle while waiting for LLM responses, so CPU is a poor scaling signal
- Always implement cost guardrails before going to production; runaway agent loops can generate thousands of dollars in API costs within minutes