Production Multi-Agent Systems

Scaling Multi-Agent Systems

15m read


Overview

Building a multi-agent system that works for ten requests per day is very different from building one that handles ten thousand. Scaling introduces new challenges: coordinating many concurrent agent instances, balancing load without creating bottlenecks, managing costs as token consumption grows with traffic, and choosing deployment architectures that don't become brittle under load.

This lesson covers horizontal scaling strategies, agent pooling, cost optimization through model tiering, and production deployment patterns for multi-agent systems.


The Scaling Problem is Different for Agents

Traditional microservices scale by adding more compute. Agent systems are different because:

  1. LLM calls are expensive — each agent action costs money. Poor design can make costs grow faster than traffic.
  2. Context windows have limits — you can't just "cache more." Stateful agent conversations have hard token limits.
  3. Coordination overhead grows non-linearly — doubling agents can more than double coordination costs.
  4. Model providers have rate limits — concurrent agents may all hit the same API rate limit simultaneously.

Effective scaling requires addressing each of these dimensions specifically.


Horizontal Scaling Strategies

Strategy 1: Stateless Agent Workers

The simplest and most scalable pattern: make each agent invocation completely stateless. All context is passed in the request payload; no state lives on the agent worker itself.

                    ┌──────────────────────────┐
                    │      Load Balancer       │
                    │   (round-robin / RPS)    │
                    └─────────┬────────────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
     ┌────────▼────┐ ┌────────▼────┐ ┌────────▼────┐
     │  Agent Pod  │ │  Agent Pod  │ │  Agent Pod  │
     │  Worker 1   │ │  Worker 2   │ │  Worker 3   │
     └─────────────┘ └─────────────┘ └─────────────┘
              │               │               │
              └───────────────┼───────────────┘
                              │
                    ┌─────────▼────────────────┐
                    │    Shared State Store    │
                    │   (Redis / PostgreSQL)   │
                    └──────────────────────────┘

Each pod processes one agent task at a time. The load balancer distributes tasks. Shared state (blackboard, task queue, results) lives in the shared store (Redis or PostgreSQL).

The rule: If an agent pod dies mid-task, the task can be retried by any other pod without data loss.
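
What "stateless" means in code can be sketched with plain dicts standing in for the Redis/PostgreSQL store (the task shape and the `results_store` keying are illustrative assumptions):

```python
# In-memory stand-ins for the shared state store (Redis/PostgreSQL in production).
task_queue: list[dict] = []
results_store: dict[str, str] = {}

def submit_task(task_id: str, payload: dict) -> None:
    # The payload carries ALL context the agent needs; nothing lives on the worker.
    task_queue.append({"task_id": task_id, "payload": payload})

def worker_step() -> None:
    """One stateless worker iteration. Any pod can run this for any task:
    if a pod dies mid-task, re-enqueueing the same task_id is safe."""
    if not task_queue:
        return
    task = task_queue.pop(0)
    if task["task_id"] in results_store:
        return  # idempotent: a retry of an already-completed task is a no-op
    # ... invoke the agent with task["payload"] as its full context ...
    results_store[task["task_id"]] = f"done: {task['payload']['goal']}"

submit_task("t-1", {"goal": "summarize report"})
submit_task("t-1", {"goal": "summarize report"})  # duplicate delivery (a retry)
worker_step()
worker_step()
```

The idempotency check keyed on `task_id` is what makes the rule hold: a duplicate or retried task produces no second result and no double work.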

Strategy 2: Agent Pools

Maintain pools of pre-initialized agent instances of each specialization. Instead of spawning a new agent for every request, pull from the pool:

import asyncio
import time
from contextlib import asynccontextmanager
from dataclasses import dataclass
from typing import Any, AsyncIterator, Callable

@dataclass
class PooledAgent:
    agent_id: str
    agent_type: str
    agent_instance: Any
    in_use: bool = False
    created_at: float = 0.0
    requests_handled: int = 0
    last_used_at: float = 0.0

    def __post_init__(self):
        self.created_at = time.time()
        self.last_used_at = time.time()

class AgentPool:
    """
    Pool of reusable agent instances of the same type.
    Reduces initialization overhead and enables connection reuse for tools.
    """

    def __init__(
        self,
        agent_type: str,
        factory: Callable[[], Any],
        min_size: int = 2,
        max_size: int = 10,
        max_requests_per_agent: int = 1000,
        idle_timeout_seconds: float = 300.0,
    ):
        self.agent_type = agent_type
        self.factory = factory
        self.min_size = min_size
        self.max_size = max_size
        self.max_requests = max_requests_per_agent
        self.idle_timeout = idle_timeout_seconds

        self._pool: list[PooledAgent] = []
        self._lock = asyncio.Lock()
        self._available = asyncio.Semaphore(max_size)

    async def initialize(self) -> None:
        """Pre-warm the pool with min_size agents."""
        for i in range(self.min_size):
            agent = self._create_agent(i)
            self._pool.append(agent)
        print(f"[AgentPool:{self.agent_type}] Pre-warmed with {self.min_size} agents.")

    def _create_agent(self, index: int) -> PooledAgent:
        instance = self.factory()
        return PooledAgent(
            agent_id=f"{self.agent_type}-{index}-{int(time.time())}",
            agent_type=self.agent_type,
            agent_instance=instance,
        )

    @asynccontextmanager
    async def acquire(self) -> AsyncIterator[PooledAgent]:
        """Context manager: acquire an agent from the pool, return when done."""
        await self._available.acquire()
        agent = None

        async with self._lock:
            # Find an available, healthy agent
            for candidate in self._pool:
                if not candidate.in_use and candidate.requests_handled < self.max_requests:
                    agent = candidate
                    agent.in_use = True
                    break

            # None available — create a new one if under max_size
            if agent is None and len(self._pool) < self.max_size:
                agent = self._create_agent(len(self._pool))
                agent.in_use = True
                self._pool.append(agent)
                print(f"[AgentPool:{self.agent_type}] Scaled up to {len(self._pool)} agents.")

        if agent is None:
            self._available.release()
            raise RuntimeError(f"Agent pool '{self.agent_type}' exhausted.")

        try:
            yield agent
        finally:
            async with self._lock:
                agent.in_use = False
                agent.requests_handled += 1
                agent.last_used_at = time.time()
                self._available.release()

    async def scale_down_idle(self) -> int:
        """Remove agents idle beyond idle_timeout, never shrinking below min_size.
        Returns the number of agents removed."""
        now = time.time()
        async with self._lock:
            idle = [
                a for a in self._pool
                if not a.in_use and (now - a.last_used_at) > self.idle_timeout
            ]
            # Cap removals so the pool never drops below min_size
            to_remove = idle[: max(0, len(self._pool) - self.min_size)]
            for agent in to_remove:
                self._pool.remove(agent)
            return len(to_remove)

    @property
    def stats(self) -> dict:
        in_use = sum(1 for a in self._pool if a.in_use)
        return {
            "agent_type": self.agent_type,
            "pool_size": len(self._pool),
            "in_use": in_use,
            "available": len(self._pool) - in_use,
            "total_requests": sum(a.requests_handled for a in self._pool),
        }

# Usage
researcher_pool = AgentPool(
    agent_type="researcher",
    factory=lambda: ObservableAgent("researcher", "research analyst", "gpt-4o-mini"),
    min_size=2,
    max_size=8,
)

async def handle_research_request(task: str, correlation_id: str) -> str:
    async with researcher_pool.acquire() as agent:
        # acquire() yields the PooledAgent wrapper; the real agent is .agent_instance
        return agent.agent_instance.run(task, correlation_id)

Cost Optimization Through Model Tiering

Not every task needs GPT-4o. A carefully designed tiering strategy can reduce LLM costs by 60–80% while maintaining output quality.

Tiering Strategy

Tier          | Model                     | Cost (approx)     | Use For
--------------|---------------------------|-------------------|-------------------------------------------
T1 — Nano     | GPT-4o-mini, Claude Haiku | ~$0.10/1M tokens  | Routing, classification, simple extraction
T2 — Standard | GPT-4o, Claude Sonnet     | ~$3–5/1M tokens   | Standard agent tasks, analysis, code gen
T3 — Premium  | o1, Claude Opus           | ~$15–75/1M tokens | Complex reasoning, architectural decisions

from enum import Enum
from typing import Optional

class ModelTier(Enum):
    NANO = "nano"
    STANDARD = "standard"
    PREMIUM = "premium"

TIER_MODELS = {
    ModelTier.NANO: "gpt-4o-mini",
    ModelTier.STANDARD: "gpt-4o",
    ModelTier.PREMIUM: "o1-preview",
}

def classify_task_complexity(task: str) -> ModelTier:
    """
    Use a cheap model to classify task complexity before routing to the right model.
    Meta-routing: spend ~$0.001 to save potentially $0.10+ on misdirected calls.
    """
    from langchain_openai import ChatOpenAI
    from langchain_core.messages import SystemMessage, HumanMessage

    classifier = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
    response = classifier.invoke([
        SystemMessage(content="""Classify the complexity of this agent task.
Reply with exactly one word: NANO, STANDARD, or PREMIUM.

NANO: routing, classification, simple Q&A, extraction from structured data
STANDARD: research, analysis, code generation, writing, summarization  
PREMIUM: multi-step reasoning, strategic planning, complex debugging, architecture decisions"""),
        HumanMessage(content=task)
    ])
    tier_str = response.content.strip().upper()
    try:
        return ModelTier[tier_str]
    except KeyError:
        return ModelTier.STANDARD  # safe default

class TieredAgentRouter:
    """Routes tasks to the appropriate model tier based on complexity classification."""

    def __init__(self, pools: dict[ModelTier, AgentPool]):
        self.pools = pools
        self._classification_cost_saved = 0.0

    async def route_and_execute(self, task: str, correlation_id: str) -> str:
        tier = classify_task_complexity(task)
        model = TIER_MODELS[tier]

        print(f"[TieredRouter] Task classified as {tier.value}, using model: {model}")

        async with self.pools[tier].acquire() as agent:
            return agent.agent_instance.run(task, correlation_id)

Tip: Measure your actual tier distribution in production. Most teams find that 60–70% of agent tasks fall into NANO or STANDARD, with PREMIUM being genuinely rare. Optimize the routing classifier heavily — every misclassified task that goes to PREMIUM when STANDARD would suffice is 10–30× the necessary cost.
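
The arithmetic behind that tip can be sketched directly. The prices mirror the tier table above, but the 65/30/5 traffic split is an assumed distribution; measure your own. The realized savings also depend heavily on which model you defaulted to before tiering:

```python
# Back-of-envelope blended cost for model tiering.
# Prices per 1M tokens mirror the tier table; the traffic split is assumed.
PRICE_PER_M_TOKENS = {"nano": 0.10, "standard": 4.00, "premium": 30.00}
TIER_SHARE = {"nano": 0.65, "standard": 0.30, "premium": 0.05}

def blended_cost_per_m_tokens(shares: dict[str, float]) -> float:
    """Weighted average price across tiers for a given traffic distribution."""
    return sum(PRICE_PER_M_TOKENS[tier] * share for tier, share in shares.items())

tiered = blended_cost_per_m_tokens(TIER_SHARE)   # ~$2.77 per 1M tokens
all_standard = PRICE_PER_M_TOKENS["standard"]    # baseline: everything on T2
savings = 1 - tiered / all_standard              # ~31% vs an all-T2 baseline
```

Against an all-standard baseline the assumed split saves about a third; the larger 60–80% figures apply when the pre-tiering default was a premium-class model or when premium traffic is aggressively reclassified downward.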


Deployment Patterns

Pattern 1: Monolith (All Agents in One Process)

┌─────────────────────────────────────┐
│         Agent System Process        │
│  ┌──────────┐  ┌──────────────────┐ │
│  │Supervisor│  │  Worker Agents   │ │
│  │  Agent   │──│ Research/Analyst │ │
│  └──────────┘  │  Writer/Reviewer │ │
│                └──────────────────┘ │
│  ┌──────────────────────────────┐   │
│  │    In-Memory Message Bus     │   │
│  └──────────────────────────────┘   │
└─────────────────────────────────────┘

Pros: Simple deployment, no network overhead between agents, easy debugging
Cons: Single point of failure, can't scale individual agents independently
Use when: Prototype, low traffic, a team of <3 agents

Pattern 2: Microservices (Each Agent as a Service)

┌─────────────┐  HTTP/gRPC   ┌───────────────────┐
│ Supervisor  │─────────────►│ Researcher Service│
│  Service    │              └───────────────────┘
│ (port 8000) │  HTTP/gRPC   ┌───────────────────┐
│             │─────────────►│  Analyst Service  │
│             │              └───────────────────┘
│             │  HTTP/gRPC   ┌───────────────────┐
│             │─────────────►│  Writer Service   │
└─────────────┘              └───────────────────┘

Pros: Independent scaling per agent type, isolated failures, independent deployment
Cons: Network latency between agents, more complex operations, distributed tracing required
Use when: Different agents have very different resource/scaling needs

Pattern 3: Event-Driven (Agents Consume from Queues)

User Request
    │
    ▼
┌──────────┐    ┌─────────┐    ┌──────────────┐
│Supervisor│───►│  Queue  │───►│ Worker Pods  │
│  Service │    │(Redis/  │    │ (auto-scaled)│
└──────────┘    │Kafka)   │    └──────┬───────┘
                └─────────┘           │
                                      │ results
                                      ▼
                              ┌───────────────┐
                              │ Results Store │
                              │  (Redis/DB)   │
                              └───────────────┘

Pros: Best scalability; agents scale based on queue depth; natural backpressure
Cons: Async only; more infrastructure complexity; harder to implement request-response patterns
Use when: High-volume async workflows where cost optimization is critical
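
The same flow can be sketched with `asyncio.Queue` standing in for Redis/Kafka; the bounded `maxsize` is what provides the natural backpressure mentioned above (the pod names and the sentinel-based shutdown are illustrative choices):

```python
import asyncio

async def worker(name: str, queue: asyncio.Queue, results: dict) -> None:
    """Worker-pod stand-in: consumes tasks until it receives a None sentinel."""
    while True:
        task = await queue.get()
        if task is None:
            queue.task_done()
            break
        await asyncio.sleep(0)  # the agent's LLM call would go here
        results[task["id"]] = f"handled by {name}"
        queue.task_done()

async def main() -> dict:
    # maxsize bounds the queue: producers block when workers fall behind
    queue: asyncio.Queue = asyncio.Queue(maxsize=8)
    results: dict = {}
    workers = [asyncio.create_task(worker(f"pod-{i}", queue, results))
               for i in range(3)]
    for i in range(10):
        await queue.put({"id": f"task-{i}"})
    for _ in workers:
        await queue.put(None)  # one shutdown sentinel per worker
    await asyncio.gather(*workers)
    return results

results = asyncio.run(main())
```

In the Kubernetes setup below, the number of worker tasks becomes the number of pods, and the HPA plays the role of deciding how many consumers the queue depth warrants.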


Kubernetes Deployment Configuration

# researcher-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: researcher-agent
  labels:
    app: researcher-agent
    tier: agent
spec:
  replicas: 2        # start small, HPA will scale up
  selector:
    matchLabels:
      app: researcher-agent
  template:
    metadata:
      labels:
        app: researcher-agent
    spec:
      containers:
      - name: researcher-agent
        image: your-registry/researcher-agent:v1.2.0
        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"
          limits:
            cpu: "1000m"
            memory: "2Gi"
        env:
        - name: AGENT_TYPE
          value: "researcher"
        - name: MODEL_TIER
          value: "standard"
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: agent-secrets
              key: redis-url
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: agent-secrets
              key: openai-api-key
        - name: MAX_CONCURRENT_TASKS
          value: "5"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
---
# Horizontal Pod Autoscaler based on queue depth
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: researcher-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: researcher-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: redis_queue_depth
        selector:
          matchLabels:
            queue: research_tasks
      target:
        type: AverageValue
        averageValue: "10"   # scale up when queue depth > 10 per pod
---
# agent-rate-limiter ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-rate-limits
data:
  config.yaml: |
    rate_limits:
      researcher:
        requests_per_minute: 60
        tokens_per_minute: 100000
        burst: 10
      analyst:
        requests_per_minute: 30
        tokens_per_minute: 200000
        burst: 5
      writer:
        requests_per_minute: 20
        tokens_per_minute: 150000
        burst: 3
    
    cost_guardrails:
      max_cost_per_workflow_usd: 1.00
      max_cost_per_hour_usd: 50.00
      alert_threshold_usd: 40.00
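
The ConfigMap above only declares the limits; something in the agent process still has to enforce them. A common client-side enforcement is a token bucket per agent type, sketched below (the mapping from config keys to constructor arguments is an assumption about how the config would be loaded):

```python
import time

class TokenBucket:
    """Client-side rate limiter matching the ConfigMap's
    requests_per_minute/burst. Tokens refill continuously; a request
    proceeds only when a whole token is available."""

    def __init__(self, requests_per_minute: int, burst: int):
        self.rate = requests_per_minute / 60.0  # tokens added per second
        self.capacity = float(burst)
        self.tokens = float(burst)              # start full: allow a burst
        self.updated_at = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# One bucket per agent type, mirroring the rate_limits section above
limiters = {
    "researcher": TokenBucket(requests_per_minute=60, burst=10),
    "writer": TokenBucket(requests_per_minute=20, burst=3),
}
```

A caller that gets `False` should back off or queue the task rather than hammer the provider; token-per-minute limits would need a second bucket denominated in tokens instead of requests.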

Cost Monitoring and Guardrails

import threading
import time

class CostGuardrail:
    """Enforces per-workflow and system-wide cost limits."""

    def __init__(self, max_per_workflow: float, max_per_hour: float):
        self.max_per_workflow = max_per_workflow
        self.max_per_hour = max_per_hour
        self._workflow_costs: dict[str, float] = {}
        self._hourly_total: float = 0.0
        self._hour_started_at: float = time.time()
        self._lock = threading.Lock()

    def record_cost(self, correlation_id: str, cost_usd: float) -> None:
        with self._lock:
            # Roll the hourly window so old spend doesn't block new work forever
            now = time.time()
            if now - self._hour_started_at >= 3600:
                self._hourly_total = 0.0
                self._hour_started_at = now

            self._workflow_costs[correlation_id] = (
                self._workflow_costs.get(correlation_id, 0.0) + cost_usd
            )
            self._hourly_total += cost_usd

            # Check per-workflow limit
            workflow_total = self._workflow_costs[correlation_id]
            if workflow_total > self.max_per_workflow:
                raise RuntimeError(
                    f"Workflow '{correlation_id}' exceeded cost limit: "
                    f"${workflow_total:.4f} > ${self.max_per_workflow:.2f}"
                )

            # Check hourly limit
            if self._hourly_total > self.max_per_hour:
                raise RuntimeError(
                    f"Hourly cost limit exceeded: ${self._hourly_total:.2f}"
                )

    def get_workflow_cost(self, correlation_id: str) -> float:
        return self._workflow_costs.get(correlation_id, 0.0)

Scaling Decision Matrix

Traffic        | Agent Count | Recommended Pattern          | Key Infra
---------------|-------------|------------------------------|---------------------------
<100 req/day   | 2–5 agents  | Monolith                     | Single VM or container
100–1K req/day | 3–8 agents  | Microservices                | Docker Compose or small K8s
1K–10K req/day | 5–15 agents | Event-driven                 | Redis Streams + K8s HPA
>10K req/day   | Any count   | Event-driven + model tiering | Kafka + K8s + CDN caching

Key Takeaways

  • Make agent workers stateless from day one — it is dramatically cheaper to add statelessness early than to retrofit it later
  • Agent pools eliminate the cold-start overhead of initializing LLM clients and tools on every request
  • Model tiering (routing easy tasks to cheaper models) typically cuts costs by 60–80% with minimal quality impact
  • The monolith → microservices → event-driven progression mirrors traditional web services, but model costs make the event-driven pattern more economically compelling earlier
  • Use Kubernetes HPA scaled on queue depth, not CPU — agent pods are mostly idle while waiting for LLM responses, so CPU is a poor scaling signal
  • Always implement cost guardrails before going to production; runaway agent loops can generate thousands of dollars in API costs within minutes

Production Multi-Agent Systems — Check Your Understanding

3 questions · passing score 70%

  1. What is 'agent thrashing' and how can it be prevented?

  2. In a distributed multi-agent system, what is the purpose of a correlation ID?

  3. What scaling strategy is most appropriate for CPU-bound agent tool execution?
