Designing an AI Inference API at Scale (Anthropic/OpenAI Level)

By a Senior Software Architect

Designing an AI inference system for millions of users isn't just about wrapping a Python script in a Flask app. When you operate at the scale of Anthropic or OpenAI—handling billions of tokens daily, serving critical enterprise workloads, and ensuring safety first—the architecture must be robust, fault-tolerant, and hyper-optimized.

In this deep dive, we will walk through the design of a massive-scale AI Inference API: starting with requirements and clearing up ambiguity, moving on to back-of-the-envelope estimations, and finally evolving the architecture from a simple prototype to a global distributed system.


1. Requirements & Clarifications

Before drawing a single box, we must ask the right questions.

1.1 Functional Requirements

  • Inference types: Chat completions (interactive), Embeddings (batch), and potentially multimodal.

  • Streaming: Must support Server-Sent Events (SSE) for token-by-token streaming responses (critical for perceived latency); see the framing example after this list.

  • Safety: Every request must pass through safety guardrails (input/output filters) with zero bypass capability.

  • Tiers: Support for Free, Pro, and Enterprise tiers with strict isolation and priority.
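
To make the streaming requirement concrete, here is a minimal, framework-agnostic sketch of SSE framing for a token stream. The event payload shape and the [DONE] sentinel are illustrative assumptions, not a description of any specific provider's wire format.

```python
import json

def sse_stream(token_iter):
    """Frame already-generated tokens as Server-Sent Events."""
    for token in token_iter:
        payload = json.dumps({"type": "token", "text": token})
        yield f"data: {payload}\n\n"   # each SSE event ends with a blank line
    yield "data: [DONE]\n\n"           # sentinel so the client knows the stream is complete

# What goes over the wire for a three-token completion:
for chunk in sse_stream(["Hello", ",", " world"]):
    print(chunk, end="")
```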

1.2 Non-Functional Requirements

  • High Availability: 99.99% uptime. AI is now a critical infrastructure dependency.

  • Low Latency:

    • TTFT (Time To First Token): < 200ms (P95). This is the "responsiveness" metric.

    • TPOT (Time Per Output Token): < 50ms. Determines the user's reading speed (a worked example follows this list).

  • Scalability: Horizontal scaling to handle unexpected viral traffic spikes.

  • Reliability: No dropped requests during deployments or partial outages.
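
Under these targets, streaming is what makes the system feel fast: a 500-token response takes roughly $0.2 + 499 \times 0.05 \approx 25$ seconds end to end, but the user sees the first token after 200ms and then reads along at about 20 tokens/sec.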

1.3 Estimations (The "Anthropic Scale")

Let's reverse-engineer some numbers based on public data (e.g., ChatGPT usage or H100 benchmarks) to understand the magnitude.

  • Daily Requests: Assume ~1 Billion requests/day.

  • Average Request: 1k input tokens, 500 output tokens.

  • Traffic Pattern: Peaky. Daily peak is likely 2-3x the average.

  • Throughput Calculation:

    • $10^9$ requests / 86400 seconds ≈ 11,500 requests/sec (average).

    • Peak Traffic: ~30,000 requests/sec.

    • Token Generation Load: 30k req/sec * 500 output tokens per request ≈ 15 Million tokens/sec (generation).

Infrastructure Implication: If an optimized H100 node (using vLLM or TensorRT-LLM) can generate ~3,000 tokens/sec for a massive model (like Llama 3 70B or Claude Sonnet equivalent):

  • $15,000,000 / 3,000 = 5,000$ H100 GPUs required just for generation (excluding prefill overhead).

  • This is a massive fleet. Efficiency improvements (batching, caching) translate to millions of dollars in savings.
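
As a sanity check, here is the same arithmetic as a small script. The inputs (1B requests/day, 500 output tokens per request, a ~2.6x peak factor, ~3,000 tokens/sec per H100 node) are the assumptions stated above, not measured values.

```python
# Back-of-the-envelope capacity check using the assumptions above.
DAILY_REQUESTS = 1_000_000_000        # ~1B requests/day (assumption)
OUTPUT_TOKENS_PER_REQUEST = 500       # average output tokens (assumption)
PEAK_FACTOR = 2.6                     # daily peak vs. average (assumption)
NODE_TOKENS_PER_SEC = 3_000           # optimized H100 node throughput (assumption)

avg_rps = DAILY_REQUESTS / 86_400                          # ≈ 11,500 req/sec
peak_rps = avg_rps * PEAK_FACTOR                           # ≈ 30,000 req/sec
peak_token_load = peak_rps * OUTPUT_TOKENS_PER_REQUEST     # ≈ 15M tokens/sec
gpu_nodes = peak_token_load / NODE_TOKENS_PER_SEC          # ≈ 5,000 nodes (decode only)

print(f"avg: {avg_rps:,.0f} req/s, peak: {peak_rps:,.0f} req/s")
print(f"peak decode load: {peak_token_load:,.0f} tokens/s -> {gpu_nodes:,.0f} GPU nodes")
```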


2. High-Level Design Evolution

We don't start at Google scale. We evolve toward it.

Phase 1: The "Simple" Wrapper

In the beginning, you might have a load balancer and a few GPU nodes running a model server.

Why this fails at scale:

  1. Head-of-Line Blocking: A long request (generating 4k tokens) blocks a short one if handled synchronously.

  2. GPU Underutilization: Static batching is inefficient; the GPU idles while it waits for the slowest request in the batch to finish.

  3. No Safety Net: If GPUs are overloaded, requests just drop or time out.

Phase 2: Asynchronous & Decoupled Architecture

To handle high concurrency, we must separate the Control Plane (API handling) from the Data Plane (Inference execution).

  • Priority Queue: Enterprise customers get P0, Free users P1 (a minimal queue sketch follows this list).

  • Continuous Batching: The engine doesn't wait for a whole batch to finish. As soon as one request finishes, a new one takes its slot in the batch.
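
A minimal sketch of the tiered priority queue, assuming a simple priority mapping (Enterprise = 0, Pro = 1, Free = 2) with arrival-order tie-breaking. A production scheduler would also track token budgets and fairness, which are omitted here.

```python
import heapq
import itertools
from dataclasses import dataclass, field

PRIORITY = {"enterprise": 0, "pro": 1, "free": 2}  # lower value = served first (assumed mapping)
_arrival = itertools.count()                       # tie-breaker preserves FIFO order within a tier

@dataclass(order=True)
class QueuedRequest:
    priority: int
    arrival: int
    payload: dict = field(compare=False)

class TieredQueue:
    def __init__(self):
        self._heap = []

    def push(self, tier: str, payload: dict) -> None:
        heapq.heappush(self._heap, QueuedRequest(PRIORITY[tier], next(_arrival), payload))

    def pop(self) -> dict:
        return heapq.heappop(self._heap).payload

q = TieredQueue()
q.push("free", {"prompt": "hi"})
q.push("enterprise", {"prompt": "summarize this contract"})
print(q.pop())  # the enterprise request is dequeued first
```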

Phase 3: The "Anthropic/OpenAI" Scale Architecture

At this level, we introduce KV Caching, Global Traffic Management, and Safety Layers.


3. Component Deep Dive

3.1 The Gateway (The Bouncer)

The gateway is the first line of defense.

  • Authentication: Verify API keys via cached lookups (don't hit DB for every request).

  • Smart Rate Limiting:

    • Token Bucket Algorithm: Allow bursts while enforcing an average rate (a minimal sketch follows this list).

    • Tiered Limits: Higher tiers (e.g., Enterprise) get higher burst capacity.

    • Cost-Based Limiting: Limit based on estimated tokens, not just request count. A request with a 100k-token context is far more expensive than "Hello World".
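
A minimal token-bucket sketch, extended with the cost-based idea above: each request consumes bucket capacity proportional to its estimated token usage rather than counting as a single unit. The capacities and refill rates are illustrative, not real tier limits.

```python
import time

class TokenBucket:
    """Cost-aware token bucket: capacity is measured in model tokens, not requests."""

    def __init__(self, capacity_tokens: float, refill_tokens_per_sec: float):
        self.capacity = capacity_tokens
        self.refill_rate = refill_tokens_per_sec
        self.level = capacity_tokens
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.level = min(self.capacity, self.level + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now

    def allow(self, estimated_tokens: float) -> bool:
        """Admit the request only if its estimated token cost fits in the bucket."""
        self._refill()
        if estimated_tokens <= self.level:
            self.level -= estimated_tokens
            return True
        return False

# Illustrative limits: a free-tier bucket allows small bursts but rejects huge-context requests.
free_bucket = TokenBucket(capacity_tokens=20_000, refill_tokens_per_sec=500)
print(free_bucket.allow(estimated_tokens=1_500))    # small chat request -> True
print(free_bucket.allow(estimated_tokens=100_000))  # 100k-context request -> False
```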

3.2 The Orchestrator & Scheduler (The Brain)

This component decides where to run the inference.

  • Affinity Routing: If a user is having a multi-turn conversation, route them to a node that already has their KV Cache loaded (Context Caching). This reduces "Prefill" time to near zero.

  • Load Balancing strategies:

    • Least Outstanding Tokens: Route to the node with the fewest tokens currently in its generation queue.

    • Power of Two Choices: Pick two random nodes, choose the less loaded of the two.
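
A sketch combining the two strategies above: sample two replicas at random (power of two choices) and route to whichever has fewer outstanding tokens. The replica bookkeeping is assumed to live in the Orchestrator.

```python
import random

class Replica:
    def __init__(self, name: str):
        self.name = name
        self.outstanding_tokens = 0   # tokens queued or mid-generation on this node

def pick_replica(replicas: list[Replica]) -> Replica:
    """Power of two choices over the 'least outstanding tokens' signal."""
    a, b = random.sample(replicas, 2)
    return a if a.outstanding_tokens <= b.outstanding_tokens else b

fleet = [Replica(f"gpu-node-{i}") for i in range(8)]
fleet[3].outstanding_tokens = 40_000        # simulate one heavily loaded node
chosen = pick_replica(fleet)
chosen.outstanding_tokens += 1_500          # account for the new request's estimated tokens
print(chosen.name)
```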

3.3 The Model Engine (The Muscle)

We don't just run model.generate(). We use specialized engines like vLLM.

  • PagedAttention: Manage the KV cache in fixed-size blocks (analogous to OS memory paging) so memory is allocated on demand and fragmentation is eliminated.

  • Continuous Batching (Orca), sketched in code after this list:

    • Request A arrives. Start generating.

    • Request B arrives. Insert it into the next token generation step of Request A.

    • Result: GPU works on Request A and B simultaneously without waiting.

  • Speculative Decoding: Use a small "draft" model to guess several tokens ahead, and the big model to verify them in parallel. This can increase tokens/sec by roughly 2-3x when most draft tokens are accepted.
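
A toy event loop illustrating the continuous batching behavior described above, assuming a fake decode step that produces one token per active request per iteration. Real engines (vLLM, TensorRT-LLM) do this inside the scheduler together with KV-cache management; this sketch only shows the admit-and-retire behavior.

```python
from collections import deque

def continuous_batching(waiting: deque, max_batch: int = 4) -> None:
    """Each iteration: admit requests into free slots, decode one token for every
    active request, and retire finished requests so their slots free up immediately."""
    active = {}  # request id -> tokens still to generate
    while waiting or active:
        # Admit new requests as soon as slots free up (no waiting for the batch to drain).
        while waiting and len(active) < max_batch:
            req_id, remaining_tokens = waiting.popleft()
            active[req_id] = remaining_tokens
        # One decode step: every active request advances by one token.
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] == 0:
                print(f"{req_id} finished")
                del active[req_id]

# Request lengths differ wildly; short ones finish and new ones slot in mid-flight.
continuous_batching(deque([("A", 3), ("B", 12), ("C", 2), ("D", 5), ("E", 1)]))
```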

3.4 Safety Layer

Safety isn't an add-on; it's inline.

  • Input Moderation: Check the prompt for jailbreaks and harmful content before spending expensive inference compute.

  • Output Moderation: As tokens stream, run lightweight classifiers. If a violation is detected, cut the stream immediately.
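
A sketch of inline output moderation on a token stream, assuming a lightweight is_violation classifier that scores the text generated so far. The classifier here is a trivial placeholder; in production it would be a fast learned model running alongside generation.

```python
def is_violation(text_so_far: str) -> bool:
    """Placeholder classifier; stands in for a fast learned safety model."""
    return "FORBIDDEN" in text_so_far

def moderated_stream(token_iter):
    """Pass tokens through as they are generated; cut the stream on the first violation."""
    generated = []
    for token in token_iter:
        generated.append(token)
        if is_violation("".join(generated)):
            yield "[stream terminated by safety filter]"
            return
        yield token

print(list(moderated_stream(["Here", " is", " FORBIDDEN", " content"])))
```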


4. Failure Handling: Embracing Chaos

In a fleet of 5,000 GPUs, hardware fails every hour.

4.1 "Bad Node" Detection

  • Symptom: A GPU node returns timeouts or garbage output.

  • Mitigation: The Orchestrator tracks per-node success rates. If a node fails N recent requests, it is cordoned (a minimal tracker sketch follows this list).

  • Drain & Heal: Kubernetes drains the pod, reboots the node, or provisions a replacement.
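
A minimal sketch of the success-rate tracking behind cordoning, assuming a sliding window of recent outcomes per node and a fixed failure-rate threshold; the actual drain-and-heal action is delegated to the cluster manager and only hinted at here.

```python
from collections import defaultdict, deque

WINDOW = 50             # recent requests tracked per node (assumed)
MAX_FAILURE_RATE = 0.2  # cordon once 20% of recent requests fail (assumed)

class NodeHealthTracker:
    def __init__(self):
        self.history = defaultdict(lambda: deque(maxlen=WINDOW))
        self.cordoned = set()

    def record(self, node: str, success: bool) -> None:
        window = self.history[node]
        window.append(success)
        failure_rate = 1 - sum(window) / len(window)
        if len(window) >= 10 and failure_rate > MAX_FAILURE_RATE and node not in self.cordoned:
            self.cordoned.add(node)
            print(f"cordoning {node}: failure rate {failure_rate:.0%}")  # hand off to drain & heal

tracker = NodeHealthTracker()
for _ in range(10):
    tracker.record("gpu-node-7", success=False)  # a node returning timeouts or garbage
```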

4.2 Handling Overload (Load Shedding)

  • The Issue: Queue builds up. Latency skyrockets.

  • Bad Solution: Let the queue grow to infinity.

  • Good Solution (Load Shedding): If the estimated wait time in queue > 10 seconds, reject the request immediately with 503 Service Unavailable.

  • Why? It's better to fail fast than to make everyone wait 2 minutes for a timeout.
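
A sketch of queue-depth-based load shedding, assuming we can roughly estimate per-request service time and that the API layer translates the rejection into a 503. The 10-second budget comes from the rule of thumb above; the per-request estimate is an assumption.

```python
MAX_WAIT_SECONDS = 10.0          # shed load once the estimated queue wait exceeds this budget
EST_SECONDS_PER_REQUEST = 0.5    # rough average service time per queued request (assumption)

class SheddingQueue:
    def __init__(self):
        self.pending = []

    def estimated_wait(self) -> float:
        return len(self.pending) * EST_SECONDS_PER_REQUEST

    def submit(self, request: dict) -> int:
        """Return an HTTP-style status: 202 accepted, 503 shed."""
        if self.estimated_wait() > MAX_WAIT_SECONDS:
            return 503           # fail fast instead of letting everyone time out
        self.pending.append(request)
        return 202

q = SheddingQueue()
statuses = [q.submit({"prompt": f"req-{i}"}) for i in range(25)]
print(statuses.count(202), "accepted,", statuses.count(503), "shed")
```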

4.3 Request Retries

  • Idempotency: Retried inference requests must not be double-executed or double-billed; an idempotency key lets the server deduplicate safe retries.

  • Jitter: Clients must retry with exponential backoff and jitter to prevent thundering herd problems during recovery.
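
A client-side retry sketch with exponential backoff and full jitter, assuming the wrapped call raises an exception on retryable failures (e.g. 503 or 429). The base delay and cap are illustrative; send_inference_request is a hypothetical placeholder for the real API call.

```python
import random
import time

def retry_with_jitter(call, max_attempts: int = 5, base_delay: float = 0.5, max_delay: float = 30.0):
    """Retry a failing call with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap, so recovering
            # servers are not hammered by a synchronized thundering herd of retries.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))

# Usage sketch (send_inference_request is hypothetical):
# result = retry_with_jitter(lambda: send_inference_request(prompt="hello"))
```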


5. Summary

Designing for scale means obsessing over efficiency.

  1. Batching allows us to serve thousands of concurrent users on limited hardware.

  2. KV Caching (PagedAttention) creates the memory efficiency needed for long context.

  3. Decoupling ensures that a slow model doesn't block the API.

  4. Safety is integrated at the protocol level.

By layering these components—Gateway, Orchestrator, Model Engine, and Safety—we build a system that feels instantaneous to the user, even while doing massive matrix multiplications in the background.
