System Design: Text-to-Video Generation Pipeline (Sora-like)

From a Single GPU Worker to Preemption-Resilient Distributed Rendering: A Staff Engineer's Guide


1. The Problem & Why It's Hard

You're asked to design the backend job scheduling and processing pipeline for a text-to-video AI service like OpenAI's Sora. Users submit text prompts, and the system generates videos – a process that takes minutes per request, consumes an entire GPU per job, and runs on a fleet of spot instances that can vanish at any moment.

On the surface, it's "just a job queue with GPU workers." The trap is thinking the hard part is execution. The hard part is that your compute infrastructure is actively hostile: spot instances get preempted mid-render, each worker can only process one video at a time, and a single 10-second video generation consumes GPU resources equivalent to thousands of ChatGPT queries ($0.50–$2.00 actual compute cost per video).

The interviewer's real question: Can you design a job scheduling system where workers are unreliable, expensive, and single-threaded, and still guarantee every user's video gets generated exactly once, with graceful handling of preemptions that can kill a 10-minute render at minute 9?

The constraints that make this different from a typical job queue:

  1. One worker = one job: A GPU worker can only process one video at a time. No batching, no multiplexing. This means your fleet size directly equals your concurrency.

  2. Workers die without warning: Spot instances get a 2-minute termination notice (AWS) or none at all (some providers). A 10-minute render killed at minute 9 is 9 minutes of GPU time wasted – at $3/hour per H100, that's $0.45 burned.

  3. Jobs are expensive and long-running: Unlike sub-second API calls, video generation takes 2–15 minutes. A retry doesn't cost milliseconds; it costs dollars and minutes of user wait time.

  4. GPU scarcity is real: H100s cost $3–4/hour on-demand. Spot instances save 60–70% but have 5–15% interruption rates. Your scheduler must optimize for cost while maintaining throughput.

Staff+ Signal: The deceptively hard part isn't the queue or the workers – it's checkpoint/resume. Without checkpointing, a spot preemption at minute 9 of a 10-minute render wastes 90% of the compute. With checkpointing every 60 seconds, you lose at most a minute of work. But checkpointing a diffusion model's intermediate state to S3 takes 5–15 seconds and consumes memory bandwidth, so checkpoint frequency is a direct trade-off between wasted compute on preemption and overhead during normal execution. Most candidates never mention this. Staff+ candidates design the checkpoint interval as a function of spot interruption probability.


2. Requirements & Scope

Functional Requirements

  • Prompt submission: Users submit text prompts to generate videos (with optional parameters: duration, resolution, aspect ratio, style)

  • Async processing: Return a job ID immediately; user polls or receives a webhook when complete

  • Job queue with priority: Support priority tiers (paid users, free tier, internal)

  • Progress tracking: Real-time progress updates (percentage complete, current frame)

  • Result delivery: Generated video stored in object storage, downloadable via signed URL

  • Retry on failure: Automatic retry on worker crash or spot preemption (with checkpoint resume)

  • Cancellation: Users can cancel in-progress jobs

  • Rate limiting: Per-user and per-tier job submission limits

Non-Functional Requirements

| Requirement | Target | Rationale |
|---|---|---|
| Job pickup latency | < 30s from submission | User should see "processing" quickly, not sit in unknown state |
| Generation throughput | 100K videos/day | Mid-scale service (Sora reportedly handles millions) |
| Availability | 99.9% (job acceptance) | Users must always be able to submit; processing can queue |
| Video generation SLA | < 5 min for 10s video (p95) | Competitive with Sora/Runway |
| Preemption recovery | < 60s to resume from checkpoint | Minimize wasted GPU time |
| Exactly-once generation | No duplicate or lost videos | Double-generation wastes expensive GPU time |

Scale Estimation (Back-of-Envelope)

Why these numbers matter for the interview:

  • 1,042 concurrent GPUs means you need a large fleet with autoscaling, not a static pool

  • $1.2M/month compute cost means spot instance optimization is a business-critical decision, not a nice-to-have

  • 1M checkpoint writes/day at 1GB each means S3 is the only viable checkpoint store (not Redis, not local disk)

  • 14 writes/sec for job state is manageable for a single PostgreSQL instance – the bottleneck is GPU scheduling, not the database
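The arithmetic behind these figures can be reconstructed roughly. The per-job constants below (average GPU minutes, state writes, and checkpoints per job) are assumptions chosen to be consistent with the figures above, not numbers from the original estimate:

```python
# Rough reconstruction of the back-of-envelope scale estimate.
VIDEOS_PER_DAY = 100_000
AVG_GPU_MINUTES = 15            # assumption: mean of the 2-15 min range
STATE_WRITES_PER_JOB = 12       # assumption: submit, lease, progress, complete...
CHECKPOINTS_PER_JOB = 10        # assumption: 60s interval, most jobs < 15 min

# Concurrency: total GPU-minutes per day spread over 1,440 minutes.
concurrent_gpus = VIDEOS_PER_DAY * AVG_GPU_MINUTES / (24 * 60)

# Job-state write rate against PostgreSQL.
db_writes_per_sec = VIDEOS_PER_DAY * STATE_WRITES_PER_JOB / 86_400

# Checkpoint objects written to S3 per day.
checkpoint_writes_per_day = VIDEOS_PER_DAY * CHECKPOINTS_PER_JOB

print(round(concurrent_gpus), round(db_writes_per_sec), checkpoint_writes_per_day)
# → 1042 14 1000000
```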

Staff+ Signal: The most important derived number is the cost of wasted compute from preemptions. With 730 spot instances at a 10% interruption rate, ~73 jobs get preempted per hour. Without checkpointing, each wastes an average of 2.5 minutes of GPU time ($0.15 each). That's $262/day wasted. With 60-second checkpoints, waste drops to ~$0.03 per preemption – $53/day. The checkpoint system pays for itself in 2 days. This is the kind of math that makes an interviewer's eyes light up.


3. Phase 1: Single GPU Worker

Start with the simplest thing that works: one machine, one GPU, one job at a time.

How It Works
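The original diagram is missing here; a minimal sketch of the Phase 1 loop, with a placeholder standing in for the diffusion pipeline, is:

```python
from collections import deque

def generate_video(prompt: str) -> bytes:
    # Stand-in for the real diffusion pipeline: in reality this is minutes
    # of GPU work, with the model held warm in VRAM between jobs.
    return f"[video] {prompt}".encode()

def run_worker(queue: deque, results: dict) -> None:
    # Phase 1 in its entirety: one process, one GPU, one job at a time,
    # strict FIFO, no leases, no heartbeats, no checkpoints.
    while queue:
        job_id, prompt = queue.popleft()
        results[job_id] = generate_video(prompt)

jobs = deque([("job-1", "a cat surfing"), ("job-2", "sunrise timelapse")])
results: dict = {}
run_worker(jobs, results)
```

Everything that follows in this guide exists to replace this loop's single point of failure with a fleet.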

What Works at This Scale

  • Dead simple: No distributed systems. One process, one GPU. Easy to debug.

  • No coordination overhead: No leases, no heartbeats, no race conditions.

  • Model stays warm: Since there's only one worker, the model stays loaded in GPU VRAM between jobs. No cold start after the first job.

  • Handles ~288 videos/day: At 5 min/video, one GPU processes 12/hour × 24 = 288 videos/day.

When Does Phase 1 Work?

Internal tool, prototype, or very small user base (< 300 videos/day). Perfect for validating the model and API contract before investing in distributed infrastructure.


4. Why Naive Fails (The Math)

Problem 1: Throughput Ceiling

Adding more GPUs means distributed scheduling. The moment you have >1 worker, you need a queue, job assignment, and failure handling.

Problem 2: Spot Preemption Destroys Work

At scale with 730 spot instances and 10% hourly interruption rate:

| Metric | No Checkpoint | 60s Checkpoint | 2-min Checkpoint |
|---|---|---|---|
| GPU hours wasted/day | 182 hours | ~15 hours | ~29 hours |
| Cost wasted/day | $637 | ~$53 | ~$102 |
| Extra user wait time (avg) | +4 min | +1 min | +2.5 min |
| Checkpoint overhead | 0% | ~8% (5s save / 60s interval) | ~4% |

Problem 3: No Priority or Fairness

A free-tier user submitting 100 videos can block a paying customer's single urgent request. Without priority queues, first-come-first-served treats all jobs equally, which is unacceptable for a paid product.

Problem 4: GPU Cold Starts

When a new GPU worker spins up, it must load the model into VRAM β€” a process that takes 30–90 seconds for large diffusion models. If you're rapidly scaling up during peak hours, users see an extra minute of delay. This is worse on spot instances, which are provisioned fresh each time.

Staff+ Signal: The checkpoint overhead math is the key insight. Checkpointing every 60 seconds adds ~8% overhead to normal execution (a 5-second save for every 60 seconds of compute). But it reduces wasted compute by over 90% during preemptions. The break-even point: if preemption probability per job exceeds ~3% (which it does on spot instances), checkpointing is strictly better. Most candidates either skip checkpointing entirely or add it without quantifying the trade-off.


5. Phase 2: Distributed Architecture

The key architectural insight: Treat GPU workers as unreliable, stateless execution units that lease jobs and checkpoint progress to durable storage, so any worker can resume any job from its last checkpoint.

How Real Companies Built This

Runway + Anyscale (Ray): Runway built Gen-3 Alpha on Anyscale's managed Ray platform. Their initial approach used KubeRay on Kubernetes, but it broke down when 4–5 researchers ran concurrent jobs: "researchers would accidentally set up their resources incorrectly and accidentally mess up someone else's job." They moved to Anyscale for managed multi-tenancy, achieving 13x faster model loading and an 85% reduction in pipeline deployment time. Key insight: they separate preprocessing onto CPU pools and reserve GPUs strictly for inference/training, never mixing workloads on the same GPU.

AMD ROCm Video Generation Serving: AMD's reference architecture uses Redis as the job queue (RPUSH for submission, BLPOP for workers) with per-job response queues (job_resp:<job_id>). Rank 0 of a Torchrun distributed group fetches jobs from Redis and broadcasts to all ranks. The separation of API server, Redis queue, and Torchrun workers allows independent scaling of each tier – a pattern directly applicable to our design.

Replicate: Replicate manages warm/cold model states across their GPU fleet. Popular models are kept "warm" (loaded in VRAM) for instant startup. Less-used models go "cold" and require 30–60 second cold starts. Custom models run in private containers with up to 1-minute cold boots. Their key insight: model warmth is a scheduling dimension – route jobs to workers that already have the model loaded.

AWS Spot Instance Best Practices: AWS provides a 2-minute interruption notice via EC2 metadata and EventBridge. Best practice is to catch the SIGTERM signal, checkpoint immediately, and gracefully shut down. For GPU workloads, AWS recommends checkpointing every 15 minutes for training, but video generation jobs (shorter, higher value per minute) benefit from more frequent checkpoints.

Key Data Structure: Job State Machine

Each job record carries:

  • checkpoint_uri: S3 path to latest checkpoint (null if none)

  • checkpoint_step: Diffusion step number at last checkpoint

  • lease_expires_at: Heartbeat-based TTL

  • worker_id: Which worker currently holds the lease

  • attempt: Current attempt number (for idempotency)
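The states themselves aren't enumerated above, so the set below (queued, running, completed, failed, canceled) is an assumption consistent with the rest of this design; the key property is that preemption or lease expiry sends a running job back to queued so any worker can resume it:

```python
from enum import Enum

class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"

# Allowed transitions (assumed set). RUNNING -> QUEUED is the preemption /
# lease-expiry path; terminal states allow no further transitions.
TRANSITIONS = {
    JobState.QUEUED: {JobState.RUNNING, JobState.CANCELED},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED,
                       JobState.CANCELED, JobState.QUEUED},
    JobState.COMPLETED: set(),
    JobState.FAILED: set(),
    JobState.CANCELED: set(),
}

def transition(current: JobState, target: JobState) -> JobState:
    # Reject anything not in the table, e.g. resurrecting a completed job.
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```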


6. Core Component Deep Dives

6.1 Priority Queue (Redis Sorted Set)

The queue must support priority ordering, not just FIFO. A paying customer's job should jump ahead of free-tier jobs.

Implementation: Redis sorted set with composite score:
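The composite score itself (using the (priority_tier × 1B) + timestamp scheme described later in this guide) can be sketched as a pure function; the redis-py calls are shown as comments so the sketch runs without a server:

```python
TIER = {"internal": 0, "paid": 1, "free": 2}   # lower band = higher priority

def queue_score(tier: str, submitted_at: float) -> int:
    # (priority_tier x 1B) + unix timestamp. The band dominates as long as
    # submission times within a band spread less than 1e9 seconds (~31 years),
    # so priority always wins and the timestamp breaks ties within a tier.
    return TIER[tier] * 1_000_000_000 + int(submitted_at)

# Hypothetical redis-py usage (not executed here):
#   r.zadd("job_queue", {job_id: queue_score("paid", time.time())})
#   job_id, score = r.zpopmin("job_queue")[0]   # lowest score = next job
```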

Worker pulls jobs: ZPOPMIN job_queue atomically removes and returns the highest-priority (lowest-score) job.

Why not Kafka? Kafka provides ordering within partitions but not cross-partition priority. You'd need separate topics per priority level with weighted consumption – more complex than a sorted set. Redis is sufficient at our scale (14 writes/sec is trivial for Redis).

6.2 Job Scheduler (Lease Manager)

The scheduler is responsible for two things: assigning jobs to workers and detecting dead workers.

Job acquisition flow:

Lease renewal (heartbeat):

Workers call /heartbeat every 60 seconds:

Lease reaper (detects dead workers):

A cron job runs every 30 seconds:
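The acquisition, heartbeat, and reaper flows can be sketched together. This in-memory model (a dict standing in for the jobs table) mirrors the conditional-UPDATE semantics you would use in PostgreSQL; names and record fields are illustrative:

```python
LEASE_TTL = 300.0    # 5-minute lease
jobs: dict = {}      # job_id -> record; stands in for the jobs table

def acquire(job_id: str, worker_id: str, now: float) -> bool:
    # Conditional update: take the job only if it's queued, or its lease has
    # expired (roughly: UPDATE jobs SET ... WHERE status = 'queued'
    # OR lease_expires_at < now()).
    job = jobs[job_id]
    if job["status"] == "queued" or (
            job["status"] == "running" and job["lease_expires_at"] < now):
        job.update(status="running", worker_id=worker_id,
                   lease_expires_at=now + LEASE_TTL,
                   attempt=job["attempt"] + 1)
        return True
    return False

def heartbeat(job_id: str, worker_id: str, now: float) -> bool:
    # Renew only if this worker still holds the lease.
    job = jobs[job_id]
    if job["status"] == "running" and job["worker_id"] == worker_id:
        job["lease_expires_at"] = now + LEASE_TTL
        return True
    return False

def reap(now: float) -> list:
    # Cron-style reaper: re-queue jobs whose lease expired (ungraceful death).
    expired = [j for j, job in jobs.items()
               if job["status"] == "running" and job["lease_expires_at"] < now]
    for j in expired:
        jobs[j].update(status="queued", worker_id=None)
    return expired
```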

Why 5-minute lease TTL? Video generation jobs are long-running (2–15 minutes). A 5-minute TTL with 60-second heartbeats gives 5 missed heartbeats before a job is considered abandoned. This balances fast detection (don't wait 30 minutes) with resilience to transient network issues (don't false-positive on a 2-second blip).

Staff+ Signal: The lease TTL must be tuned to the preemption notice window. AWS gives a 2-minute warning before spot termination. If your lease TTL is 5 minutes and the worker receives a SIGTERM, it has 2 minutes to: (1) checkpoint current state to S3, (2) release the lease explicitly, and (3) shut down gracefully. If the worker uses the full 2 minutes for checkpointing and graceful shutdown, the job is re-queued immediately, with no waiting for lease expiry. The 5-minute TTL is the fallback for ungraceful deaths (kernel panic, network partition, OOM kill) where no SIGTERM is received.

6.3 GPU Worker (Checkpoint/Resume Engine)

The worker is the most complex component. It must:

  1. Pull a job from the scheduler

  2. Check for existing checkpoint (resume vs. fresh start)

  3. Execute diffusion steps with periodic checkpointing

  4. Handle SIGTERM gracefully (spot preemption)

  5. Upload final video to S3

Checkpoint format:
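The original format listing is missing; a plausible envelope, with field names as assumptions (the heavy tensors travel as opaque bytes, and metadata lets any worker validate and resume), looks like:

```python
import hashlib

def make_checkpoint(job_id: str, attempt: int, step: int,
                    latents: bytes, rng_state: bytes) -> dict:
    # Hypothetical checkpoint envelope. Intermediate latents plus
    # scheduler/RNG state are what a resumed worker needs to continue
    # mid-denoising; the hash lets a restorer detect corruption.
    body = latents + rng_state
    return {
        "format_version": 1,
        "job_id": job_id,
        "attempt": attempt,             # guards against stale-attempt restores
        "diffusion_step": step,         # matches checkpoint_step in the job row
        "payload_bytes": len(body),
        "sha256": hashlib.sha256(body).hexdigest(),  # detects partial writes
    }

ckpt = make_checkpoint("job-42", 1, 30, b"\x00" * 1024, b"\x01" * 64)
```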

Checkpoint interval calculation:
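One standard starting point is the Young/Daly approximation from HPC checkpointing, T_opt ≈ sqrt(2 × C × MTBF). Note it optimizes GPU-seconds only; with a 5-second save and a 10%/hour interruption rate it suggests a ~10-minute interval, so the 60-second interval used in this design is deliberately more aggressive, because a preempted user also eats the recomputation as extra wait time:

```python
import math

def young_daly_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    # Young/Daly: T_opt ~ sqrt(2 * C * MTBF) minimizes checkpoint overhead
    # plus expected recomputation after a failure.
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# 10%/hour interruption rate => mean time between interruptions ~ 10 hours.
t_opt = young_daly_interval(checkpoint_cost_s=5.0, mtbf_s=10 * 3600)
print(t_opt)   # 600.0 seconds, i.e. ~10 minutes
```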

6.4 Notification Service (Progress + Completion)

Users need real-time progress updates and completion notifications.

SSE for progress streaming:

Webhook for async consumers:
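The payload shapes for both channels were lost here; a minimal sketch (field names are assumptions) of an SSE progress frame and a webhook body:

```python
import json
from typing import Optional

def sse_event(job_id: str, status: str, percent: Optional[int]) -> str:
    # One Server-Sent Events frame: an "event:" line plus a "data:" payload,
    # terminated by a blank line. A "resuming_from_checkpoint" status is how
    # the UI can explain non-monotonic progress after a preemption.
    payload = {"job_id": job_id, "status": status, "percent": percent}
    return f"event: progress\ndata: {json.dumps(payload)}\n\n"

def webhook_body(job_id: str, status: str, video_url: Optional[str]) -> bytes:
    # JSON body POSTed to the customer's registered callback on completion.
    return json.dumps({"job_id": job_id, "status": status,
                       "video_url": video_url}).encode()
```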

Implementation: The notification service watches the jobs table via CDC (PostgreSQL WAL → Debezium → Kafka topic). When a job status changes, it pushes to connected SSE clients and fires webhooks.

Staff+ Signal: Progress reporting has a subtle UX challenge. During a spot preemption, progress jumps backward: from 85% to "re-queuing" to 0% to 85% (after checkpoint restore). Most candidates design linear progress bars. Staff+ candidates design for non-monotonic progress with a state like "resuming from checkpoint at 85%" that explains the discontinuity. Sora's UI shows "processing" without a percentage for exactly this reason: it avoids the jarring backward jump.


7. The Scaling Journey

Stage 1: Single Worker (0–300 videos/day)

One H100, one process. Model stays warm in VRAM. No checkpointing needed (no spot instances at this scale; use on-demand). Cost: ~$2,500/month.

Limit: Throughput ceiling at 288 videos/day. No fault tolerance.

Stage 2: Worker Pool + Redis Queue (300–10K videos/day)

New capabilities:

  • Redis sorted set for priority queuing

  • Multiple GPU workers polling for jobs

  • PostgreSQL for job state and audit trail

  • S3 for video output storage

  • Basic heartbeat (5-min lease TTL)

Limit: On-demand GPUs only; cost is $0.97/video. At 10K videos/day, that's $9,700/day ($291K/month). No checkpointing, so a worker crash means a full restart.

Stage 3: Spot Instances + Checkpoint/Resume (10K–100K videos/day)

New capabilities at this stage:

  • 70/30 spot/on-demand GPU split (saves ~50% vs. all on-demand)

  • Checkpoint/resume: intermediate latents saved to S3 every 60 seconds

  • SIGTERM handler: graceful checkpoint on spot preemption (2-min warning)

  • Lease reaper: detects ungraceful worker deaths, re-queues with checkpoint

  • Priority tiers: paid users jump the queue

  • Autoscaling: scale spot fleet based on queue depth

Cost at this stage:
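The original cost figures are missing here; a hedged reconstruction from the numbers stated earlier ($3–4/hr on-demand, 60–70% spot savings, ~1,042 concurrent GPUs) lands in the same ballpark as the monthly compute cost quoted in the scale estimate:

```python
# Assumed mid-range prices; the fleet size comes from the scale estimate.
ON_DEMAND_HR = 3.50      # assumption: middle of the $3-4/hr H100 range
SPOT_DISCOUNT = 0.65     # assumption: middle of the 60-70% savings range
FLEET = 1042
SPOT_SHARE = 0.70        # the 70/30 spot/on-demand split
HOURS_PER_MONTH = 24 * 30

spot_hr = ON_DEMAND_HR * (1 - SPOT_DISCOUNT)             # ~$1.23/hr
blended_hr = SPOT_SHARE * spot_hr + (1 - SPOT_SHARE) * ON_DEMAND_HR
monthly = FLEET * blended_hr * HOURS_PER_MONTH           # ~$1.4M/month
all_on_demand = FLEET * ON_DEMAND_HR * HOURS_PER_MONTH   # ~$2.6M/month
savings = all_on_demand - monthly                        # ~$1.2M/month saved
```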

Limit: Single-region. Single PostgreSQL. No model warmth optimization (cold starts on new spot instances).

Stage 4: Multi-Region Enterprise (100K+ videos/day)

Everything in Stage 3, plus:

  • Model cache layer: Pre-load model weights on persistent NVMe volumes. New spot instances mount the volume instead of downloading from S3 (cold start drops from 60s to 5s)

  • Multi-region deployment: GPU fleet in us-east-1, us-west-2, eu-west-1 with geo-routed job assignment

  • Tiered SLA: Enterprise customers get dedicated on-demand GPU pools with guaranteed capacity

  • Cost optimization: Predictive spot pricing β€” shift non-urgent jobs to regions/times with cheapest spot rates

  • A/B model routing: Route jobs to workers running different model versions for quality comparison

Staff+ Signal: At Stage 4, the team structure matters as much as the architecture. The GPU fleet should be owned by an infrastructure team. The scheduler and job state by a platform team. The model loading and inference by an ML engineering team. The API and user experience by a product team. Drawing these boundaries wrong means four teams coordinate on every spot preemption incident. The scheduler/checkpoint system is the critical interface – it must be owned by a single team with clear SLOs for job completion rate and preemption recovery time.


8. Failure Modes & Resilience

Request Flow with Failure Handling

Failure Scenarios

| Failure | Detection | Recovery | Blast Radius |
|---|---|---|---|
| Spot preemption (graceful) | SIGTERM signal | Checkpoint to S3, release lease, re-queue immediately | Single job; loses ≤2 min of progress |
| Worker OOM kill | Lease expires (5 min) | Reaper re-queues; resumes from last checkpoint | Single job; loses up to 60s of progress + 5 min detection delay |
| Worker network partition | Heartbeat fails; lease expires | Re-queued after 5 min; original worker may still be running (lease check on completion prevents double-write) | Single job; brief duplicate execution possible |
| Redis failure | Connection errors | Workers fall back to polling PostgreSQL directly (degraded mode) | All new job scheduling pauses; running jobs continue |
| PostgreSQL failure | Connection pool errors | Failover to synchronous replica; workers buffer heartbeats in memory | New job creation pauses; running jobs continue (workers have job payload in memory) |
| S3 checkpoint failure | Upload error, timeout | Retry with exponential backoff; if persistent, worker logs warning and continues without checkpoint | Increased risk: if preempted before next successful checkpoint, more progress lost |
| Model corruption in VRAM | NaN outputs, CUDA errors | Worker kills itself, job re-queued from checkpoint | Single job; worker restarts with fresh model load |
| Checkpoint corruption | Restore fails (bad tensor) | Fall back to previous checkpoint; if none valid, restart from scratch | Single job; worst case full restart |

The Duplicate Execution Problem

The most subtle failure mode: Worker A is running Job #42, experiences a network partition (can't heartbeat), the lease expires, and Worker B picks up the job from checkpoint. Meanwhile, Worker A's network recovers and it finishes the job. Now both workers try to upload the final video.

Solution: Optimistic locking on job completion:

Worker A's completion fails (its worker_id no longer matches), and it discards its result. Worker B completes normally.

Staff+ Signal: The most operationally dangerous scenario isn't a single worker failure – it's a fleet-wide spot reclamation. Cloud providers sometimes reclaim large numbers of spot instances simultaneously (e.g., when a large customer requests on-demand capacity). If 70% of your fleet disappears in 2 minutes, you have 700+ jobs checkpointing to S3 simultaneously, creating a write storm. Your S3 request rate might spike from 100/sec to 10,000/sec, hitting S3's per-prefix rate limits (3,500 PUT/sec). The fix: partition checkpoints across multiple S3 prefixes (s3://checkpoints/{job_id_prefix}/{job_id}/) and use S3 Express One Zone for low-latency writes.


9. Data Model & Storage

Core Tables (PostgreSQL)
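The DDL was lost here; a minimal sketch of the jobs table and its audit trail, with column names assumed to match the job-record fields from section 5 (sqlite3 stands in for PostgreSQL so the sketch is runnable):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (
    id               TEXT PRIMARY KEY,
    user_id          TEXT NOT NULL,
    prompt           TEXT NOT NULL,
    priority_tier    INTEGER NOT NULL DEFAULT 2,
    status           TEXT NOT NULL DEFAULT 'queued',
    attempt          INTEGER NOT NULL DEFAULT 0,
    worker_id        TEXT,
    lease_expires_at REAL,
    checkpoint_uri   TEXT,
    checkpoint_step  INTEGER,
    video_uri        TEXT,
    created_at       REAL NOT NULL
);
-- The reaper's hot path: running jobs with expired leases.
CREATE INDEX idx_jobs_status_lease ON jobs (status, lease_expires_at);

-- Append-only audit trail of every state transition.
CREATE TABLE job_events (
    job_id     TEXT NOT NULL REFERENCES jobs (id),
    at         REAL NOT NULL,
    from_state TEXT,
    to_state   TEXT NOT NULL
);
""")
conn.execute(
    "INSERT INTO jobs (id, user_id, prompt, created_at) VALUES (?, ?, ?, ?)",
    ("job-42", "u1", "a cat surfing", 1700000000.0))
```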

Checkpoint Storage (S3)

Video Storage (S3 + CDN)
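The key layouts for both buckets can be sketched together. The exact layout is an assumption following the prefix-sharding pattern described in the failure-modes section (a short hash prefix spreads a fleet-wide checkpoint write storm across many S3 prefixes, each with its own PUT rate limit):

```python
import hashlib

def checkpoint_key(job_id: str, step: int) -> str:
    # 2 hex chars = 256 shards; shard is derived from the job id so all
    # checkpoints for one job stay under one predictable prefix.
    shard = hashlib.sha256(job_id.encode()).hexdigest()[:2]
    return f"checkpoints/{shard}/{job_id}/step-{step:06d}.ckpt"

def video_key(job_id: str) -> str:
    # Final renders live under a flat, CDN-friendly prefix; signed URLs
    # are generated against this key.
    return f"videos/{job_id}/final.mp4"
```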

Storage Engine Choice

| Engine | Role | Why |
|---|---|---|
| PostgreSQL | Job state, user accounts, audit log | ACID for job state transitions; optimistic locking for lease management. 14 writes/sec is trivial. |
| Redis | Priority queue (sorted set) | Atomic ZPOPMIN for job acquisition; sub-millisecond latency; simple to operate at this write rate |
| S3 | Checkpoints, final videos, model weights | Virtually unlimited storage; 11 nines durability; S3 Express One Zone for low-latency checkpoint writes |
| S3 + CloudFront | Video delivery to users | CDN-backed signed URLs for fast download; geo-distributed edge caching |


10. Observability & Operations

Key Metrics

  • videogen_queue_depth{tier, priority} – pending jobs per tier; the primary autoscaling signal

  • videogen_queue_wait_seconds{tier, quantile} – time from submission to worker pickup; the metric users feel most

  • videogen_generation_seconds{resolution, quantile} – actual GPU generation time; helps capacity planning

  • videogen_checkpoint_duration_seconds{quantile} – checkpoint save time; if growing, S3 may be throttled

  • videogen_preemptions_total{region} – spot preemption count; spikes indicate AWS capacity pressure

  • videogen_preemption_recovery_seconds{quantile} – time from preemption to resumed execution; SLO target < 60s

  • videogen_lease_expirations_total – ungraceful worker deaths; spikes mean OOM or infrastructure issues

  • videogen_gpu_utilization{worker, gpu_id} – GPU compute utilization; should be >90% during execution

  • videogen_wasted_gpu_seconds_total – GPU time lost to preemptions (before checkpoint); the cost metric

  • videogen_estimated_drain_time_seconds{tier} – (queue_depth × avg_generation_time) / workers; the best single health metric

Distributed Tracing

A full trace for a single video generation:
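The trace itself was lost here; a hypothetical span layout with illustrative durations (span names and timings are assumptions, not measurements) gives the shape:

```python
# Hypothetical spans for one generation; durations are illustrative only.
spans = [
    ("api.submit",          0.05),   # validate, insert job row, enqueue
    ("queue.wait",         12.0),    # time until a worker frees up
    ("scheduler.acquire",   0.02),   # conditional UPDATE takes the lease
    ("worker.model_check",  0.5),    # model already warm: no 60s cold start
    ("worker.generate",   250.0),    # diffusion steps on the GPU
    ("worker.checkpoint",  20.0),    # 4 saves x ~5s, serialized here
    ("worker.upload",       6.0),    # final MP4 to S3
    ("notify.webhook",      0.3),    # completion callback
]
end_to_end = sum(d for _, d in spans)   # ~289s, inside the 5-min p95 SLA
```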

Alerting Strategy

| Alert | Condition | Severity | Action |
|---|---|---|---|
| Queue drain time > 10 min | Sustained for 15 min | P2 | Scale GPU fleet; check spot availability |
| Preemption rate > 20%/hour | Spike detection | P1 | Shift to more on-demand capacity; check AWS spot advisor |
| Checkpoint save > 30s | p95 sustained 10 min | P2 | Check S3 throttling; verify prefix sharding |
| Lease expirations > 5/min | Sustained 5 min | P1 | Worker fleet issue: OOM kills, CUDA errors, or network partition |
| GPU utilization < 50% | Any worker sustained 5 min | P3 | Worker may be stuck; investigate CUDA hang |
| Wasted GPU hours > 20/day | Daily aggregate | P3 | Review checkpoint frequency; investigate preemption patterns |
| Job completion rate < 95% | Per-hour measurement | P1 | Systematic failure; check model, S3, or infrastructure |

Staff+ Signal: The most actionable on-call dashboard isn't per-metric – it's a cost efficiency heatmap showing GPU utilization × preemption waste × queue depth per region. A region with high GPU utilization, low waste, and low queue depth is healthy. A region with low utilization means cold start problems (model not loaded). A region with high waste means checkpoint frequency needs tuning. A region with high queue depth means scaling lag. Each combination points to a different root cause and different runbook entry.


11. Design Trade-offs

| Decision | Option A | Option B | Recommended | Why |
|---|---|---|---|---|
| Queue backend | Redis Sorted Set | Kafka with priority topics | Redis | At 14 writes/sec, Redis is simpler and supports true priority ordering with ZPOPMIN. Kafka would require separate topics per priority with weighted consumption. Two-way door – migrate to Kafka if >10K jobs/sec. |
| Checkpoint storage | S3 Standard | S3 Express One Zone | S3 Express for hot checkpoints | Express One Zone has single-digit ms latency vs. 50ms for Standard. Checkpoints are written every 60s, so latency matters. 10x more expensive per GB, but checkpoints are short-lived (deleted after 24h). |
| Worker model | Pull (worker polls for jobs) | Push (scheduler dispatches to workers) | Pull | Workers self-register by polling. Scheduler doesn't need to track which workers are alive. Adding capacity = launching more workers. No server-side worker registry. |
| Spot vs. on-demand ratio | 90/10 spot-heavy | 50/50 balanced | 70/30 | 90% spot saves more but increases preemption risk during fleet-wide reclamations. 50/50 wastes money. 70/30 balances cost savings (~$640K/month) with resilience. Adjustable per region. |
| Checkpoint frequency | Fixed interval (60s) | Adaptive (based on spot interruption probability) | Fixed 60s for v1 | Fixed is simpler to reason about and debug. Adaptive is better in theory (checkpoint more often when the spot market is volatile) but requires real-time spot pricing feeds and adds complexity. Ship fixed, iterate to adaptive. |
| Job assignment | Random worker | Affinity-based (prefer workers with model loaded) | Affinity-based | Avoids a 30–60s cold start. If no warm worker is available, assign to any worker. A cold start is better than queue delay. |
| Progress reporting | Polling (client polls GET /status) | Push (SSE/WebSocket) | Both | SSE for interactive users (watching in browser). Polling for API consumers (simpler integration). Webhook for async consumers. |
Staff+ Signal: The spot/on-demand ratio is a one-way door during peak hours. If you're running 70% spot and AWS reclaims half your spot fleet, you're suddenly running at 65% capacity with a full queue. Scaling up on-demand takes 2–5 minutes (instance launch + model load). During that window, your queue grows by ~200 jobs. The mitigation is to maintain a reserve pool of pre-warmed on-demand instances at 10% of fleet size that can absorb spot reclamation shocks instantly. This costs an extra $120K/month but prevents SLA violations during fleet-wide preemption events – a trade-off that only makes sense at >$500K/month total spend.


12. Common Interview Mistakes

  1. Treating GPU workers like web servers: "I'll use a load balancer to distribute requests across GPUs." → GPUs process one job at a time for 5 minutes. This isn't HTTP request routing; it's job scheduling. A load balancer sends the next request to any available server. A job scheduler must track which workers are busy, handle 5-minute leases, and manage checkpoints. Staff+ answer: "This is a job queue pattern, not a request routing pattern. Workers pull from a priority queue when they're free."

  2. Ignoring spot preemption entirely: "Workers run the job and report results." → What happens when the worker dies at 90% completion? Staff+ answer: "Workers checkpoint intermediate state to S3 every 60 seconds. On preemption, the SIGTERM handler saves one final checkpoint. The re-queued job resumes from the last checkpoint on a new worker."

  3. No priority differentiation: "All jobs go in one FIFO queue." → A paying customer's job shouldn't wait behind 1,000 free-tier jobs. Staff+ answer: "Redis sorted set with composite score: (priority_tier × 1B) + timestamp. Paid jobs always dequeue before free-tier jobs submitted earlier."

  4. Designing synchronous video generation: "The API generates the video and returns it in the response." → A 5-minute generation blocks the HTTP connection. Staff+ answer: "Return 202 Accepted with a job_id immediately. Client polls for status or subscribes to SSE/webhook for completion notification."

  5. No cost awareness: "We'll use on-demand H100s for everything." → At 100K videos/day, that's $2.9M/month vs. $1.2M/month with spot optimization. Staff+ answer: "A 70/30 spot/on-demand split with checkpoint/resume saves $1.7M/month. The checkpoint system pays for itself in 2 days."

  6. Forgetting model cold starts: "A new worker picks up a job and runs it." → Loading a diffusion model into GPU VRAM takes 30–90 seconds. If you're scaling up 100 workers during peak, every user waits an extra minute. Staff+ answer: "Pre-baked container images with model weights on persistent NVMe volumes. New workers mount the volume instead of downloading from S3. Cold start drops from 60s to 5s."

  7. No checkpoint corruption handling: "If the worker crashes, we restore from checkpoint." → What if the checkpoint itself is corrupted (partial write during crash)? Staff+ answer: "Checkpoint writes are atomic at the S3 object level: either the full object is written or it isn't. Keep the previous checkpoint until the new one is confirmed. On restore failure, fall back to the previous checkpoint or restart from scratch."


13. Interview Cheat Sheet

Time Allocation (45-minute interview)

| Phase | Time | What to Cover |
|---|---|---|
| Clarify requirements | 3 min | Scale (100K videos/day), one-worker-one-job constraint, spot preemption, latency SLA |
| API design | 4 min | Submit (202 Accepted + job_id), status polling, SSE progress, webhook |
| High-level design | 8 min | Priority queue (Redis) → Scheduler → GPU workers (spot + on-demand) → S3 (checkpoints + videos) |
| Deep dive: checkpoint/resume | 10 min | Why it matters (cost math), checkpoint format, SIGTERM handler, resume flow, interval optimization |
| Deep dive: scheduling + leases | 8 min | Lease-based acquisition, heartbeat, reaper, optimistic locking, exactly-once guarantees |
| Failure modes + scaling | 8 min | Spot preemption (graceful vs. ungraceful), fleet-wide reclamation, cold starts, scaling journey |
| Trade-offs + wrap-up | 4 min | Spot ratio, checkpoint frequency, pull vs. push, what I'd build next |

Step-by-Step Answer Guide

  1. Clarify: "One GPU = one video at a time. Spot instances can die anytime. How many videos/day? What's the latency SLA? Is there priority between users?"

  2. Key insight: "The hard part isn't the queue – it's that workers can be killed at any time. A 10-minute render killed at minute 9 wastes $0.45 of GPU time without checkpointing."

  3. API: Submit returns 202 with job_id. Status via polling or SSE. Webhook on completion.

  4. Single machine: One GPU, one queue, SQLite. Works for 288 videos/day.

  5. Prove it fails: "100K videos/day needs ~1,000 GPUs. At $0.97/video on-demand, that's $2.9M/month. Spot saves 60–70% but workers die randomly."

  6. Distributed architecture: Redis sorted set (priority queue) → stateless scheduler → GPU workers (70% spot, 30% on-demand) → S3 (checkpoints every 60s + final videos).

  7. Checkpoint deep dive: "Every 60 seconds, the worker serializes intermediate latents to S3 (5 seconds, ~1GB). On SIGTERM, emergency checkpoint before shutdown. Re-queued job resumes from last checkpoint on any worker."

  8. Lease mechanism: "Worker acquires job via optimistic locking (conditional UPDATE). Heartbeat every 60s extends 5-min lease. Lease reaper re-queues expired jobs. Completion requires matching worker_id to prevent duplicate writes."

  9. Failure handling: "Graceful preemption: checkpoint + release lease → no wait. Ungraceful: 5-min lease expiry → re-queue from last checkpoint. Fleet-wide reclamation: spike the on-demand reserve pool."

  10. Scaling: "Single GPU → pool + Redis → spot + checkpoints → multi-region + model warmth."

  11. Trade-offs: "Spot ratio (70/30), checkpoint frequency (60s fixed), Redis vs. Kafka (Redis for priority), pull vs. push (pull for self-registering workers)."

What the Interviewer Wants to Hear

  • At L5/Senior: Job queue with workers, async API (202 Accepted), basic retry on failure. Mentions spot instances are unreliable.

  • At L6/Staff: Checkpoint/resume with quantified cost savings, lease-based exactly-once with optimistic locking, priority queues with tier isolation, SIGTERM handling for graceful preemption, cold start mitigation via model warmth routing. References how Runway/Replicate actually built their GPU scheduling.

  • At L7/Principal: Fleet-wide preemption resilience (reserve pool, S3 prefix sharding for checkpoint write storms), cost modeling as a first-class architectural constraint ($1.7M/month savings drives the checkpoint system), organizational ownership boundaries (infra team owns GPU fleet, platform team owns scheduler, ML team owns model/checkpoint format), multi-region spot arbitrage for cost optimization.


Written as a reference for staff-level system design interviews. The scheduling and checkpoint patterns here apply beyond video generation to any long-running GPU workload: model training, 3D rendering, scientific simulation, and large-scale batch inference.
