System Design: Text-to-Video Generation Pipeline (Sora-like)
From a Single GPU Worker to Preemption-Resilient Distributed Rendering: A Staff Engineer's Guide
1. The Problem & Why It's Hard
You're asked to design the backend job scheduling and processing pipeline for a text-to-video AI service like OpenAI's Sora. Users submit text prompts, and the system generates videos: a process that takes minutes per request, consumes an entire GPU per job, and runs on a fleet of spot instances that can vanish at any moment.
On the surface, it's "just a job queue with GPU workers." The trap is thinking the hard part is execution. The hard part is that your compute infrastructure is actively hostile: spot instances get preempted mid-render, each worker can only process one video at a time, and a single 10-second video generation consumes GPU resources equivalent to thousands of ChatGPT queries ($0.50–$2.00 actual compute cost per video).
The interviewer's real question: Can you design a job scheduling system where workers are unreliable, expensive, and single-threaded, and still guarantee every user's video gets generated exactly once, with graceful handling of preemptions that can kill a 10-minute render at minute 9?
The constraints that make this different from a typical job queue:
One worker = one job: A GPU worker can only process one video at a time. No batching, no multiplexing. This means your fleet size directly equals your concurrency.
Workers die without warning: Spot instances get a 2-minute termination notice (AWS) or none at all (some providers). A 10-minute render killed at minute 9 is 9 minutes of GPU time wasted; at $3/hour per H100, that's $0.45 burned.
Jobs are expensive and long-running: Unlike sub-second API calls, video generation takes 2–15 minutes. A retry doesn't cost milliseconds; it costs dollars and minutes of user wait time.
GPU scarcity is real: H100s cost $3–4/hour on-demand. Spot instances save 60–70% but have 5–15% interruption rates. Your scheduler must optimize for cost while maintaining throughput.
Staff+ Signal: The deceptively hard part isn't the queue or the workers; it's checkpoint/resume. Without checkpointing, a spot preemption at minute 9 of a 10-minute render wastes 90% of the compute. With checkpointing every 30 seconds, you lose at most 30 seconds of work. But checkpointing a diffusion model's intermediate state to S3 takes 5–15 seconds and consumes memory bandwidth, so checkpoint frequency is a direct trade-off between wasted compute on preemption and overhead during normal execution. Most candidates never mention this. Staff+ candidates design the checkpoint interval as a function of spot interruption probability.
2. Requirements & Scope
Functional Requirements
Prompt submission: Users submit text prompts to generate videos (with optional parameters: duration, resolution, aspect ratio, style)
Async processing: Return a job ID immediately; user polls or receives a webhook when complete
Job queue with priority: Support priority tiers (paid users, free tier, internal)
Progress tracking: Real-time progress updates (percentage complete, current frame)
Result delivery: Generated video stored in object storage, downloadable via signed URL
Retry on failure: Automatic retry on worker crash or spot preemption (with checkpoint resume)
Cancellation: Users can cancel in-progress jobs
Rate limiting: Per-user and per-tier job submission limits
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Job pickup latency | < 30s from submission | User should see "processing" quickly, not sit in an unknown state |
| Generation throughput | 100K videos/day | Mid-scale service (Sora reportedly handles millions) |
| Availability | 99.9% (job acceptance) | Users must always be able to submit; processing can queue |
| Video generation SLA | < 5 min for 10s video (p95) | Competitive with Sora/Runway |
| Preemption recovery | < 60s to resume from checkpoint | Minimize wasted GPU time |
| Exactly-once generation | No duplicate or lost videos | Double-generation wastes expensive GPU time |
Scale Estimation (Back-of-Envelope)
Why these numbers matter for the interview:
1,042 concurrent GPUs means you need a large fleet with autoscaling, not a static pool
$1.2M/month compute cost means spot instance optimization is a business-critical decision, not a nice-to-have
1M checkpoint writes/day at 1GB each means S3 is the only viable checkpoint store (not Redis, not local disk)
14 writes/sec for job state is manageable for a single PostgreSQL instance; the bottleneck is GPU scheduling, not the database
Staff+ Signal: The most important derived number is the cost of wasted compute from preemptions. With 730 spot instances at a 10% interruption rate, ~73 jobs get preempted per hour. Without checkpointing, each wastes an average of 2.5 minutes of GPU time ($0.15 each). That's $262/day wasted. With 30-second checkpoints, waste drops to $0.03 per preemption, about $53/day. The checkpoint system pays for itself in 2 days. This is the kind of math that makes an interviewer's eyes light up.
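The waste math above can be reproduced in a few lines. The inputs (fleet size, interruption rate, per-preemption cost with and without checkpointing) are taken directly from the estimate in the text; everything else is derived:

```python
# Inputs taken from the back-of-envelope estimate above.
SPOT_INSTANCES = 730
HOURLY_INTERRUPTION_RATE = 0.10
COST_PER_PREEMPTION_NO_CKPT = 0.15   # $ of GPU time lost, no checkpointing
COST_PER_PREEMPTION_30S_CKPT = 0.03  # $ lost with 30-second checkpoints

# ~73 preemptions/hour, ~1,752/day
preemptions_per_day = SPOT_INSTANCES * HOURLY_INTERRUPTION_RATE * 24

waste_no_ckpt = preemptions_per_day * COST_PER_PREEMPTION_NO_CKPT    # ~$262/day
waste_with_ckpt = preemptions_per_day * COST_PER_PREEMPTION_30S_CKPT  # ~$53/day
savings_per_day = waste_no_ckpt - waste_with_ckpt                     # ~$210/day
```

At roughly $210/day in avoided waste, a checkpoint system that costs a few engineer-days to build recoups its cost almost immediately, which is the point of the signal above.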
3. Phase 1: Single GPU Worker
Start with the simplest thing that works: one machine, one GPU, one job at a time.
How It Works
What Works at This Scale
Dead simple: No distributed systems. One process, one GPU. Easy to debug.
No coordination overhead: No leases, no heartbeats, no race conditions.
Model stays warm: Since there's only one worker, the model stays loaded in GPU VRAM between jobs. No cold start after the first job.
Handles ~288 videos/day: At 5 min/video, one GPU processes 12/hour × 24 = 288 videos/day.
When Does Phase 1 Work?
Internal tool, prototype, or very small user base (< 300 videos/day). Perfect for validating the model and API contract before investing in distributed infrastructure.
4. Why Naive Fails (The Math)
Problem 1: Throughput Ceiling
Adding more GPUs means distributed scheduling. The moment you have >1 worker, you need a queue, job assignment, and failure handling.
Problem 2: Spot Preemption Destroys Work
At scale with 730 spot instances and 10% hourly interruption rate:
| Metric | No checkpointing | 60s checkpoints | 120s checkpoints |
|---|---|---|---|
| GPU hours wasted/day | 182 hours | 6 hours | 36 hours |
| Cost wasted/day | $637 | $21 | $126 |
| Extra user wait time (avg) | +4 min | +1 min | +2.5 min |
| Checkpoint overhead | 0% | ~8% (5s save / 60s interval) | ~4% (5s save / 120s interval) |
Problem 3: No Priority or Fairness
A free-tier user submitting 100 videos can block a paying customer's single urgent request. Without priority queues, first-come-first-served treats all jobs equally, which is unacceptable for a paid product.
Problem 4: GPU Cold Starts
When a new GPU worker spins up, it must load the model into VRAM, a process that takes 30–90 seconds for large diffusion models. If you're rapidly scaling up during peak hours, users see an extra minute of delay. This is worse on spot instances, which are provisioned fresh each time.
Staff+ Signal: The checkpoint overhead math is the key insight. Checkpointing every 60 seconds adds ~8% overhead to normal execution (5 seconds to serialize intermediate latents to S3 per 60 seconds of compute). But it reduces wasted compute by 97% during preemptions. The break-even point: if the preemption probability per job exceeds 3% (which it does on spot instances), checkpointing is strictly better. Most candidates either skip checkpointing entirely or add it without quantifying the trade-off.
5. Phase 2: Distributed Architecture
The key architectural insight: Treat GPU workers as unreliable, stateless execution units that lease jobs and checkpoint progress to durable storage, so any worker can resume any job from its last checkpoint.
How Real Companies Built This
Runway + Anyscale (Ray): Runway built Gen-3 Alpha on Anyscale's managed Ray platform. Their initial approach used KubeRay on Kubernetes, but it broke down when 4–5 researchers ran concurrent jobs: "researchers would accidentally set up their resources incorrectly and accidentally mess up someone else's job." They moved to Anyscale for managed multi-tenancy, achieving 13x faster model loading and an 85% reduction in pipeline deployment time. Key insight: they separate preprocessing onto CPU pools and reserve GPUs strictly for inference/training, never mixing workloads on the same GPU.
AMD ROCm Video Generation Serving: AMD's reference architecture uses Redis as the job queue (RPUSH for submission, BLPOP for workers) with per-job response queues (job_resp:<job_id>). Rank 0 of a Torchrun distributed group fetches jobs from Redis and broadcasts to all ranks. The separation of API server, Redis queue, and Torchrun workers allows independent scaling of each tier, a pattern directly applicable to our design.
Replicate: Replicate manages warm/cold model states across their GPU fleet. Popular models are kept "warm" (loaded in VRAM) for instant startup. Less-used models go "cold" and require 30–60 second cold starts. Custom models run in private containers with up to 1-minute cold boots. Their key insight: model warmth is a scheduling dimension; route jobs to workers that already have the model loaded.
AWS Spot Instance Best Practices: AWS provides a 2-minute interruption notice via EC2 metadata and EventBridge. Best practice is to catch the SIGTERM signal, checkpoint immediately, and gracefully shut down. For GPU workloads, AWS recommends checkpointing every 15 minutes for training, but video generation jobs (shorter, higher value per minute) benefit from more frequent checkpoints.
Key Data Structure: Job State Machine
Each job record carries:
checkpoint_uri: S3 path to the latest checkpoint (null if none)
checkpoint_step: Diffusion step number at the last checkpoint
lease_expires_at: Heartbeat-based TTL
worker_id: Which worker currently holds the lease
attempt: Current attempt number (for idempotency)
6. Core Component Deep Dives
6.1 Priority Queue (Redis Sorted Set)
The queue must support priority ordering, not just FIFO. A paying customer's job should jump ahead of free-tier jobs.
Implementation: Redis sorted set with composite score:
Worker pulls jobs: ZPOPMIN job_queue atomically removes and returns the highest-priority (lowest-score) job.
Why not Kafka? Kafka provides ordering within partitions but not cross-partition priority. You'd need separate topics per priority level with weighted consumption, which is more complex than a sorted set. Redis is sufficient at our scale (14 writes/sec is trivial for Redis).
6.2 Job Scheduler (Lease Manager)
The scheduler is responsible for two things: assigning jobs to workers and detecting dead workers.
Job acquisition flow:
Lease renewal (heartbeat):
Workers call /heartbeat every 60 seconds:
Lease reaper (detects dead workers):
A cron job runs every 30 seconds:
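The three flows above (acquisition, heartbeat, reaping) all reduce to conditional UPDATEs against the jobs table. A hedged sketch using SQLite in place of PostgreSQL; the schema and column names follow the job record described earlier, but the exact DDL is an assumption:

```python
import sqlite3

LEASE_TTL = 300.0  # 5-minute lease, per the design

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    id TEXT PRIMARY KEY, status TEXT, worker_id TEXT, lease_expires_at REAL)""")
db.execute("INSERT INTO jobs VALUES ('job-42', 'queued', NULL, 0)")
db.commit()

def acquire(job_id: str, worker: str, now: float) -> bool:
    # Conditional UPDATE: succeeds only if the job is queued or its lease lapsed.
    # Exactly one worker's UPDATE matches; the others see rowcount == 0.
    cur = db.execute(
        """UPDATE jobs SET status='running', worker_id=?, lease_expires_at=?
           WHERE id=? AND (status='queued' OR lease_expires_at < ?)""",
        (worker, now + LEASE_TTL, job_id, now))
    db.commit()
    return cur.rowcount == 1

def heartbeat(job_id: str, worker: str, now: float) -> bool:
    # Called every 60s by the lease holder; extends the TTL.
    cur = db.execute(
        "UPDATE jobs SET lease_expires_at=? WHERE id=? AND worker_id=?",
        (now + LEASE_TTL, job_id, worker))
    db.commit()
    return cur.rowcount == 1

def reap(now: float) -> int:
    # Reaper cron: re-queue jobs whose lease expired (ungraceful worker death).
    cur = db.execute(
        "UPDATE jobs SET status='queued', worker_id=NULL "
        "WHERE status='running' AND lease_expires_at < ?", (now,))
    db.commit()
    return cur.rowcount
```

The same conditional-UPDATE shape also guards job completion later in the design, which is what prevents a partitioned worker from double-writing results.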
Why a 5-minute lease TTL? Video generation jobs are long-running (2–15 minutes). A 5-minute TTL with 60-second heartbeats gives 5 missed heartbeats before a job is considered abandoned. This balances fast detection (don't wait 30 minutes) with resilience to transient network issues (don't false-positive on a 2-second blip).
Staff+ Signal: The lease TTL must be tuned to the preemption notice window. AWS gives a 2-minute warning before spot termination. If your lease TTL is 5 minutes and the worker receives a SIGTERM, it has 2 minutes to: (1) checkpoint current state to S3, (2) release the lease explicitly, and (3) shut down gracefully. If the worker uses the full 2 minutes for checkpointing and graceful shutdown, the job is re-queued immediately, with no waiting for lease expiry. The 5-minute TTL is the fallback for ungraceful deaths (kernel panic, network partition, OOM kill) where no SIGTERM is received.
6.3 GPU Worker (Checkpoint/Resume Engine)
The worker is the most complex component. It must:
Pull a job from the scheduler
Check for existing checkpoint (resume vs. fresh start)
Execute diffusion steps with periodic checkpointing
Handle SIGTERM gracefully (spot preemption)
Upload final video to S3
Checkpoint format:
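A plausible checkpoint payload, following the job-record fields described earlier (step number, attempt). This is a sketch: a real worker would serialize tensors with `torch.save`; plain pickle keeps it dependency-free, and the CRC field is an assumption added to illustrate corruption detection on restore:

```python
import pickle
import time
import zlib

def make_checkpoint(job_id: str, step: int, latents: bytes,
                    rng_state: bytes, attempt: int) -> bytes:
    record = {
        "job_id": job_id,
        "checkpoint_step": step,   # diffusion step to resume from
        "latents": latents,        # intermediate denoised latents (~1GB in practice)
        "rng_state": rng_state,    # sampler RNG state, so resume is deterministic
        "attempt": attempt,        # guards against stale checkpoints across retries
        "created_at": time.time(),
        "crc32": zlib.crc32(latents),  # detects partial/corrupt writes on restore
    }
    return pickle.dumps(record)

def restore_checkpoint(blob: bytes) -> dict:
    record = pickle.loads(blob)
    if zlib.crc32(record["latents"]) != record["crc32"]:
        # Caller falls back to the previous checkpoint, per the failure table.
        raise ValueError("corrupt checkpoint")
    return record
```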
Checkpoint interval calculation:
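One standard way to reason about the interval is Young's approximation, which minimizes save overhead (C/T per unit time) plus expected rework on interruption (T/2 per event). A sketch, using this document's numbers (5-second saves, 10%/hour interruption rate per instance):

```python
import math

def optimal_checkpoint_interval(save_seconds: float,
                                interruptions_per_hour: float) -> float:
    """Young's approximation: T* = sqrt(2 * C * MTBF)."""
    mtbf = 3600.0 / interruptions_per_hour  # mean time between interruptions, seconds
    return math.sqrt(2 * save_seconds * mtbf)

# 5s save cost, 0.1 interruptions/hour -> T* = sqrt(2 * 5 * 36000) = 600s
```

Note that the pure cost optimum here is ~10 minutes, longer than the 60s interval this design ships. The gap is deliberate: the cost model ignores user-visible recovery time, and the < 60s preemption-recovery SLO pushes the interval down.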
6.4 Notification Service (Progress + Completion)
Users need real-time progress updates and completion notifications.
SSE for progress streaming:
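A minimal sketch of the SSE framing (event and payload field names are hypothetical; any ASGI/WSGI framework can stream the generator's output with `Content-Type: text/event-stream`):

```python
import json
from typing import Iterator

def sse_progress(events: Iterator[dict]) -> Iterator[str]:
    """Format job status updates as Server-Sent Events frames.
    Each frame is an 'event:' line, a 'data:' line, and a blank-line terminator."""
    for ev in events:
        yield f"event: {ev['type']}\n"
        yield f"data: {json.dumps(ev['payload'])}\n\n"
```

The event stream can carry the non-monotonic states discussed below (e.g. a `resuming` event after a preemption) rather than forcing progress into a single percentage.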
Webhook for async consumers:
Implementation: The notification service watches the jobs table via CDC (PostgreSQL WAL → Debezium → Kafka topic). When a job status changes, it pushes to connected SSE clients and fires webhooks.
Staff+ Signal: Progress reporting has a subtle UX challenge. During a spot preemption, progress jumps backward: from 85% to "re-queuing" to 0% to 85% (after checkpoint restore). Most candidates design linear progress bars. Staff+ candidates design for non-monotonic progress with a state like "resuming from checkpoint at 85%" that explains the discontinuity. Sora's UI shows "processing" without a percentage for exactly this reason: it avoids the jarring backward jump.
7. The Scaling Journey
Stage 1: Single Worker (0–300 videos/day)
One H100, one process. Model stays warm in VRAM. No checkpointing needed (no spot instances at this scale; use on-demand). Cost: ~$2,500/month.
Limit: Throughput ceiling at 288 videos/day. No fault tolerance.
Stage 2: Worker Pool + Redis Queue (300–10K videos/day)
New capabilities:
Redis sorted set for priority queuing
Multiple GPU workers polling for jobs
PostgreSQL for job state and audit trail
S3 for video output storage
Basic heartbeat (5-min lease TTL)
Limit: On-demand GPUs only, so cost is $0.97/video. At 10K videos/day, that's $9,700/day ($291K/month). No checkpointing, so a worker crash means a full restart.
Stage 3: Spot Instances + Checkpoint/Resume (10K–100K videos/day)
New capabilities at this stage:
70/30 spot/on-demand GPU split (saves ~50% vs. all on-demand)
Checkpoint/resume: intermediate latents saved to S3 every 60 seconds
SIGTERM handler: graceful checkpoint on spot preemption (2-min warning)
Lease reaper: detects ungraceful worker deaths, re-queues with checkpoint
Priority tiers: paid users jump the queue
Autoscaling: scale spot fleet based on queue depth
Cost at this stage:
Limit: Single-region. Single PostgreSQL. No model warmth optimization (cold starts on new spot instances).
Stage 4: Multi-Region Enterprise (100K+ videos/day)
Everything in Stage 3, plus:
Model cache layer: Pre-load model weights on persistent NVMe volumes. New spot instances mount the volume instead of downloading from S3 (cold start drops from 60s to 5s)
Multi-region deployment: GPU fleet in us-east-1, us-west-2, eu-west-1 with geo-routed job assignment
Tiered SLA: Enterprise customers get dedicated on-demand GPU pools with guaranteed capacity
Cost optimization: Predictive spot pricing β shift non-urgent jobs to regions/times with cheapest spot rates
A/B model routing: Route jobs to workers running different model versions for quality comparison
Staff+ Signal: At Stage 4, the team structure matters as much as the architecture. The GPU fleet should be owned by an infrastructure team. The scheduler and job state by a platform team. The model loading and inference by an ML engineering team. The API and user experience by a product team. Drawing these boundaries wrong means four teams coordinate on every spot preemption incident. The scheduler/checkpoint system is the critical interface β it must be owned by a single team with clear SLOs for job completion rate and preemption recovery time.
8. Failure Modes & Resilience
Request Flow with Failure Handling
Failure Scenarios
| Failure | Detection | Response | Blast radius |
|---|---|---|---|
| Spot preemption (graceful) | SIGTERM signal | Checkpoint to S3, release lease, re-queue immediately | Single job; loses ≤2 min of progress |
| Worker OOM kill | Lease expires (5 min) | Reaper re-queues; resumes from last checkpoint | Single job; loses up to 60s of progress + 5 min detection delay |
| Worker network partition | Heartbeat fails; lease expires | Re-queued after 5 min; original worker may still be running (lease check on completion prevents double-write) | Single job; brief duplicate execution possible |
| Redis failure | Connection errors | Workers fall back to polling PostgreSQL directly (degraded mode) | All new job scheduling pauses; running jobs continue |
| PostgreSQL failure | Connection pool errors | Failover to synchronous replica; workers buffer heartbeats in memory | New job creation pauses; running jobs continue (workers hold the job payload in memory) |
| S3 checkpoint failure | Upload error, timeout | Retry with exponential backoff; if persistent, worker logs a warning and continues without checkpointing | Increased risk: if preempted before the next successful checkpoint, more progress is lost |
| Model corruption in VRAM | NaN outputs, CUDA errors | Worker kills itself; job re-queued from checkpoint | Single job; worker restarts with a fresh model load |
| Checkpoint corruption | Restore fails (bad tensor) | Fall back to previous checkpoint; if none valid, restart from scratch | Single job; worst case full restart |
The Duplicate Execution Problem
The most subtle failure mode: Worker A is running Job #42, experiences a network partition (can't heartbeat), the lease expires, and Worker B picks up the job from checkpoint. Meanwhile, Worker A's network recovers and it finishes the job. Now both workers try to upload the final video.
Solution: Optimistic locking on job completion:
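A sketch of the guard, using SQLite in place of PostgreSQL (the table and column names follow the job record described earlier; the exact schema is an assumption). The WHERE clause is the entire mechanism: only the current lease holder's UPDATE matches a row:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs "
           "(id TEXT PRIMARY KEY, status TEXT, worker_id TEXT, video_uri TEXT)")
# Worker B now holds the lease on job 42 after Worker A's lease expired.
db.execute("INSERT INTO jobs VALUES ('job-42', 'running', 'worker-B', NULL)")
db.commit()

def complete(job_id: str, worker: str, video_uri: str) -> bool:
    # Optimistic lock: the row only updates if this worker still holds the lease
    # and the job is still running. A stale worker sees rowcount == 0 and must
    # discard its result.
    cur = db.execute(
        "UPDATE jobs SET status='complete', video_uri=? "
        "WHERE id=? AND worker_id=? AND status='running'",
        (video_uri, job_id, worker))
    db.commit()
    return cur.rowcount == 1
```

Worker A's attempt fails the worker_id check; Worker B's succeeds; a second completion attempt fails the status check, so the final video is written exactly once.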
Worker A's completion fails (its worker_id no longer matches), and it discards its result. Worker B completes normally.
Staff+ Signal: The most operationally dangerous scenario isn't a single worker failure; it's a fleet-wide spot reclamation. Cloud providers sometimes reclaim large numbers of spot instances simultaneously (e.g., when a large customer requests on-demand capacity). If 70% of your fleet disappears in 2 minutes, you have 700+ jobs checkpointing to S3 simultaneously, creating a write storm. Your S3 request rate might spike from 100/sec to 10,000/sec, hitting S3's per-prefix rate limits (3,500 PUT/sec). The fix: partition checkpoints across multiple S3 prefixes (s3://checkpoints/{job_id_prefix}/{job_id}/) and use S3 Express One Zone for low-latency writes.
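A small helper makes the prefix sharding concrete. The shard width (two hex characters, 256 prefixes) and key layout are illustrative assumptions; the important property is that the shard is a deterministic function of the job ID, so writers and readers derive the same key without coordination:

```python
import hashlib

def checkpoint_key(job_id: str, step: int) -> str:
    """Spread checkpoint PUTs across S3 prefixes so a fleet-wide preemption
    doesn't funnel every write through one prefix's 3,500 PUT/sec limit."""
    shard = hashlib.sha256(job_id.encode()).hexdigest()[:2]  # 256 shards
    return f"checkpoints/{shard}/{job_id}/step_{step}.ckpt"
```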
9. Data Model & Storage
Core Tables (PostgreSQL)
Checkpoint Storage (S3)
Video Storage (S3 + CDN)
Storage Engine Choice
| Engine | Used for | Why |
|---|---|---|
| PostgreSQL | Job state, user accounts, audit log | ACID for job state transitions; optimistic locking for lease management. 14 writes/sec is trivial. |
| Redis | Priority queue (sorted set) | Atomic ZPOPMIN for job acquisition; sub-millisecond latency; simple to operate at this write rate |
| S3 | Checkpoints, final videos, model weights | Virtually unlimited storage; 11 nines durability; S3 Express One Zone for low-latency checkpoint writes |
| S3 + CloudFront | Video delivery to users | CDN-backed signed URLs for fast download; geo-distributed edge caching |
10. Observability & Operations
Key Metrics
videogen_queue_depth{tier, priority} – pending jobs per tier; the primary autoscaling signal
videogen_queue_wait_seconds{tier, quantile} – time from submission to worker pickup; the metric users feel most
videogen_generation_seconds{resolution, quantile} – actual GPU generation time; helps capacity planning
videogen_checkpoint_duration_seconds{quantile} – checkpoint save time; if growing, S3 may be throttled
videogen_preemptions_total{region} – spot preemption count; spikes indicate AWS capacity pressure
videogen_preemption_recovery_seconds{quantile} – time from preemption to resumed execution; SLO target < 60s
videogen_lease_expirations_total – ungraceful worker deaths; spikes mean OOM or infrastructure issues
videogen_gpu_utilization{worker, gpu_id} – GPU compute utilization; should be >90% during execution
videogen_wasted_gpu_seconds_total – GPU time lost to preemptions (before checkpoint); the cost metric
videogen_estimated_drain_time_seconds{tier} – (queue_depth × avg_generation_time) / workers; the best single health metric
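The drain-time metric is simple enough to state in code; a sketch of the derivation (function and parameter names are illustrative):

```python
def estimated_drain_seconds(queue_depth: int, avg_generation_seconds: float,
                            available_workers: int) -> float:
    """Best single health number: how long until the current queue empties,
    assuming no new submissions. Infinite if no workers are available."""
    if available_workers == 0:
        return float("inf")
    return queue_depth * avg_generation_seconds / available_workers
```

For example, 200 queued jobs at 5 minutes each across 100 workers is a 10-minute drain, right at the alert threshold below.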
Distributed Tracing
A full trace for a single video generation:
Alerting Strategy
| Alert | Condition | Severity | Action |
|---|---|---|---|
| Queue drain time > 10 min | Sustained for 15 min | P2 | Scale GPU fleet; check spot availability |
| Preemption rate > 20%/hour | Spike detection | P1 | Shift to more on-demand capacity; check AWS spot advisor |
| Checkpoint save > 30s | p95 sustained 10 min | P2 | Check S3 throttling; verify prefix sharding |
| Lease expirations > 5/min | Sustained 5 min | P1 | Worker fleet issue: OOM kills, CUDA errors, or network partition |
| GPU utilization < 50% | Any worker sustained 5 min | P3 | Worker may be stuck; investigate CUDA hang |
| Wasted GPU hours > 20/day | Daily aggregate | P3 | Review checkpoint frequency; investigate preemption patterns |
| Job completion rate < 95% | Per-hour measurement | P1 | Systematic failure; check model, S3, or infrastructure |
Staff+ Signal: The most actionable on-call dashboard isn't per-metric; it's a cost efficiency heatmap showing GPU utilization × preemption waste × queue depth per region. A region with high GPU utilization, low waste, and low queue depth is healthy. A region with low utilization means cold start problems (model not loaded). A region with high waste means checkpoint frequency needs tuning. A region with high queue depth means scaling lag. Each combination points to a different root cause and a different runbook entry.
11. Design Trade-offs
| Decision | Option A | Option B | Choice | Rationale |
|---|---|---|---|---|
| Queue backend | Redis sorted set | Kafka with priority topics | Redis | At 14 writes/sec, Redis is simpler and supports true priority ordering with ZPOPMIN. Kafka would require separate topics per priority with weighted consumption. Two-way door: migrate to Kafka if >10K jobs/sec. |
| Checkpoint storage | S3 Standard | S3 Express One Zone | S3 Express for hot checkpoints | Express One Zone has single-digit-ms latency vs. ~50ms for Standard. Checkpoints are written every 60s, so latency matters. 10x more expensive per GB, but checkpoints are short-lived (deleted after 24h). |
| Worker model | Pull (worker polls for jobs) | Push (scheduler dispatches to workers) | Pull | Workers self-register by polling. The scheduler doesn't need to track which workers are alive. Adding capacity = launching more workers. No server-side worker registry. |
| Spot vs. on-demand ratio | 90/10 spot-heavy | 50/50 balanced | 70/30 | 90% spot saves more but increases preemption risk during fleet-wide reclamations; 50/50 wastes money. 70/30 balances cost savings ($640K/month) with resilience. Adjustable per region. |
| Checkpoint frequency | Fixed interval (60s) | Adaptive (based on spot interruption probability) | Fixed 60s for v1 | Fixed is simpler to reason about and debug. Adaptive is better in theory (checkpoint more often when the spot market is volatile) but requires real-time spot pricing feeds and adds complexity. Ship fixed, iterate to adaptive. |
| Job assignment | Random worker | Affinity-based (prefer workers with the model loaded) | Affinity-based | Avoids a 30–60s cold start. If no warm worker is available, assign to any worker; a cold start is better than queue delay. |
| Progress reporting | Polling (client polls GET /status) | Push (SSE/WebSocket) | Both | SSE for interactive users (watching in browser). Polling for API consumers (simpler integration). Webhook for async consumers. |
Staff+ Signal: The spot/on-demand ratio is a one-way door during peak hours. If you're running 70% spot and AWS reclaims half your spot fleet, you're suddenly running at 65% capacity with a full queue. Scaling up on-demand takes 2–5 minutes (instance launch + model load). During that window, your queue grows by ~200 jobs. The mitigation is to maintain a reserve pool of pre-warmed on-demand instances at 10% of fleet size that can absorb spot reclamation shocks instantly. This costs an extra $120K/month but prevents SLA violations during fleet-wide preemption events, a trade-off that only makes sense at >$500K/month total spend.
12. Common Interview Mistakes
Treating GPU workers like web servers: "I'll use a load balancer to distribute requests across GPUs." But GPUs process one job at a time for 5 minutes; this isn't HTTP request routing, it's job scheduling. A load balancer sends the next request to any available server. A job scheduler must track which workers are busy, handle 5-minute leases, and manage checkpoints. Staff+ answer: "This is a job queue pattern, not a request routing pattern. Workers pull from a priority queue when they're free."
Ignoring spot preemption entirely: "Workers run the job and report results." But what happens when the worker dies at 90% completion? Staff+ answer: "Workers checkpoint intermediate state to S3 every 60 seconds. On preemption, the SIGTERM handler saves one final checkpoint. The re-queued job resumes from the last checkpoint on a new worker."
No priority differentiation: "All jobs go in one FIFO queue." But a paying customer's job shouldn't wait behind 1,000 free-tier jobs. Staff+ answer: "Redis sorted set with composite score: (priority_tier × 1B) + timestamp. Paid jobs always dequeue before free-tier jobs submitted earlier."
Designing synchronous video generation: "The API generates the video and returns it in the response." But a 5-minute generation blocks the HTTP connection. Staff+ answer: "Return 202 Accepted with a job_id immediately. The client polls for status or subscribes to SSE/webhook for completion notification."
No cost awareness: "We'll use on-demand H100s for everything." At 100K videos/day, that's $2.9M/month vs. $960K/month with spot optimization. Staff+ answer: "A 70/30 spot/on-demand split with checkpoint/resume saves $1.7M/month. The checkpoint system pays for itself in 2 days."
Forgetting model cold starts: "A new worker picks up a job and runs it." But loading a diffusion model into GPU VRAM takes 30–90 seconds. If you're scaling up 100 workers during peak, every user waits an extra minute. Staff+ answer: "Pre-baked container images with model weights on persistent NVMe volumes. New workers mount the volume instead of downloading from S3. Cold start drops from 60s to 5s."
No checkpoint corruption handling: "If the worker crashes, we restore from checkpoint." But what if the checkpoint itself is corrupted (a partial write during a crash)? Staff+ answer: "Checkpoint writes are atomic at the S3 object level: either the full object is written or it isn't. Keep the previous checkpoint until the new one is confirmed. On restore failure, fall back to the previous checkpoint or restart from scratch."
13. Interview Cheat Sheet
Time Allocation (45-minute interview)
| Phase | Time | Focus |
|---|---|---|
| Clarify requirements | 3 min | Scale (100K videos/day), one-worker-one-job constraint, spot preemption, latency SLA |
| API design | 4 min | Submit (202 Accepted + job_id), status polling, SSE progress, webhook |
| High-level design | 8 min | Priority queue (Redis) → scheduler → GPU workers (spot + on-demand) → S3 (checkpoints + videos) |
| Deep dive: checkpoint/resume | 10 min | Why it matters (cost math), checkpoint format, SIGTERM handler, resume flow, interval optimization |
| Deep dive: scheduling + leases | 8 min | Lease-based acquisition, heartbeat, reaper, optimistic locking, exactly-once guarantees |
| Failure modes + scaling | 8 min | Spot preemption (graceful vs. ungraceful), fleet-wide reclamation, cold starts, scaling journey |
| Trade-offs + wrap-up | 4 min | Spot ratio, checkpoint frequency, pull vs. push, what I'd build next |
Step-by-Step Answer Guide
Clarify: "One GPU = one video at a time. Spot instances can die anytime. How many videos/day? What's the latency SLA? Is there priority between users?"
Key insight: "The hard part isn't the queue; it's that workers can be killed at any time. A 10-minute render killed at minute 9 wastes $0.45 of GPU time without checkpointing."
API: Submit returns 202 with job_id. Status via polling or SSE. Webhook on completion.
Single machine: One GPU, one queue, SQLite. Works for 288 videos/day.
Prove it fails: "100K videos/day needs 350 GPUs. At $0.97/video on-demand, that's $2.9M/month. Spot saves 70% but workers die randomly."
Distributed architecture: Redis sorted set (priority queue) → stateless scheduler → GPU workers (70% spot, 30% on-demand) → S3 (checkpoints every 60s + final videos).
Checkpoint deep dive: "Every 60 seconds, the worker serializes intermediate latents to S3 (5 seconds, ~1GB). On SIGTERM, emergency checkpoint before shutdown. Re-queued job resumes from last checkpoint on any worker."
Lease mechanism: "Worker acquires job via optimistic locking (conditional UPDATE). Heartbeat every 60s extends 5-min lease. Lease reaper re-queues expired jobs. Completion requires matching worker_id to prevent duplicate writes."
Failure handling: "Graceful preemption: checkpoint + release lease → zero wait. Ungraceful: 5-min lease expiry → re-queue from last checkpoint. Fleet-wide reclamation: spike the on-demand reserve pool."
Scaling: "Single GPU → pool + Redis → spot + checkpoints → multi-region + model warmth."
Trade-offs: "Spot ratio (70/30), checkpoint frequency (60s fixed), Redis vs. Kafka (Redis for priority), pull vs. push (pull for self-registering workers)."
What the Interviewer Wants to Hear
At L5/Senior: Job queue with workers, async API (202 Accepted), basic retry on failure. Mentions spot instances are unreliable.
At L6/Staff: Checkpoint/resume with quantified cost savings, lease-based exactly-once with optimistic locking, priority queues with tier isolation, SIGTERM handling for graceful preemption, cold start mitigation via model warmth routing. References how Runway/Replicate actually built their GPU scheduling.
At L7/Principal: Fleet-wide preemption resilience (reserve pool, S3 prefix sharding for checkpoint write storms), cost modeling as a first-class architectural constraint ($1.7M/month savings drives the checkpoint system), organizational ownership boundaries (infra team owns GPU fleet, platform team owns scheduler, ML team owns model/checkpoint format), multi-region spot arbitrage for cost optimization.
Written as a reference for staff-level system design interviews. The scheduling and checkpoint patterns here apply beyond video generation to any long-running GPU workload: model training, 3D rendering, scientific simulation, and large-scale batch inference.