System Design: CI/CD Pipeline Like GitHub Actions
From a Single Runner to Enterprise-Scale Distributed Execution: A Staff Engineer's Guide
1. The Problem & Why It's Hard
You're asked to design a CI/CD execution system like GitHub Actions. When a developer pushes code, the system must parse a workflow definition (YAML), execute a series of steps (build, test, deploy) on isolated compute, and report results in real time. Millions of workflow runs per day across hundreds of thousands of repositories.
On the surface, it's "just run some shell commands when code is pushed." The trap is thinking the hard part is execution. The hard part is guaranteeing that every job runs exactly once, even when workers crash, networks partition, and thousands of jobs are queued simultaneously.
The interviewer's real question: Can you design a multi-tenant job execution system that guarantees exactly-once semantics, scales horizontally, and recovers gracefully from worker crashes, while explaining the state machine that makes it all work?
The interviewer isn't asking you to build GitHub Actions. They're testing whether you understand:
How to turn a declarative workflow into an executable plan
How to distribute that plan across unreliable workers
How to ensure no job is skipped or double-executed
How to provide real-time feedback to users watching their builds
Staff+ Signal: "Exactly-once execution" is mentioned in 3 out of 5 interview stories for this question. Most candidates hand-wave it. The staff-level answer is: exactly-once is impossible in a distributed system; what you actually build is at-least-once delivery + idempotent execution. The lease-based job acquisition pattern (heartbeat + TTL) is the mechanism. If you can explain why this works and where it breaks, you've answered the core question.
2. Clarifying Questions
Before drawing a single box, ask these questions. Each one narrows the design space and shows the interviewer you think before you build.
Questions to Ask
"Are jobs in a workflow independent, or do they have dependencies?"
Determines whether you need a DAG scheduler or a simple linear executor
Most interviewers say linear first: jobs run sequentially. This is a gift: it simplifies scheduling enormously. Offer DAG as an extension.
"How many concurrent workflows do we need to support?"
Drives queue depth, worker fleet size, and database throughput
If the answer is "many customers, like GitHub," design for multi-tenancy from the start
"Do users need real-time log streaming, or is post-completion sufficient?"
Determines whether you need WebSocket/SSE infrastructure
Real-time is almost always required; it's table stakes for CI/CD UX
"What's the isolation model? Containers? VMs? Bare metal?"
Affects security boundaries and startup latency
Docker containers for cost efficiency, ephemeral VMs for security-critical workloads
"Do we need to support self-hosted runners, or only managed compute?"
Determines whether runners pull work or the system pushes to them
Pull-based is the industry standard; it's firewall-friendly and scales independently
"What's the maximum workflow duration?"
Affects heartbeat TTL and resource reservation
GitHub Actions caps at 6 hours. This bounds your lease duration.
The Key Simplification
Interviewers often say: "Start with linear execution: jobs run one after another, not as a complex graph." This is intentional. They want to see:
You can build a working system with simple sequential execution
You know that DAG scheduling is the natural extension
You don't over-engineer from the start
Staff+ Signal: The best candidates ask about linear vs. DAG upfront, then say: "I'll design for linear execution first, then show how the same state machine extends to DAG with the needs: keyword." This demonstrates progressive complexity, the hallmark of a senior system designer.
3. Requirements & Scope
Functional Requirements
Workflow registration: Users define workflows in YAML (trigger events, jobs, steps)
Event-driven triggering: git push, pull request, schedule (cron), manual dispatch
Job execution: Run steps (shell commands, Docker containers, reusable actions) in isolated environments
Step sequencing: Steps within a job execute sequentially, sharing a filesystem
Job dependencies: Support needs: for job ordering (linear first, DAG as extension)
Real-time status: Live log streaming and status updates (pending → running → success/failed)
Artifact management: Upload/download build artifacts between jobs
Secret management: Encrypted secrets injected at runtime, masked in logs
Multi-tenancy: Complete isolation between organizations/repositories
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Job pickup latency (p99) | < 30s from trigger event | Developer experience: fast feedback loop |
| Throughput | 1 billion jobs/day | GitHub Actions scale (millions of repos, matrix expansion) |
| Availability | 99.95% | CI/CD is critical but not end-user-facing; brief queuing delays are acceptable |
| Job execution guarantee | Exactly-once semantics | No skipped jobs, no double deployments |
| Max workflow duration | 6 hours | Bound resource consumption; force efficient pipelines |
| Log delivery latency | < 2s from step output | Real-time streaming is expected by developers |
Scale Estimation (Back-of-Envelope)
Why these numbers matter for the interview:
2.1M concurrent jobs means you cannot schedule from a single machine; you need distributed, stateless schedulers
58K DB writes/sec for step state transitions means PostgreSQL needs sharding or you use CDC (Change Data Capture) to avoid hot writes
100 TB/day of logs means PostgreSQL is out for log storage; you need object storage (S3) with a streaming layer
7.2M peak containers means autoscaling must be event-driven (not metric-based) to react in seconds
Staff+ Signal: The derived number most candidates miss is concurrent running jobs. They estimate throughput (jobs/sec) but forget to multiply by duration. At 3 minutes average duration, 11,600 jobs/sec means 2.1 million jobs running simultaneously. This single number drives fleet sizing, database connection pooling, and determines whether your scheduler can be centralized or must be distributed.
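The derivation above is worth being able to do on a whiteboard. A quick sketch of the arithmetic, using the throughput and duration figures from this section (the steps-per-job factor is an assumption added here to connect the numbers):

```python
# Back-of-envelope for the derived numbers above. The 3-minute average duration
# comes from the text; steps_per_job = 5 is an illustrative assumption that
# makes the ~58K writes/sec figure fall out.
jobs_per_day = 1_000_000_000
avg_jobs_per_sec = jobs_per_day / 86_400            # ~11,600 jobs/sec average

avg_duration_sec = 3 * 60
# Little's Law: concurrency = arrival rate x time in system
concurrent_jobs = avg_jobs_per_sec * avg_duration_sec

steps_per_job = 5                                   # assumption
step_writes_per_sec = avg_jobs_per_sec * steps_per_job

print(f"{avg_jobs_per_sec:,.0f} jobs/sec")          # ~11,574 jobs/sec
print(f"{concurrent_jobs:,.0f} concurrent jobs")    # ~2.1M running at once
print(f"{step_writes_per_sec:,.0f} step writes/sec")  # ~58K state transitions/sec
```

The multiply-by-duration step is the one candidates forget; everything else is division.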
4. REST API Design
Workflow Registration & Triggering
Workflow Runs & Jobs
Real-Time Log Streaming
Webhook Payload (Git Push Trigger)
Key API Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Trigger response | 202 Accepted with run_id | Workflow execution is async; don't block the push |
| Log streaming | SSE over WebSocket | Simpler, HTTP-compatible, works through proxies. WebSocket for bidirectional status updates. |
| Pagination | Cursor-based | Job lists grow; offset pagination degrades on large datasets |
| Idempotency | Idempotency-Key header on POST endpoints | Prevents duplicate workflow triggers on network retries |
| Versioning | /v1/ path prefix | One-way door: webhook consumers depend on these formats |
Staff+ Signal: The log streaming endpoint is the most technically interesting. You're streaming output from a container running on a worker machine, through a log aggregation service, to an SSE endpoint the browser consumes. The challenge is backpressure: a build generating 10,000 lines/sec must not overwhelm the browser. The solution is server-side buffering with configurable flush intervals (e.g., batch every 100ms or 100 lines, whichever comes first).
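The flush policy described above (batch every 100ms or 100 lines, whichever comes first) can be sketched as a small server-side buffer; class and callback names here are illustrative, not from any real log service:

```python
import time

class LogBatcher:
    """Server-side log buffer: flush every `max_lines` lines or `max_interval`
    seconds, whichever comes first. `flush_fn` is a hypothetical callback that
    emits one SSE event carrying a batch of lines."""

    def __init__(self, flush_fn, max_lines=100, max_interval=0.1):
        self.flush_fn = flush_fn
        self.max_lines = max_lines
        self.max_interval = max_interval
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, line):
        self.buffer.append(line)
        if (len(self.buffer) >= self.max_lines
                or time.monotonic() - self.last_flush >= self.max_interval):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)   # one SSE event per batch, not per line
            self.buffer = []
        self.last_flush = time.monotonic()

batches = []
b = LogBatcher(batches.append)
for i in range(250):
    b.add(f"line {i}")
b.flush()  # drain the tail when the step ends
print([len(batch) for batch in batches])
```

A 10,000-line/sec build thus costs the browser at most ~10 SSE events per second instead of 10,000.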
5. Phase 1: Single Machine CI Runner
Start simple. One machine, one runner binary, sequential job execution.
How It Works
Webhook arrives: An HTTP server receives the push event payload
YAML parsing: Read .github/workflows/*.yml, evaluate on: triggers against the event
Step execution: For each matching workflow, iterate through jobs and steps sequentially
Status reporting: Update a SQLite database with job/step status. Serve a simple dashboard.
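The whole Phase 1 loop fits in a few lines. A minimal sketch of the sequential runner (assumes a POSIX shell; the result-dict shape is illustrative):

```python
import subprocess
import tempfile

def run_job(steps):
    """Minimal single-machine runner: the job gets a fresh workspace,
    and steps run sequentially in it, sharing its filesystem."""
    workspace = tempfile.mkdtemp(prefix="job-")
    for i, cmd in enumerate(steps):
        result = subprocess.run(cmd, shell=True, cwd=workspace,
                                capture_output=True, text=True)
        if result.returncode != 0:
            # A failed step fails the job; later steps never run.
            return {"status": "FAILED", "failed_step": i, "log": result.stderr}
    return {"status": "SUCCESS", "workspace": workspace}

# "Build" writes a file; "test" finds it on the shared filesystem.
outcome = run_job(["echo built > artifact.txt", "grep -q built artifact.txt"])
print(outcome["status"])  # SUCCESS
```

Everything that follows in this document is about making this loop survive crashes, scale horizontally, and isolate tenants.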
What Works at This Scale
Simple and correct: No distributed systems complexity. Easy to debug.
Steps share a filesystem: Build step produces a binary → test step finds it on disk. No artifact passing needed within a job.
One job at a time: The runner processes one job, finishes, then picks up the next. No concurrency bugs.
Architectural Properties (preserved in the distributed version)
Jobs are isolated from each other: Different jobs get separate workspace directories
Steps within a job share state: Same filesystem, same environment
Container-based isolation is optional: container: in the YAML triggers Docker; otherwise steps run on the host
This single-machine model is exactly how a self-hosted GitHub Actions runner works. You install the runner binary, register it with GitHub, and it processes jobs sequentially. GitHub's open-source runner is a C# codebase forked from the Azure Pipelines Agent.
6. Why the Naive Approach Fails (The Math)
The single-machine runner works perfectly for a small team. Here's where it breaks:
Problem 1: Throughput Ceiling
Even with multiple runners on one machine, you're limited by CPU, memory, and I/O. A beefy 96-core machine running 20 parallel containers handles ~10,000 jobs/day. Hitting 1B jobs/day would take 100,000 such machines, so you need distributed scheduling.
Problem 2: Single Database Bottleneck
Problem 3: No Isolation Between Tenants
One user's npm install pulling 2GB of dependencies starves other users' builds of disk I/O. One user's infinite loop consumes 100% CPU. There's no resource boundary between tenants.
Problem 4: The Exactly-Once Problem Emerges
This is where the real system design begins. At scale, you need:
A stateless scheduler that can crash and restart without losing state
A lease-based job acquisition protocol so workers can claim jobs without conflicts
A step-level state machine so crashed jobs can be resumed, not restarted
Per-tenant resource isolation so noisy neighbors can't degrade the platform
7. Phase 2: Distributed Architecture

Architecture Overview
How Real Companies Built This
The architecture above isn't theoretical; it's derived from how GitHub Actions, GitLab CI, and Buildkite actually work.

GitHub Actions uses a pull-based model where runners long-poll a Broker API (broker.actions.githubusercontent.com). Each runner's Listener process makes GET /message requests with a 50-second long-poll timeout. When a job is available, the runner calls POST /acquirejob to claim it within a 2-minute window. A heartbeat loop calls POST /renewjob every 60 seconds with a 10-minute TTL: this is the lease mechanism that enables exactly-once semantics.
GitLab CI uses a similar pull model with its own runner binary. Runners register with the GitLab instance and poll for jobs. GitLab's unique contribution is the stages: concept (in addition to needs:), providing a simpler mental model for linear pipelines.
Buildkite takes the "bring your own compute" approach: Buildkite provides orchestration, you host all runners. Their Elastic CI Stack for AWS auto-scales runners using CloudFormation.
| | GitHub Actions | GitLab CI | Buildkite |
|---|---|---|---|
| Runner model | Pull (Broker API) | Pull (GitLab API) | Pull (API) |
| Default runners | Hosted ephemeral VMs | Shared or self-hosted | Self-hosted only |
| Job dependency | needs: keyword | needs: + stages: | depends_on: |
| Autoscaling | Webhook-driven (workflow_job events) | Docker Machine / K8s executor | Elastic CI Stack |
| Isolation | Fresh VM per job (hosted) | Docker/K8s (configurable) | Customer-managed |
Key Data Structure: Job/Step State Machine
Every job and step moves through a strict state machine. This is the foundation of crash recovery and exactly-once semantics:
Staff+ Signal: The state machine is the single most important data structure in the system. If you get this right, crash recovery and exactly-once semantics fall out naturally. If you get it wrong, you'll spend months debugging ghost jobs and double deployments. The key insight is that state transitions are idempotent: writing COMPLETED twice has no effect, but transitioning from QUEUED to RUNNING must happen exactly once (via a conditional UPDATE with version check).
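The conditional UPDATE with a version check is compact enough to show in full. A minimal sketch using SQLite standing in for PostgreSQL (the schema and function names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, version INTEGER)")
db.execute("INSERT INTO jobs VALUES (1, 'QUEUED', 0)")

def try_transition(conn, job_id, from_status, to_status, expected_version):
    """Optimistic-locking transition: the UPDATE only matches a row that is
    still in (from_status, expected_version), so at most one caller wins."""
    cur = conn.execute(
        "UPDATE jobs SET status = ?, version = version + 1 "
        "WHERE id = ? AND status = ? AND version = ?",
        (to_status, job_id, from_status, expected_version))
    conn.commit()
    return cur.rowcount == 1   # 1 row changed => we won the race

# Two workers race to move the same job QUEUED -> RUNNING:
print(try_transition(db, 1, "QUEUED", "RUNNING", 0))  # True  (first claim wins)
print(try_transition(db, 1, "QUEUED", "RUNNING", 0))  # False (row no longer matches)
```

The loser of the race sees rowcount 0 and simply moves on to the next job; no lock manager or coordination service is needed.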
8. Core Component Deep Dives
8.1 Workflow Engine (Event β Parse YAML β Create Steps)
The Workflow Engine is the first component in the pipeline. It receives events from the event bus and turns them into executable job records.
Responsibilities:
Event matching: Read every .yml file under .github/workflows/ at the push's HEAD commit. Evaluate on: triggers (branch filters, path filters, event types) against the incoming event.
YAML parsing: Extract jobs, steps, matrix strategies, environment variables, and secrets references.
Job record creation: For each matching workflow, create job records in the database with status QUEUED. For linear execution, jobs are ordered by sequence number. For DAG execution, jobs record their needs: dependencies.
Matrix expansion: A matrix strategy like {os: [ubuntu, macos], node: [18, 20]} expands one job definition into 4 concrete jobs.
Design choice: The Workflow Engine is stateless. It reads the event, writes job records to the database, and moves on. If it crashes mid-processing, the event is re-consumed from Kafka and the engine re-creates the same jobs (idempotent via idempotency_key = hash(event_id, workflow_file, commit_sha)).
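Both the idempotency key and matrix expansion are a few lines each. A sketch of the two (SHA-256 as the hash is an assumption; the text only specifies the inputs):

```python
import hashlib
import itertools

def idempotency_key(event_id, workflow_file, commit_sha):
    """Deterministic key: re-consuming the same Kafka event after a crash
    re-creates the same jobs instead of duplicates."""
    raw = f"{event_id}:{workflow_file}:{commit_sha}".encode()
    return hashlib.sha256(raw).hexdigest()

def expand_matrix(matrix):
    """{os: [...], node: [...]} -> one concrete job per combination."""
    keys = sorted(matrix)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(matrix[k] for k in keys))]

jobs = expand_matrix({"os": ["ubuntu", "macos"], "node": [18, 20]})
print(len(jobs))  # 4
print(idempotency_key("evt-1", "ci.yml", "abc123")
      == idempotency_key("evt-1", "ci.yml", "abc123"))  # True: stable across retries
```

The database enforces uniqueness on the key, so a re-consumed event's INSERT becomes a no-op conflict rather than a duplicate job.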
8.2 Job Scheduler (Stateless, CDC-Based)
The scheduler is the brain of the system. It determines which jobs are ready to run and places them in the appropriate queue.
The CDC Pattern (from real interview feedback, Story 2):
Instead of the scheduler actively polling the database for ready jobs, use Change Data Capture:
Workflow Engine creates all steps with status PENDING in the database
Scheduler queues the first step (or all root jobs in DAG mode)
When a step completes, the database triggers a CDC event
CDC event is consumed by the scheduler, which evaluates whether the next step is now ready
If ready, scheduler queues the next step
Why CDC instead of polling?
Polling at 58K state transitions/sec requires aggressive database queries
CDC is event-driven: zero wasted reads
Decouples the scheduler from the database's read capacity
The scheduler becomes truly stateless β its entire state is "read CDC events, write to queue"
Why stateless?
If the scheduler crashes, restart it. It reads from the CDC stream (which has a durable offset) and resumes.
No in-memory state to lose. No failover election. No split-brain.
Multiple scheduler instances can run concurrently, partitioned by tenant or event stream partition.
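The "pure function from CDC events to queue operations" framing can be made literal. A sketch for linear execution (the event shape and operation tuples are illustrative, modeled loosely on Debezium-style change events):

```python
def on_cdc_event(event, step_index):
    """Stateless scheduling: CDC event in, queue operations out.
    `step_index` maps job_id -> ordered step ids (linear execution).
    No instance state: a crashed scheduler restarts from the durable CDC offset
    and re-derives the same operations."""
    ops = []
    if event["table"] == "steps" and event["new"]["status"] == "COMPLETED":
        steps = step_index[event["new"]["job_id"]]
        pos = steps.index(event["new"]["step_id"])
        if pos + 1 < len(steps):
            ops.append(("enqueue", steps[pos + 1]))      # next step is now ready
        else:
            ops.append(("mark_job_complete", event["new"]["job_id"]))
    return ops

index = {"job-1": ["s1", "s2", "s3"]}
evt = {"table": "steps",
       "new": {"job_id": "job-1", "step_id": "s1", "status": "COMPLETED"}}
print(on_cdc_event(evt, index))  # [('enqueue', 's2')]
```

Because the function is deterministic, re-processing an event after a crash produces the same enqueue, and the queue-side idempotency (or the conditional UPDATE on the step row) absorbs the duplicate.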
Staff+ Signal: The stateless scheduler with CDC is the pattern that separates L5 from L6 answers. An L5 candidate designs a scheduler that polls the database. An L6 candidate uses CDC to make the scheduler event-driven and stateless. The key phrase: "The scheduler has no state of its own; it's a pure function from CDC events to queue operations."
8.3 Worker/Runner (K8s Pods + Docker)

The worker is the component that actually executes code. In GitHub Actions, this is the open-source runner binary, which consists of two processes:
The Listener (long-lived orchestrator):
Session management: Creates a session with the Broker API, sending runner agent ID, name, version, and OS info
Message polling: Long-polls GET /message?sessionId=X&status=Online with a 50-second timeout. Returns a RunnerJobRequest or 202 Accepted (no work)
Job acquisition: Extracts runner_request_id, calls POST /acquirejob on the Run Service. Has ~2 minutes to claim the job (the lease window)
Lock renewal (heartbeat): A background task calls POST /renewjob every 60 seconds. Lock TTL is 10 minutes. If the heartbeat stops, the job is considered abandoned.
The Worker (short-lived, spawned per job):
Checkout: Download repository at the specified commit SHA
Action resolution: For uses: steps, download the action from GitHub or Docker Hub
Step execution: Run shell commands or action entrypoints. Inject environment variables and decrypted secrets. Mask secrets in log output.
Log streaming: Capture stdout/stderr and stream to the Log Service in real time (batched every 100ms)
Artifact upload: Handle actions/upload-artifact by uploading to an S3-compatible artifact store
Output variables: Monitor the $GITHUB_OUTPUT file for variables passed to subsequent steps
Kubernetes-Based Execution (production pattern):
For managed runners at scale, each job runs in an ephemeral K8s pod:
Pod spec is generated from the runs-on: label (maps to a node pool with matching resources)
Pod lifecycle: create → pull image → execute steps → upload logs → terminate
Resource limits enforced by K8s: CPU, memory, disk, network
Each pod gets a unique workspace volume (ephemeral storage or PVC)
Pod is deleted after job completion: perfect isolation, no state leaks
8.4 Real-Time Status Service (SSE/WebSocket)
Developers expect to watch their builds in real time. This requires a streaming infrastructure:
Architecture:
SSE for log streaming:
WebSocket for status updates:
Key design decisions:
Backpressure: Server-side buffering batches log lines (100ms or 100 lines, whichever first) to prevent overwhelming the browser
Reconnection: SSE has built-in reconnection. The client sends a Last-Event-ID header to resume from where it left off
Fan-out: Multiple users watching the same build share a single log subscription (pub/sub at the gateway layer)
Persistence: Logs are simultaneously streamed to S3 for permanent storage and served from a time-windowed buffer for live viewing
9. Exactly-Once Execution Deep Dive
This is the core question in 3 out of 5 interview stories for CI/CD system design. Dedicate 5 minutes of your interview to this.
Why It's Hard
The Solution: Lease-Based Acquisition + Idempotent Execution
1. Lease-based job acquisition (prevents concurrent execution):
2. Heartbeat renewal (extends the lease while working):
3. Lease expiry (detects dead workers):
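Mechanisms 1-3 share one table and three conditional UPDATEs. A minimal sketch using SQLite (the 10-minute TTL and 60-second heartbeat come from the text; the schema and a single hard-coded job id are illustrative):

```python
import sqlite3
import time

LEASE_TTL = 600  # 10-minute lock TTL, renewed by a 60-second heartbeat

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT,
              worker TEXT, lease_expires REAL)""")
db.execute("INSERT INTO jobs VALUES ('j1', 'QUEUED', NULL, NULL)")

def acquire(worker, now):
    """Lease acquisition: the conditional UPDATE matches the QUEUED row at
    most once, so only one worker's claim succeeds."""
    cur = db.execute(
        "UPDATE jobs SET status='RUNNING', worker=?, lease_expires=? "
        "WHERE id='j1' AND status='QUEUED'", (worker, now + LEASE_TTL))
    return cur.rowcount == 1

def renew(worker, now):
    """Heartbeat: extend the lease only while this worker still owns the job."""
    cur = db.execute(
        "UPDATE jobs SET lease_expires=? "
        "WHERE id='j1' AND worker=? AND status='RUNNING'",
        (now + LEASE_TTL, worker))
    return cur.rowcount == 1

def reap(now):
    """Lease expiry: any RUNNING job whose lease lapsed is re-queued."""
    cur = db.execute(
        "UPDATE jobs SET status='QUEUED', worker=NULL "
        "WHERE status='RUNNING' AND lease_expires < ?", (now,))
    return cur.rowcount

t = time.time()
print(acquire("worker-A", t))        # True:  A claims the job
print(acquire("worker-B", t))        # False: job is no longer QUEUED
print(reap(t + LEASE_TTL + 1))       # 1:     A's heartbeat stopped, job re-queued
print(acquire("worker-B", t + 601))  # True:  B picks it up
```

Note that the reaper is itself just another conditional UPDATE, so multiple reaper instances can run concurrently without double re-queueing a job.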
4. Idempotency keys for step execution:
Each step execution gets a deterministic idempotency key:
For deployment steps, this key is passed to the deployment system, which checks:
Has this exact deployment already been applied?
If yes, return the cached result without re-executing
If no, execute and record the key
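The check-then-execute contract above is small enough to show directly. A sketch with an in-memory dict standing in for the durable key store (names are illustrative):

```python
applied = {}   # in production: a durable table keyed by idempotency key

def execute_once(key, action):
    """At-least-once delivery + idempotent execution.
    Re-running a step whose key is already recorded returns the cached
    result instead of producing the side effect again."""
    if key in applied:
        return applied[key], False       # cached: no side effect
    result = action()
    applied[key] = result                # record the result under the key
    return result, True

deploys = []
deploy = lambda: deploys.append("v42") or "deployed v42"

print(execute_once("run-7:step-3", deploy))  # ('deployed v42', True)
print(execute_once("run-7:step-3", deploy))  # ('deployed v42', False)
print(len(deploys))  # 1 -- the deployment side effect happened once
```

The remaining gap is a crash between `action()` and recording the key; closing it requires the deployment target itself to be idempotent, which is exactly the cross-system constraint discussed in the failure-modes section.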
What "Exactly-Once" Really Means
Exactly-once execution is impossible in a distributed system (a consequence of the Two Generals Problem). What we actually implement:
The State Machine That Makes It Work
Staff+ Signal: The 10-minute heartbeat TTL is a critical design trade-off. Shorter TTL (2 min) detects dead workers faster but causes false positives during GC pauses or network blips. Longer TTL (30 min) wastes time when workers genuinely crash. GitHub Actions uses 10 minutes. The right answer in an interview is to state the trade-off, pick a number, and explain when you'd adjust it.
10. Database Design Deep Dive
Core Tables (PostgreSQL)
Step State Machine with CDC
The key insight: instead of the scheduler polling for ready steps, use CDC to react to state changes.
CDC tool: Debezium connected to PostgreSQL's WAL (logical replication). Publishes step status changes to Kafka. Scheduler consumes from this topic.
Why this matters: At 58K step transitions/sec, polling the database every 100ms would generate 10 queries/sec per scheduler instance. With 100 scheduler instances, that's 1,000 queries/sec just for scheduling: wasted read capacity. CDC makes it zero.
Indexing Strategy and Query Patterns
Hot-path queries (>10K QPS β must be cached or indexed):
Warm-path queries (100-1K QPS):
Sharding Strategy
At 1B jobs/day, a single PostgreSQL instance can't handle the write volume. Shard by tenant_id:
Staff+ Signal: The partial indexes on jobs are critical for scheduler performance. Without the WHERE status = 'QUEUED' filter, the index includes all 1B+ historical jobs. With the partial index, it only includes the ~100K currently queued jobs: orders of magnitude smaller, fitting entirely in memory. This is the difference between a 1ms and a 100ms scheduler query.
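The partial-index idea is easy to demonstrate. A sketch using SQLite, whose partial-index syntax matches PostgreSQL's for this case (table and index names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, tenant_id TEXT, status TEXT)")

# Partial index: covers only the small QUEUED working set,
# not the billions of historical COMPLETED/FAILED rows.
db.execute("""CREATE INDEX idx_jobs_queued ON jobs (tenant_id, id)
              WHERE status = 'QUEUED'""")

db.executemany("INSERT INTO jobs (tenant_id, status) VALUES (?, ?)",
               [("t1", "COMPLETED")] * 5 + [("t1", "QUEUED")] * 2)

# The scheduler's hot-path query implies the index predicate, so the
# planner can satisfy it from the tiny partial index.
plan = db.execute("""EXPLAIN QUERY PLAN
    SELECT id FROM jobs WHERE status = 'QUEUED' AND tenant_id = 't1'""").fetchall()
print(any("idx_jobs_queued" in row[-1] for row in plan))  # True
```

The same CREATE INDEX statement, pointed at PostgreSQL, is what keeps the scheduler's ready-job lookup in memory.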
11. The Scaling Journey
Stage 1: Single Runner (0-1K jobs/day)
The Phase 1 architecture. One machine, one runner binary, SQLite for state.
Capability: Sequential job execution. No parallelism. No isolation. Limit: One job at a time. One machine's resources.
Stage 2: Multiple Runners + Queue (1K-100K jobs/day)
New capabilities at this stage:
Multiple runners poll from a shared Redis queue
Pull-based model: runners self-register and poll for work
Basic heartbeat (60s) with lease renewal
PostgreSQL replaces SQLite for shared state
Ephemeral containers (Docker) for job isolation
Limit: Single queue becomes bottleneck. No tenant isolation. Single PostgreSQL node limits write throughput.
Stage 3: K8s-Based with Autoscaling (100K-10M jobs/day)

New capabilities at this stage:
Kafka for event ingestion (durability, replay, partitioning)
CDC-based stateless scheduler (no polling)
K8s pods for job execution (ephemeral, auto-scaled)
Webhook-driven autoscaling: workflow_job.queued events trigger pod creation
Runner pools segmented by label (capability-based routing)
Per-tenant queue partitioning begins
Runner Pool Architecture (enterprise pattern):
| Pool | Label | Instance type | Use case |
|---|---|---|---|
| General Linux | ubuntu-latest | c5.2xlarge spot | Most CI jobs |
| ARM Builds | self-hosted-arm64 | m6g.2xlarge | ARM container images |
| GPU Testing | gpu-runner | p3.2xlarge | ML model training/testing |
| Secure Builds | sox-compliant | Isolated VPC | Compliance-sensitive builds |
| Large Monorepo | xlarge | c5.4xlarge, 500GB SSD | Builds requiring > 14GB RAM |
Autoscaling algorithm:
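One way to sketch the scaling target: recompute a desired fleet size from queue depth on every queue-changing webhook, rather than from lagging CPU metrics. All thresholds and the jobs-per-runner factor below are illustrative:

```python
import math

def desired_runners(queue_depth, running_jobs, jobs_per_runner=1,
                    min_runners=2, max_runners=500):
    """Event-driven autoscaling target, called on every (hypothetical)
    workflow_job.queued / completed webhook. Sized from queue depth,
    so it reacts in seconds instead of waiting for CPU metrics."""
    needed = math.ceil((queue_depth + running_jobs) / jobs_per_runner)
    return max(min_runners, min(max_runners, needed))

print(desired_runners(queue_depth=0, running_jobs=1))      # 2   (warm floor)
print(desired_runners(queue_depth=120, running_jobs=80))   # 200
print(desired_runners(queue_depth=10_000, running_jobs=0)) # 500 (cost ceiling)
```

Scale-down should be slower than scale-up (e.g., only after runners sit idle for a few minutes) to avoid thrashing on bursty merge-queue traffic.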
Limit: Single-region. Single PostgreSQL cluster. Need dedicated ops team for Kafka + K8s.
Stage 4: Multi-Tenant Enterprise (10M+ jobs/day)
Everything in Stage 3, plus:
Database sharding by tenant_id (16 → 256 shards)
Multi-region deployment with geo-routed job queues
Dedicated runner pools for enterprise tenants (SLA-backed)
Spot/preemptible instances for 60-90% cost savings on non-critical jobs
Pre-baked VM images with runner binary + common tools pre-installed (boot-to-first-step < 15 seconds)
Multi-level caching: dependency cache (per repo), Docker layer cache (per runner pool), build cache (Bazel/Gradle)
Staff+ Signal: Autoscaling must be event-driven, not metric-driven. Webhook-based autoscaling (scaling on workflow_job.queued events) reacts in seconds. Metric-based autoscaling (watching CPU) reacts in minutes. For bursty CI workloads (merge queues, business-hours spikes, monorepo matrix expansions), the difference between 5-second and 5-minute scaling is the difference between "builds are fast" and "builds are queued."
12. Failure Modes & Resilience

Failure Scenarios
| Failure | Detection | Recovery | Blast radius |
|---|---|---|---|
| Worker crash mid-job | Heartbeat timeout (10 min) | Lease expires → job re-queued → another worker picks up | Single job; idempotency keys prevent double execution |
| Worker crash mid-deploy step | Heartbeat timeout | Re-queued; deploy step checks idempotency key before re-executing | Single job; deployment system must support idempotent deploys |
| Scheduler crash | Process monitor / K8s restart | Stateless restart; resumes from CDC stream offset | Zero: scheduler has no state. Brief scheduling delay while restarting. |
| Database failure (PostgreSQL) | Connection pool errors, health checks | Failover to synchronous replica (RDS Multi-AZ) | New job scheduling pauses; running jobs continue unaffected (workers hold the job payload in memory) |
| Kafka broker failure | ISR replication; consumer lag alerts | Automatic leader election; producers retry | Brief event ingestion delay; no data loss if replication factor ≥ 3 |
| Log streaming service down | Health checks; SSE connection drops | Workers buffer logs locally; flush to S3 on reconnection | Developers can't watch live logs, but builds continue normally |
| Network partition (worker ↔ control plane) | Heartbeat fails to reach server | Worker continues executing (optimistic); if the lease expires before reconnection, the job is re-queued | Possible duplicate execution, mitigated by idempotency keys |
| Poison YAML (workflow causes OOM) | Container OOM-killed signal | K8s restarts pod; job marked FAILED with OOM error | Single job in single tenant; K8s resource limits contain the blast radius |
Worker Crash Recovery Flow
Multi-Tenancy Isolation
At enterprise scale, isolation is the difference between a P1 and a P4 incident:
VM-level isolation: Each GitHub-hosted job runs in a fresh VM destroyed after use. No state leaks between jobs.
Network isolation: Self-hosted runners can be placed in private VPCs with restricted egress (SOX/HIPAA).
Secret scoping: Secrets are scoped to repository, environment, or organization level. Encrypted at rest, injected at runtime, masked in logs.
Runner groups: Organizations create runner groups with access control β restricting which repositories can use which runner pools.
Resource limits: K8s enforces CPU, memory, and disk limits per pod. One tenant's build can't starve another.
Staff+ Signal: The most dangerous failure mode is the "silent success." Worker A deploys to production, then crashes before recording success. Worker B re-deploys. If the deployment isn't idempotent (e.g., it runs database migrations), you get data corruption. The fix isn't in the CI/CD system; it's in requiring deployment targets to support idempotent operations. This is a cross-system design constraint that most candidates miss.
13. Observability & Operations
Key Metrics
cicd_jobs_queued_total{tenant, runner_label}: inbound job rate; primary scaling signal
cicd_jobs_running_gauge{runner_label}: current concurrent jobs; capacity indicator
cicd_job_wait_time_seconds{tenant, quantile}: time from QUEUED to RUNNING; the metric developers care about most
cicd_job_duration_seconds{tenant, quantile}: execution time; helps identify slow builds
cicd_step_failures_total{tenant, step_name}: step failure rate; catches flaky tests
cicd_lease_expirations_total: dead-worker detection; spikes mean infrastructure issues
cicd_queue_depth{runner_label}: pending jobs per label; the primary autoscaling signal
cicd_estimated_drain_time_seconds{runner_label}: (queue_depth × avg_duration) / workers; the best single metric for system health
cicd_scheduler_cdc_lag_seconds: CDC consumer lag; if it grows, the scheduler is falling behind
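The drain-time formula deserves a worked example, since it is the one derived metric in the list:

```python
def estimated_drain_time_seconds(queue_depth, avg_duration_sec, workers):
    """cicd_estimated_drain_time_seconds = (queue_depth x avg_duration) / workers.
    Roughly: how long until the current backlog clears at current capacity."""
    return (queue_depth * avg_duration_sec) / workers

# 1,200 queued jobs averaging 3 minutes, across 40 runners:
print(estimated_drain_time_seconds(1200, 180, 40))  # 5400.0 seconds (~90 min backlog)
```

This single number folds queue depth, build duration, and fleet size into one alertable quantity, which is why the alerting table below keys its P2 on drain time rather than raw queue depth.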
Distributed Tracing
A full trace for a single workflow run:
Alerting Strategy
| Alert | Condition | Severity | Action |
|---|---|---|---|
| Queue drain time > 5 min | Sustained for 10 min | P2 | Scale runners; investigate slow builds |
| Lease expirations > 10/min | Spike detection | P1 | Worker fleet issue; check node health, OOM kills |
| Job wait time p99 > 60s | Sustained for 5 min | P2 | Scale runners for the affected label |
| CDC lag > 30s | Growing trend | P1 | Scheduler is falling behind; scale scheduler instances |
| Step failure rate > 20% | Platform-wide for 15 min | P1 | Possible infrastructure issue (Docker registry down, network) |
| Database replication lag > 5s | Any replica | P2 | Investigate replica health; may affect read-after-write consistency |
Staff+ Signal: The most actionable dashboard isn't per-metric; it's a tenant-level heatmap showing job wait time × failure rate. A single red cell means one tenant has a broken pipeline (their problem). A full red column means a platform-wide outage (your problem). This distinction determines whether you page the customer or yourself.
14. Design Trade-offs
| Decision | Option A | Option B | Choice | Rationale |
|---|---|---|---|---|
| Runner model | Pull-based (runners poll for work) | Push-based (server dispatches to runners) | Pull-based | Runners self-register; no server-side runner registry needed. Firewall-friendly (outbound connections only). Horizontal scaling is trivial. Trade-off: 0-50s polling latency. |
| Execution environment | Ephemeral (fresh VM/container per job) | Persistent (long-running runners) | Ephemeral for hosted; persistent for self-hosted | Ephemeral gives perfect isolation and security. Persistent gives cache locality and instant startup. Use ephemeral by default; allow persistent for specialized hardware. |
| Job dependencies | Linear (sequential) | DAG (needs: keyword) | Linear first, DAG as extension | Linear is simpler to implement and debug. Most workflows are effectively linear. DAG adds topological sort, cycle detection, and complex failure propagation. Ship linear, iterate to DAG. |
| Step progression | Polling scheduler | CDC-based scheduler | CDC | CDC is event-driven (zero wasted reads) and makes the scheduler stateless. Polling works at small scale but wastes DB read capacity at 58K transitions/sec. |
| Scheduling | Centralized scheduler | Distributed scheduling | Centralized with HA | CI scheduling doesn't need stock-exchange throughput. A well-replicated centralized scheduler handles the load and is simpler to reason about. |
| Workflow definition | Declarative YAML | Imperative code (Go/Python) | YAML | Easy to learn, lint, validate, and visualize. Limited expressiveness is offset by composable actions (the "standard library"). Code-based definitions (like Dagger) offer more power but harder governance. |
| Log storage | PostgreSQL | S3 + streaming layer | S3 + streaming | 100 TB/day of logs rules out PostgreSQL. S3 for persistence, SSE for live streaming. Buffer recent logs in Redis for fast retrieval. |
Staff+ Signal: The pull vs. push decision is the most consequential architectural choice. Pull-based wins for cloud-native, auto-scaled environments because it decouples the control plane from the worker fleet. The server doesn't need to know which workers are online: it just puts jobs in queues. Workers that poll get work; workers that don't, don't. This makes horizontal scaling, failure recovery, and multi-tenancy almost free. The only cost is latency (0-50s poll interval), which is acceptable for CI/CD.
15. Common Interview Mistakes
Starting with DAG when the interviewer wants linear: "First, I'll build a topological sort for the job dependency graph..." The interviewer explicitly said jobs run sequentially; you just spent 5 minutes on something they told you to skip. Staff+ answer: "I'll design for linear execution first. Each job runs after the previous one completes. The state machine extends naturally to DAG by adding a needs: field; I can show that if we have time."
Ignoring exactly-once execution: "The worker picks up a job and runs it." What happens when the worker crashes? What if two workers pick up the same job? This is THE core question. Staff+ answer: lease-based acquisition with heartbeat, optimistic locking on state transitions, idempotency keys on steps.
No multi-tenancy from the start: "All jobs go in one queue." One tenant's 10,000-job matrix expansion blocks everyone else's single-job PR check. Staff+ answer: queues partitioned by runner label, with per-tenant fair scheduling or priority tiers.
Not mentioning Kubernetes/Docker: "Jobs run on the machine." How are they isolated? How do you limit CPU/memory? How do you prevent a malicious workflow from reading another tenant's secrets? Staff+ answer: ephemeral K8s pods with resource limits and network policies, destroyed after completion.
No real-time status UI: "Results are available after the workflow completes." Developers stare at CI for minutes; they need live log streaming. Staff+ answer: SSE for log streaming, WebSocket for status updates, with backpressure handling.
Waiting to be asked about crash recovery: "If the worker crashes... hmm, I hadn't thought about that." Bring this up proactively; it's the whole point of the question. Staff+ answer: "The lease mechanism handles worker crashes. Let me walk through the timeline: worker acquires job, starts heartbeat, crashes at step 3, lease expires after 10 minutes, scheduler re-queues, another worker picks it up, idempotency keys prevent double-execution of completed steps."
Designing a monolithic scheduler with in-memory state: "The scheduler keeps track of all running jobs in a HashMap." When this process crashes, all state is lost. Staff+ answer: stateless scheduler that reads from the CDC stream. All state lives in the database; the scheduler can crash and restart without losing anything.
16. Interview Cheat Sheet
Time Allocation (45-minute interview)
| Phase | Time | Focus |
|---|---|---|
| Clarify requirements | 3 min | Linear vs DAG, scale (1B jobs/day), isolation model, real-time requirements |
| REST API design | 5 min | Trigger endpoint (202 Accepted), job status, log streaming (SSE), webhook payload |
| High-level design | 10 min | Event bus → workflow engine → stateless scheduler (CDC) → job queues → K8s runner pods |
| Deep dive: exactly-once | 7 min | Lease-based acquisition, heartbeat + TTL, optimistic locking, idempotency keys, state machine |
| Deep dive: database + scaling | 10 min | Schema (workflows/jobs/steps), CDC for step progression, sharding by tenant_id, partial indexes |
| Failure modes + trade-offs | 7 min | Worker crash → lease expiry → re-queue; scheduler crash → stateless restart; pull vs push |
| Wrap-up | 3 min | Scaling journey summary, what I'd build next (DAG, caching, multi-region) |
Step-by-Step Answer Guide
Clarify: "Are jobs linear or DAG-based? I'll start with linear. How many jobs/day? 1 billion = ~12K/sec average, ~40K peak. Do we need real-time log streaming?"
REST API: Show the trigger endpoint (POST /workflows/{id}/dispatches → 202 Accepted with run_id), job status (GET /runs/{run_id}/jobs), and log streaming (GET /jobs/{job_id}/logs/stream via SSE).
Key insight: "The hard part isn't running shell commands; it's guaranteeing every job executes exactly once even when workers crash."
Single machine: One runner, sequential execution, SQLite. Works under 1K jobs/day.
Prove it fails: "58K step transitions/sec overwhelms a single database. One tenant's matrix expansion blocks everyone. A crashed worker leaves a job in limbo: did it complete? Do we re-run?"
Distributed architecture: Event bus (Kafka) → stateless workflow engine → CDC-based scheduler → job queues (partitioned by label) → K8s runner pods. Show the mermaid diagram.
Exactly-once deep dive: "The worker acquires the job via a conditional UPDATE with optimistic locking. A heartbeat every 60s extends the 10-min lease. If the worker crashes, the lease expires and the reaper re-queues. Steps have idempotency keys; re-execution of completed steps is a no-op."
Database design: Show the schema with partial indexes. Explain CDC from PostgreSQL WAL via Debezium. Sharding by tenant_id for write distribution.
Failure handling: Walk through worker crash timeline. Scheduler crash is a non-event (stateless, restarts from CDC offset). DB failover via synchronous replication.
Scaling journey: Single runner → runner pool + Redis queue → K8s + Kafka + CDC → sharded multi-region.
Trade-offs: Pull vs push (pull wins for scale), linear vs DAG (linear first), CDC vs polling (CDC at scale), ephemeral vs persistent runners.
What the Interviewer Wants to Hear
At L5/Senior: Pull-based runners, job queue, basic heartbeat, container isolation. Mentions exactly-once as a concern.
At L6/Staff: Stateless scheduler with CDC, lease-based acquisition with optimistic locking, idempotency keys for steps, partial indexes, sharding by tenant_id. References how GitHub Actions actually works (Broker API, 10-min TTL, runner binary architecture).
At L7/Principal: Organizational boundaries (who owns the workflow engine vs. scheduler vs. runner fleet), multi-region job routing, tenant-level SLA design with blast radius guarantees, migration path from linear to DAG scheduling, cost modeling (spot instances, pre-baked images, cache hit rates).

Advanced Extension: DAG Scheduling
Once you've covered linear execution, extend to DAG with the needs: keyword:
The DAG scheduler extends the linear model:
Root jobs (no needs:) are queued immediately
Completion callback: When a job completes, iterate all jobs listing it in needs:. If all parents have completed, queue the downstream job.
Failure propagation: If a parent fails, downstream jobs are skipped (not failed). if: always() overrides this.
Matrix expansion: A matrix job like {os: [ubuntu, macos]} expands into 2 jobs. needs: [build] waits for all matrix instances.
The state machine is identical; only the "when to queue next" logic changes. This is why designing for linear first is correct: DAG is a natural extension, not a rewrite.
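The completion callback is the only new code the DAG extension needs. A minimal sketch (data shapes are illustrative; failure propagation and matrix fan-in are omitted for brevity):

```python
def on_job_complete(completed, jobs, done):
    """DAG extension of the linear completion callback: after `completed`
    finishes, queue every job whose needs: list is now fully satisfied.
    `jobs` maps job -> list of parent jobs; `done` is the set of completed jobs."""
    done.add(completed)
    return [job for job, needs in jobs.items()
            if completed in needs                 # only re-check affected children
            and all(parent in done for parent in needs)
            and job not in done]

dag = {"build": [], "test": ["build"], "lint": ["build"], "deploy": ["test", "lint"]}
done = set()
print(on_job_complete("build", dag, done))  # ['test', 'lint']
print(on_job_complete("test", dag, done))   # []  (deploy still waits on lint)
print(on_job_complete("lint", dag, done))   # ['deploy']
```

Root jobs are simply those whose needs: list is empty, queued before the first callback ever fires; the lease and state-machine mechanics are untouched.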
Written as a reference for staff-level system design interviews. The architectural patterns described here apply beyond CI/CD to any distributed task orchestration system β job schedulers, data pipelines, deployment platforms, and batch processing systems.