System Design: CI/CD Pipeline Like GitHub Actions

From a Single Runner to Enterprise-Scale Distributed Execution: A Staff Engineer's Guide


1. The Problem & Why It's Hard

You're asked to design a CI/CD execution system like GitHub Actions. When a developer pushes code, the system must parse a workflow definition (YAML), execute a series of steps (build, test, deploy) on isolated compute, and report results in real time. Millions of workflow runs per day across hundreds of thousands of repositories.

On the surface, it's "just run some shell commands when code is pushed." The trap is thinking the hard part is execution. The hard part is guaranteeing that every job runs exactly once, even when workers crash, networks partition, and thousands of jobs are queued simultaneously.

The interviewer's real question: Can you design a multi-tenant job execution system that guarantees exactly-once semantics, scales horizontally, and recovers gracefully from worker crashes, all while explaining the state machine that makes it work?

The interviewer isn't asking you to build GitHub Actions. They're testing whether you understand:

  1. How to turn a declarative workflow into an executable plan

  2. How to distribute that plan across unreliable workers

  3. How to ensure no job is skipped or double-executed

  4. How to provide real-time feedback to users watching their builds

Staff+ Signal: "Exactly-once execution" is mentioned in 3 out of 5 interview stories for this question. Most candidates hand-wave it. The staff-level answer is: exactly-once is impossible in a distributed system; what you actually build is at-least-once delivery + idempotent execution. The lease-based job acquisition pattern (heartbeat + TTL) is the mechanism. If you can explain why this works and where it breaks, you've answered the core question.


2. Clarifying Questions

Before drawing a single box, ask these questions. Each one narrows the design space and shows the interviewer you think before you build.

Questions to Ask

| Question | Why It Matters | What Changes |
|---|---|---|
| "Are jobs in a workflow independent, or do they have dependencies?" | Determines whether you need a DAG scheduler or a simple linear executor | Most interviewers say linear first: jobs run sequentially. This is a gift, since it simplifies scheduling enormously. Offer DAG as an extension. |
| "How many concurrent workflows do we need to support?" | Drives queue depth, worker fleet size, and database throughput | If the answer is "many customers, like GitHub", design for multi-tenancy from the start |
| "Do users need real-time log streaming, or is post-completion sufficient?" | Determines whether you need WebSocket/SSE infrastructure | Real-time is almost always required; it's table stakes for CI/CD UX |
| "What's the isolation model? Containers? VMs? Bare metal?" | Affects security boundaries and startup latency | Docker containers for cost efficiency, ephemeral VMs for security-critical workloads |
| "Do we need to support self-hosted runners, or only managed compute?" | Determines whether runners pull work or the system pushes to them | Pull-based is the industry standard; it's firewall-friendly and scales independently |
| "What's the maximum workflow duration?" | Affects heartbeat TTL and resource reservation | GitHub Actions caps at 6 hours. This bounds your lease duration. |

The Key Simplification

Interviewers often say: "Start with linear execution β€” jobs run one after another, not as a complex graph." This is intentional. They want to see:

  1. You can build a working system with simple sequential execution

  2. You know that DAG scheduling is the natural extension

  3. You don't over-engineer from the start

Staff+ Signal: The best candidates ask about linear vs. DAG upfront, then say: "I'll design for linear execution first, then show how the same state machine extends to DAG with the needs: keyword." This demonstrates progressive complexity, the hallmark of a senior system designer.


3. Requirements & Scope

Functional Requirements

  • Workflow registration: Users define workflows in YAML (trigger events, jobs, steps)

  • Event-driven triggering: git push, pull request, schedule (cron), manual dispatch

  • Job execution: Run steps (shell commands, Docker containers, reusable actions) in isolated environments

  • Step sequencing: Steps within a job execute sequentially, sharing a filesystem

  • Job dependencies: Support needs: for job ordering (linear first, DAG as extension)

  • Real-time status: Live log streaming and status updates (pending → running → success/failed)

  • Artifact management: Upload/download build artifacts between jobs

  • Secret management: Encrypted secrets injected at runtime, masked in logs

  • Multi-tenancy: Complete isolation between organizations/repositories

Non-Functional Requirements

| Requirement | Target | Rationale |
|---|---|---|
| Job pickup latency (p99) | < 30s from trigger event | Developer experience: fast feedback loop |
| Throughput | 1 billion jobs/day | GitHub Actions scale (millions of repos, matrix expansion) |
| Availability | 99.95% | CI/CD is critical but not user-facing; brief queuing delays are acceptable |
| Job execution guarantee | Exactly-once semantics | No skipped jobs, no double deployments |
| Max workflow duration | 6 hours | Bound resource consumption; force efficient pipelines |
| Log delivery latency | < 2s from step output | Real-time streaming is expected by developers |

Scale Estimation (Back-of-Envelope)

Why these numbers matter for the interview:

  • 2.1M concurrent jobs means you cannot schedule from a single machine; you need distributed, stateless schedulers

  • 58K DB writes/sec for step state transitions means PostgreSQL needs sharding, and the scheduler should consume changes via CDC (Change Data Capture) instead of polling the hot tables

  • 100 TB/day of logs rules PostgreSQL out for log storage; you need object storage (S3) with a streaming layer

  • 7.2M peak containers means autoscaling must be event-driven (not metric-based) to react in seconds

Staff+ Signal: The derived number most candidates miss is concurrent running jobs. They estimate throughput (jobs/sec) but forget to multiply by duration. At a 3-minute average duration, 11,600 jobs/sec means 2.1 million jobs running simultaneously. This single number drives fleet sizing and database connection pooling, and it determines whether your scheduler can be centralized or must be distributed.
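The figures above can be reproduced with a quick back-of-envelope script. The concurrency number is Little's law (L = λ × W); the peak factor, writes-per-job, and log-volume-per-job inputs are assumptions chosen to roughly match the figures in this section:

```python
JOBS_PER_DAY = 1_000_000_000     # stated target throughput
AVG_DURATION_S = 3 * 60          # 3-minute average job duration (from the text)
PEAK_FACTOR = 3.5                # assumed burst multiplier (hypothetical)
WRITES_PER_JOB = 5               # assumed state-transition writes per job (hypothetical)
LOG_BYTES_PER_JOB = 100_000      # assumed ~100 KB of logs per job (hypothetical)

jobs_per_sec = JOBS_PER_DAY / 86_400                # average arrival rate
concurrent_jobs = jobs_per_sec * AVG_DURATION_S     # Little's law: L = lambda * W
peak_containers = jobs_per_sec * PEAK_FACTOR * AVG_DURATION_S
db_writes_per_sec = jobs_per_sec * WRITES_PER_JOB
log_tb_per_day = JOBS_PER_DAY * LOG_BYTES_PER_JOB / 1e12  # decimal TB

print(f"jobs/sec (avg):  {jobs_per_sec:,.0f}")         # ~11,574
print(f"concurrent jobs: {concurrent_jobs / 1e6:.1f}M")  # ~2.1M
print(f"peak containers: {peak_containers / 1e6:.1f}M")  # ~7.3M
print(f"DB writes/sec:   {db_writes_per_sec / 1e3:.0f}K")  # ~58K
print(f"log volume:      {log_tb_per_day:.0f} TB/day")     # 100 TB/day
```

Changing any single input (say, a 10-minute average duration) shifts the concurrency number proportionally, which is exactly why stating the inputs out loud matters in the interview.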


4. REST API Design

Workflow Registration & Triggering

Workflow Runs & Jobs

Real-Time Log Streaming

Webhook Payload (Git Push Trigger)
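The sections above can be summarized as a route table. The paths and verbs below are assumptions modeled loosely on GitHub's public API; only the dispatch, job-status, and SSE-stream endpoints are referenced elsewhere in this document:

```python
# (method, path, purpose) -- hypothetical endpoint shapes, not a real API spec.
ROUTES = [
    ("POST", "/v1/repos/{repo}/workflows", "register or update a workflow"),
    ("POST", "/v1/workflows/{id}/dispatches", "manual trigger; returns 202 + run_id"),
    ("GET",  "/v1/runs/{run_id}", "workflow run status"),
    ("GET",  "/v1/runs/{run_id}/jobs", "jobs in a run (cursor-paginated)"),
    ("GET",  "/v1/jobs/{job_id}/logs/stream", "live logs via SSE"),
    ("POST", "/v1/webhooks/github", "inbound push / pull-request events"),
]

for method, path, purpose in ROUTES:
    print(f"{method:5s} {path:40s} {purpose}")
```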

Key API Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Trigger response | 202 Accepted with run_id | Workflow execution is async; don't block the push |
| Log streaming | SSE over WebSocket | Simpler, HTTP-compatible, works through proxies. WebSocket for bidirectional status updates. |
| Pagination | Cursor-based | Job lists grow; offset pagination degrades on large datasets |
| Idempotency | Idempotency-Key header on POST endpoints | Prevents duplicate workflow triggers on network retries |
| Versioning | /v1/ path prefix | One-way door: webhook consumers depend on these formats |

Staff+ Signal: The log streaming endpoint is the most technically interesting. You're streaming output from a container running on a worker machine, through a log aggregation service, to an SSE endpoint the browser consumes. The challenge is backpressure: a build generating 10,000 lines/sec must not overwhelm the browser. The solution is server-side buffering with configurable flush intervals (e.g., batch every 100ms or 100 lines, whichever comes first).
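That flush policy fits in a few lines. A minimal sketch (`LogBatcher` is a hypothetical name; the 100 ms / 100 line thresholds come from the paragraph above, and a real implementation would also flush on a background timer rather than only on write):

```python
import time

class LogBatcher:
    """Buffers log lines and flushes on whichever limit is hit first:
    max_lines buffered lines, or max_interval seconds since the last flush."""

    def __init__(self, sink, max_lines=100, max_interval=0.1, clock=time.monotonic):
        self.sink = sink            # callable that receives a list of lines
        self.max_lines = max_lines
        self.max_interval = max_interval
        self.clock = clock
        self.buffer = []
        self.last_flush = clock()

    def write(self, line):
        self.buffer.append(line)
        if (len(self.buffer) >= self.max_lines
                or self.clock() - self.last_flush >= self.max_interval):
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []
        self.last_flush = self.clock()

batches = []
b = LogBatcher(batches.append, max_lines=3, max_interval=999)  # size-triggered only
for i in range(7):
    b.write(f"line {i}")
b.flush()  # drain the tail
print([len(batch) for batch in batches])  # -> [3, 3, 1]
```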


5. Phase 1: Single Machine CI Runner

Start simple. One machine, one runner binary, sequential job execution.

How It Works

  1. Webhook arrives: An HTTP server receives the push event payload

  2. YAML parsing: Read .github/workflows/*.yml, evaluate on: triggers against the event

  3. Step execution: For each matching workflow, iterate through jobs and steps sequentially.

  4. Status reporting: Update a SQLite database with job/step status. Serve a simple dashboard.
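The four steps above fit in a surprisingly small loop. A toy sketch (the workflow structure and names are illustrative, not the real runner's; isolation, timeouts, and log capture are omitted):

```python
import subprocess

# A "workflow" here is a trigger filter plus a list of shell steps --
# a deliberately tiny stand-in for parsed YAML.
WORKFLOWS = [
    {"on": "push",
     "jobs": [{"name": "build",
               "steps": ["echo compiling", "echo testing"]}]},
]

def handle_event(event, statuses):
    """Run every workflow matching the event, recording per-step status."""
    for wf in WORKFLOWS:
        if wf["on"] != event["type"]:
            continue
        for job in wf["jobs"]:
            for i, cmd in enumerate(job["steps"]):
                result = subprocess.run(cmd, shell=True, capture_output=True)
                status = "success" if result.returncode == 0 else "failed"
                statuses.append((job["name"], i, status))
                if status == "failed":   # stop the job on first failure
                    return

statuses = []
handle_event({"type": "push"}, statuses)
print(statuses)  # -> [('build', 0, 'success'), ('build', 1, 'success')]
```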

What Works at This Scale

  • Simple and correct: No distributed systems complexity. Easy to debug.

  • Steps share a filesystem: The build step produces a binary → the test step finds it on disk. No artifact passing is needed within a job.

  • One job at a time: The runner processes one job, finishes, then picks up the next. No concurrency bugs.

Architectural Properties (preserved in the distributed version)

  • Jobs are isolated from each other: Different jobs get separate workspace directories

  • Steps within a job share state: Same filesystem, same environment

  • Container-based isolation is optional: container: in the YAML triggers Docker; otherwise steps run on the host

This single-machine model is exactly how a self-hosted GitHub Actions runner works. You install the runner binary, register it with GitHub, and it processes jobs sequentially. GitHub's open-source runner is a C# codebase forked from the Azure Pipelines Agent.


6. Why the Naive Approach Fails (The Math)

The single-machine runner works perfectly for a small team. Here's where it breaks:

Problem 1: Throughput Ceiling

Even with multiple runners on one machine, you're limited by CPU, memory, and I/O. A beefy 96-core machine running 20 parallel containers handles ~10,000 jobs/day. Hitting 1B jobs/day would take roughly 100,000 such machines: you need distributed scheduling.

Problem 2: Single Database Bottleneck

Problem 3: No Isolation Between Tenants

One user's npm install pulling 2GB of dependencies starves other users' builds of disk I/O. One user's infinite loop consumes 100% CPU. There's no resource boundary between tenants.

Problem 4: The Exactly-Once Problem Emerges

This is where the real system design begins. At scale, you need:

  • A stateless scheduler that can crash and restart without losing state

  • A lease-based job acquisition protocol so workers can claim jobs without conflicts

  • A step-level state machine so crashed jobs can be resumed, not restarted

  • Per-tenant resource isolation so noisy neighbors can't degrade the platform


7. Phase 2: Distributed Architecture

High-level architecture diagram

Architecture Overview

How Real Companies Built This

The architecture above isn't theoretical β€” it's derived from how GitHub Actions, GitLab CI, and Buildkite actually work.

End-to-end request flow diagram

GitHub Actions uses a pull-based model where runners long-poll a Broker API (broker.actions.githubusercontent.com). Each runner's Listener process makes GET /message requests with a 50-second long-poll timeout. When a job is available, the runner calls POST /acquirejob to claim it within a 2-minute window. A heartbeat loop calls POST /renewjob every 60 seconds with a 10-minute TTL; this is the lease mechanism that enables exactly-once semantics.

GitLab CI uses a similar pull model with its own runner binary. Runners register with the GitLab instance and poll for jobs. GitLab's unique contribution is the stages: concept (in addition to needs:), providing a simpler mental model for linear pipelines.

Buildkite takes the "bring your own compute" approach: Buildkite provides the orchestration, and you host all the runners. Their Elastic CI Stack for AWS auto-scales runners using CloudFormation.

| Feature | GitHub Actions | GitLab CI | Buildkite |
|---|---|---|---|
| Runner model | Pull (Broker API) | Pull (GitLab API) | Pull (API) |
| Default runners | Hosted ephemeral VMs | Shared or self-hosted | Self-hosted only |
| Job dependency | needs: keyword | needs: + stages: | depends_on: |
| Autoscaling | Webhook-driven (workflow_job events) | Docker Machine / K8s executor | Elastic CI Stack |
| Isolation | Fresh VM per job (hosted) | Docker/K8s (configurable) | Customer-managed |

Key Data Structure: Job/Step State Machine

Every job and step moves through a strict state machine. This is the foundation of crash recovery and exactly-once semantics:

Staff+ Signal: The state machine is the single most important data structure in the system. If you get this right, crash recovery and exactly-once semantics fall out naturally. If you get it wrong, you'll spend months debugging ghost jobs and double deployments. The key insight is that state transitions are idempotent: writing COMPLETED twice has no effect, but transitioning from QUEUED to RUNNING must happen exactly once (via a conditional UPDATE with a version check).
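A sketch of the conditional-transition idea, using SQLite in place of PostgreSQL (the transition set and schema are illustrative):

```python
import sqlite3

# Allowed transitions of the job state machine described above.
TRANSITIONS = {
    ("QUEUED", "RUNNING"), ("RUNNING", "COMPLETED"),
    ("RUNNING", "FAILED"), ("RUNNING", "QUEUED"),   # lease expiry re-queues
}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, version INTEGER)")
db.execute("INSERT INTO jobs VALUES (1, 'QUEUED', 0)")

def transition(job_id, from_status, to_status):
    """Conditional UPDATE: succeeds only if the row is still in from_status,
    so two concurrent claimers can never both win. The version column is
    bumped so optimistic readers can detect staleness."""
    if (from_status, to_status) not in TRANSITIONS:
        return False
    cur = db.execute(
        "UPDATE jobs SET status = ?, version = version + 1 "
        "WHERE id = ? AND status = ?",
        (to_status, job_id, from_status))
    return cur.rowcount == 1   # 0 rows touched => someone else transitioned first

print(transition(1, "QUEUED", "RUNNING"))    # -> True  (first claimer wins)
print(transition(1, "QUEUED", "RUNNING"))    # -> False (second claimer loses)
print(transition(1, "RUNNING", "COMPLETED")) # -> True
```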


8. Core Component Deep Dives

8.1 Workflow Engine (Event β†’ Parse YAML β†’ Create Steps)

The Workflow Engine is the first component in the pipeline. It receives events from the event bus and turns them into executable job records.

Responsibilities:

  1. Event matching: Read every .yml file under .github/workflows/ at the push's HEAD commit. Evaluate on: triggers (branch filters, path filters, event types) against the incoming event.

  2. YAML parsing: Extract jobs, steps, matrix strategies, environment variables, and secrets references.

  3. Job record creation: For each matching workflow, create job records in the database with status QUEUED. For linear execution, jobs are ordered by sequence number. For DAG execution, jobs record their needs: dependencies.

  4. Matrix expansion: A matrix strategy like {os: [ubuntu, macos], node: [18, 20]} expands one job definition into 4 concrete jobs.

Design choice: The Workflow Engine is stateless. It reads the event, writes job records to the database, and moves on. If it crashes mid-processing, the event is re-consumed from Kafka and the engine re-creates the same jobs (idempotent via idempotency_key = hash(event_id, workflow_file, commit_sha)).
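Matrix expansion and the idempotency key are both small, deterministic functions. A sketch under the assumptions above (the key derivation mirrors the hash described, with JSON as an arbitrary serialization):

```python
import hashlib
import itertools
import json

def expand_matrix(matrix):
    """Expand {"os": [...], "node": [...]} into one dict per combination."""
    keys = sorted(matrix)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(matrix[k] for k in keys))]

def idempotency_key(event_id, workflow_file, commit_sha):
    """Deterministic key, so re-consuming the same Kafka event re-creates
    the same job records instead of duplicates."""
    raw = json.dumps([event_id, workflow_file, commit_sha])
    return hashlib.sha256(raw.encode()).hexdigest()

jobs = expand_matrix({"os": ["ubuntu", "macos"], "node": [18, 20]})
print(len(jobs))   # -> 4
print(jobs[0])     # -> {'node': 18, 'os': 'ubuntu'}
key = idempotency_key("evt-1", "ci.yml", "abc123")
assert key == idempotency_key("evt-1", "ci.yml", "abc123")  # stable across retries
```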

8.2 Job Scheduler (Stateless, CDC-Based)

The scheduler is the brain of the system. It determines which jobs are ready to run and places them in the appropriate queue.

The CDC Pattern (from real interview feedback, Story 2):

Instead of the scheduler actively polling the database for ready jobs, use Change Data Capture:

  1. Workflow Engine creates all steps with status PENDING in the database

  2. Scheduler queues the first step (or all root jobs in DAG mode)

  3. When a step completes, the database triggers a CDC event

  4. CDC event is consumed by the scheduler, which evaluates whether the next step is now ready

  5. If ready, scheduler queues the next step

Why CDC instead of polling?

  • Polling at 58K state transitions/sec requires aggressive database queries

  • CDC is event-driven: zero wasted reads

  • Decouples the scheduler from the database's read capacity

  • The scheduler becomes truly stateless: its entire state is "read CDC events, write to queue"

Why stateless?

  • If the scheduler crashes, restart it. It reads from the CDC stream (which has a durable offset) and resumes.

  • No in-memory state to lose. No failover election. No split-brain.

  • Multiple scheduler instances can run concurrently, partitioned by tenant or event stream partition.

Staff+ Signal: The stateless scheduler with CDC is the pattern that separates L5 from L6 answers. An L5 candidate designs a scheduler that polls the database. An L6 candidate uses CDC to make the scheduler event-driven and stateless. The key phrase: "The scheduler has no state of its own; it's a pure function from CDC events to queue operations."
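Under that framing, the scheduler core really is a pure function. A sketch for the linear case (the event shape is hypothetical; a real consumer would decode Debezium change records from Kafka):

```python
def schedule(cdc_event, steps, queue):
    """Pure function from a CDC event to queue operations.

    `steps` is the ordered step list for one job (linear execution);
    `cdc_event` is the decoded change record for a step status update.
    """
    if cdc_event["new_status"] != "COMPLETED":
        return
    idx = cdc_event["step_index"]
    if idx + 1 < len(steps) and steps[idx + 1]["status"] == "PENDING":
        steps[idx + 1]["status"] = "QUEUED"
        queue.append(steps[idx + 1]["name"])   # enqueue the next ready step

steps = [{"name": "build", "status": "COMPLETED"},
         {"name": "test", "status": "PENDING"},
         {"name": "deploy", "status": "PENDING"}]
queue = []
schedule({"step_index": 0, "new_status": "COMPLETED"}, steps, queue)
print(queue)  # -> ['test']
```

The DAG extension changes only the readiness check: instead of "is the next step pending", it becomes "are all of this step's needs: completed".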

8.3 Worker/Runner (K8s Pods + Docker)

Runner internal architecture diagram

The worker is the component that actually executes code. In GitHub Actions, this is the open-source runner binary, which consists of two processes:

The Listener (long-lived orchestrator):

  1. Session management: Creates a session with the Broker API, sending runner agent ID, name, version, and OS info

  2. Message polling: Long-polls GET /message?sessionId=X&status=Online with 50-second timeout. Returns a RunnerJobRequest or 202 Accepted (no work)

  3. Job acquisition: Extracts runner_request_id, calls POST /acquirejob on the Run Service. Has ~2 minutes to claim the job (the lease window)

  4. Lock renewal (heartbeat): Background task calls POST /renewjob every 60 seconds. Lock TTL is 10 minutes. If heartbeat stops, the job is considered abandoned.
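The Listener's poll-acquire-heartbeat cycle can be sketched with an injected broker object standing in for the HTTPS Broker API (class and method names here are illustrative, not the real runner's):

```python
import threading

class Listener:
    """Sketch of the runner Listener loop: long-poll, acquire, heartbeat.
    `broker` is any object with poll()/acquire()/renew()."""

    def __init__(self, broker, heartbeat_interval=60):
        self.broker = broker
        self.heartbeat_interval = heartbeat_interval

    def run_once(self, run_job):
        msg = self.broker.poll()                 # GET /message (50s long-poll)
        if msg is None:
            return False                         # 202: no work available
        if not self.broker.acquire(msg["runner_request_id"]):
            return False                         # someone else claimed it first
        stop = threading.Event()

        def heartbeat():                         # POST /renewjob on an interval
            while not stop.wait(self.heartbeat_interval):
                self.broker.renew(msg["runner_request_id"])

        t = threading.Thread(target=heartbeat, daemon=True)
        t.start()
        try:
            run_job(msg)                         # spawn the Worker process
        finally:
            stop.set()                           # heartbeat stops with the job
            t.join()
        return True

class FakeBroker:
    def __init__(self):
        self.acquired = []
    def poll(self):
        return {"runner_request_id": "req-1"}
    def acquire(self, rid):
        self.acquired.append(rid)
        return True
    def renew(self, rid):
        pass

broker = FakeBroker()
ran = []
Listener(broker, heartbeat_interval=0.01).run_once(lambda msg: ran.append(msg))
print(broker.acquired, len(ran))  # -> ['req-1'] 1
```

Note that the heartbeat dies with the process: a crashed worker stops renewing automatically, which is exactly what lets the lease expire.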

The Worker (short-lived, spawned per job):

  1. Checkout: Download repository at the specified commit SHA

  2. Action resolution: For uses: steps, download the action from GitHub or Docker Hub

  3. Step execution: Run shell commands or action entrypoints. Inject environment variables and decrypted secrets. Mask secrets in log output.

  4. Log streaming: Capture stdout/stderr and stream to the Log Service in real-time (batched every 100ms)

  5. Artifact upload: Handle actions/upload-artifact by uploading to S3-compatible artifact store

  6. Output variables: Monitor $GITHUB_OUTPUT file for variables passed to subsequent steps

Kubernetes-Based Execution (production pattern):

For managed runners at scale, each job runs in an ephemeral K8s pod:

  • Pod spec is generated from the runs-on: label (maps to a node pool with matching resources)

  • Pod lifecycle: create → pull image → execute steps → upload logs → terminate

  • Resource limits enforced by K8s: CPU, memory, disk, network

  • Each pod gets a unique workspace volume (ephemeral storage or PVC)

  • Pod is deleted after job completion: perfect isolation, no state leaks

8.4 Real-Time Status Service (SSE/WebSocket)

Developers expect to watch their builds in real time. This requires a streaming infrastructure:

Architecture:

SSE for log streaming:

WebSocket for status updates:

Key design decisions:

  • Backpressure: Server-side buffering batches log lines (100ms or 100 lines, whichever first) to prevent overwhelming the browser

  • Reconnection: SSE has built-in reconnection. Client sends Last-Event-ID header to resume from where it left off

  • Fan-out: Multiple users watching the same build share a single log subscription (pub/sub at the gateway layer)

  • Persistence: Logs are simultaneously streamed to S3 for permanent storage and served from a time-windowed buffer for live viewing
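The reconnection path can be sketched as a bounded buffer keyed by event id (`LiveLogBuffer` is a hypothetical name; a production buffer would also evict by age, per the time-windowed design above):

```python
from collections import deque

class LiveLogBuffer:
    """Bounded in-memory buffer for live viewing. Each line gets a
    monotonically increasing id, so an SSE client that reconnects can
    resume from its Last-Event-ID header."""

    def __init__(self, max_lines=10_000):
        self.lines = deque(maxlen=max_lines)   # (event_id, text) pairs
        self.next_id = 0

    def append(self, text):
        self.lines.append((self.next_id, text))
        self.next_id += 1

    def replay_after(self, last_event_id):
        """Lines the client hasn't seen yet (ids above Last-Event-ID)."""
        return [(i, t) for i, t in self.lines if i > last_event_id]

buf = LiveLogBuffer()
for n in range(5):
    buf.append(f"line {n}")
print(buf.replay_after(2))  # -> [(3, 'line 3'), (4, 'line 4')]
```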


9. Exactly-Once Execution Deep Dive

This is the core question in 3 out of 5 interview stories for CI/CD system design. Dedicate five to seven minutes of your interview to it.

Why It's Hard

The Solution: Lease-Based Acquisition + Idempotent Execution

1. Lease-based job acquisition (prevents concurrent execution):

2. Heartbeat renewal (extends the lease while working):

3. Lease expiry (detects dead workers):
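A runnable sketch of the three mechanisms above, with SQLite standing in for PostgreSQL and explicit timestamps passed in place of `now()` (table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    id INTEGER PRIMARY KEY, status TEXT,
    worker_id TEXT, lease_expires_at REAL)""")
db.execute("INSERT INTO jobs VALUES (1, 'QUEUED', NULL, NULL)")

LEASE_TTL = 600  # 10-minute lock TTL, as in the text

def acquire(job_id, worker_id, now):
    """Claim the job only if it is still QUEUED (conditional UPDATE)."""
    cur = db.execute(
        "UPDATE jobs SET status='RUNNING', worker_id=?, lease_expires_at=? "
        "WHERE id=? AND status='QUEUED'",
        (worker_id, now + LEASE_TTL, job_id))
    return cur.rowcount == 1

def renew(job_id, worker_id, now):
    """Heartbeat: extend the lease, but only for the current holder."""
    cur = db.execute(
        "UPDATE jobs SET lease_expires_at=? "
        "WHERE id=? AND worker_id=? AND status='RUNNING'",
        (now + LEASE_TTL, job_id, worker_id))
    return cur.rowcount == 1

def reap_expired(now):
    """Re-queue jobs whose lease lapsed: their worker is presumed dead."""
    db.execute(
        "UPDATE jobs SET status='QUEUED', worker_id=NULL, lease_expires_at=NULL "
        "WHERE status='RUNNING' AND lease_expires_at < ?", (now,))

print(acquire(1, "worker-A", now=0.0))     # -> True: A wins the claim
print(acquire(1, "worker-B", now=0.0))     # -> False: conditional UPDATE hits 0 rows
reap_expired(now=601.0)                    # A never heartbeated; lease lapses
print(acquire(1, "worker-B", now=601.0))   # -> True: B picks up the re-queued job
```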

4. Idempotency keys for step execution:

Each step execution gets a deterministic idempotency key:
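One plausible derivation (the exact fields feeding the key are an assumption; what matters is that retries on any worker produce the same key):

```python
import hashlib

def step_idempotency_key(run_id, job_id, step_index, commit_sha):
    """Deterministic per-step key: the same step of the same run always
    yields the same key, no matter which worker executes it."""
    raw = f"{run_id}:{job_id}:{step_index}:{commit_sha}"
    return hashlib.sha256(raw.encode()).hexdigest()

k1 = step_idempotency_key("run-42", "job-1", 3, "abc123")
k2 = step_idempotency_key("run-42", "job-1", 3, "abc123")   # retried on worker B
print(k1 == k2)  # -> True: the deployment system sees one logical operation
```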

For deployment steps, this key is passed to the deployment system, which checks:

  • Has this exact deployment already been applied?

  • If yes, return the cached result without re-executing

  • If no, execute and record the key

What "Exactly-Once" Really Means

Exactly-once execution is impossible in a distributed system (a classic consequence of the Two Generals Problem). What we actually implement is at-least-once delivery plus idempotent execution, which is indistinguishable from exactly-once to an outside observer.

The State Machine That Makes It Work

Staff+ Signal: The 10-minute heartbeat TTL is a critical design trade-off. Shorter TTL (2 min) detects dead workers faster but causes false positives during GC pauses or network blips. Longer TTL (30 min) wastes time when workers genuinely crash. GitHub Actions uses 10 minutes. The right answer in an interview is to state the trade-off, pick a number, and explain when you'd adjust it.


10. Database Design Deep Dive

Core Tables (PostgreSQL)
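A minimal sketch of the core tables, written against SQLite so it runs as-is; a production PostgreSQL schema would use BIGSERIAL, TIMESTAMPTZ, and enum types, and the column set below is illustrative:

```python
import sqlite3

SCHEMA = """
CREATE TABLE workflows (
    id         INTEGER PRIMARY KEY,
    tenant_id  INTEGER NOT NULL,
    repo       TEXT NOT NULL,
    yaml_path  TEXT NOT NULL
);
CREATE TABLE jobs (
    id           INTEGER PRIMARY KEY,
    workflow_id  INTEGER NOT NULL REFERENCES workflows(id),
    tenant_id    INTEGER NOT NULL,          -- denormalized: it's the shard key
    status       TEXT NOT NULL DEFAULT 'QUEUED',
    runner_label TEXT NOT NULL,
    lease_expires_at REAL
);
CREATE TABLE steps (
    id        INTEGER PRIMARY KEY,
    job_id    INTEGER NOT NULL REFERENCES jobs(id),
    seq       INTEGER NOT NULL,             -- order within the job
    status    TEXT NOT NULL DEFAULT 'PENDING',
    idempotency_key TEXT UNIQUE
);
-- Partial index: only currently queued jobs, tiny and memory-resident.
CREATE INDEX idx_jobs_queued ON jobs(runner_label, id)
    WHERE status = 'QUEUED';
"""

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
print([r[0] for r in db.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")])
# -> ['jobs', 'steps', 'workflows']
```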

Step State Machine with CDC

The key insight: instead of the scheduler polling for ready steps, use CDC to react to state changes.

CDC tool: Debezium connected to PostgreSQL's WAL (logical replication). Publishes step status changes to Kafka. Scheduler consumes from this topic.

Why this matters: At 58K step transitions/sec, polling the database every 100ms would generate 10 queries/sec per scheduler instance. With 100 scheduler instances, that's 1,000 queries/sec spent on scheduling alone, all wasted read capacity. CDC makes it zero.

Indexing Strategy and Query Patterns

Hot-path queries (>10K QPS; must be cached or indexed):

Warm-path queries (100-1K QPS):

Sharding Strategy

At 1B jobs/day, a single PostgreSQL instance can't handle the write volume. Shard by tenant_id:
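A sketch of the routing function (hash-based placement is one option; a per-tenant lookup table is another common choice, and easier to rebalance):

```python
import hashlib

def shard_for(tenant_id: str, num_shards: int = 256) -> int:
    """Stable tenant -> shard mapping. All of a tenant's workflows, jobs,
    and steps co-locate on one shard, so scheduling queries never fan out.
    Hashing avoids hot shards from sequentially assigned tenant ids."""
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

s = shard_for("acme-corp")
print(0 <= s < 256, s == shard_for("acme-corp"))  # -> True True
```

Growing from 16 to 256 shards with plain modulo means moving most tenants; in practice you'd reshard via a directory service or consistent hashing.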

Staff+ Signal: The partial indexes on jobs are critical for scheduler performance. Without the WHERE status = 'QUEUED' filter, the index includes all 1B+ historical jobs. With the partial index, it only includes the ~100K currently queued jobs: orders of magnitude smaller, fitting entirely in memory. This is the difference between a 1ms and a 100ms scheduler query.


11. The Scaling Journey

Stage 1: Single Runner (0-1K jobs/day)

The Phase 1 architecture. One machine, one runner binary, SQLite for state.

Capability: Sequential job execution. No parallelism. No isolation. Limit: One job at a time. One machine's resources.

Stage 2: Multiple Runners + Queue (1K-100K jobs/day)

New capabilities at this stage:

  • Multiple runners poll from a shared Redis queue

  • Pull-based model: runners self-register and poll for work

  • Basic heartbeat (60s) with lease renewal

  • PostgreSQL replaces SQLite for shared state

  • Ephemeral containers (Docker) for job isolation

Limit: Single queue becomes bottleneck. No tenant isolation. Single PostgreSQL node limits write throughput.

Stage 3: K8s-Based with Autoscaling (100K-10M jobs/day)

Enterprise autoscaling architecture diagram

New capabilities at this stage:

  • Kafka for event ingestion (durability, replay, partitioning)

  • CDC-based stateless scheduler (no polling)

  • K8s pods for job execution (ephemeral, auto-scaled)

  • Webhook-driven autoscaling: workflow_job.queued events trigger pod creation

  • Runner pools segmented by label (capability-based routing)

  • Per-tenant queue partitioning begins

Runner Pool Architecture (enterprise pattern):

| Pool | Label | Instance Type | Use Case |
|---|---|---|---|
| General Linux | ubuntu-latest | c5.2xlarge spot | Most CI jobs |
| ARM Builds | self-hosted-arm64 | m6g.2xlarge | ARM container images |
| GPU Testing | gpu-runner | p3.2xlarge | ML model training/testing |
| Secure Builds | sox-compliant | Isolated VPC | Compliance-sensitive builds |
| Large Monorepo | xlarge | c5.4xlarge, 500GB SSD | Builds requiring > 14GB RAM |

Autoscaling algorithm:
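One plausible sizing rule, driven by the queue-depth metric rather than CPU (the 5-minute drain target and 20% headroom constants are assumptions, not anyone's documented defaults):

```python
import math

def desired_runners(queue_depth, avg_duration_s, current_runners,
                    target_drain_s=300, headroom=1.2):
    """Event-driven sizing: enough runners to drain the current queue within
    target_drain_s, plus headroom for jobs arriving during scale-up."""
    needed = math.ceil(queue_depth * avg_duration_s / target_drain_s * headroom)
    return max(needed, current_runners)  # never scale down on a queued event

# 1,000 queued jobs averaging 3 minutes each:
print(desired_runners(1_000, 180, current_runners=100))  # -> 720
```

Scale-down runs on a separate, slower loop (idle timeout per runner), so a burst never fights its own cooldown.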

Limit: Single-region. Single PostgreSQL cluster. Need dedicated ops team for Kafka + K8s.

Stage 4: Multi-Tenant Enterprise (10M+ jobs/day)

Everything in Stage 3, plus:

  • Database sharding by tenant_id (16 → 256 shards)

  • Multi-region deployment with geo-routed job queues

  • Dedicated runner pools for enterprise tenants (SLA-backed)

  • Spot/preemptible instances for 60-90% cost savings on non-critical jobs

  • Pre-baked VM images with runner binary + common tools pre-installed (boot-to-first-step < 15 seconds)

  • Multi-level caching: dependency cache (per repo), Docker layer cache (per runner pool), build cache (Bazel/Gradle)

Staff+ Signal: Autoscaling must be event-driven, not metric-driven. Webhook-based autoscaling (scaling on workflow_job.queued events) reacts in seconds. Metric-based autoscaling (watching CPU) reacts in minutes. For bursty CI workloads (merge queues, business-hours spikes, monorepo matrix expansions), the difference between 5-second and 5-minute scaling is the difference between "builds are fast" and "builds are queued."


12. Failure Modes & Resilience

Failure handling and recovery flow diagram

Failure Scenarios

| Failure | Detection | Recovery | Blast Radius |
|---|---|---|---|
| Worker crash mid-job | Heartbeat timeout (10 min) | Lease expires → job re-queued → another worker picks up | Single job; idempotency keys prevent double execution |
| Worker crash mid-deploy step | Heartbeat timeout | Re-queued; deploy step checks idempotency key before re-executing | Single job; deployment system must support idempotent deploys |
| Scheduler crash | Process monitor / K8s restart | Stateless restart; resumes from CDC stream offset | Zero: scheduler has no state. Brief scheduling delay while restarting. |
| Database failure (PostgreSQL) | Connection pool errors, health checks | Failover to synchronous replica (RDS Multi-AZ) | All new job scheduling pauses; running jobs continue unaffected (workers have the job payload in memory) |
| Kafka broker failure | ISR replication; consumer lag alerts | Automatic leader election; producers retry | Brief event ingestion delay; no data loss if replication factor ≥ 3 |
| Log streaming service down | Health checks; SSE connection drops | Workers buffer logs locally; flush to S3 on reconnection | Developers can't watch live logs, but builds continue normally |
| Network partition (worker ↔ control plane) | Heartbeat fails to reach server | Worker continues executing (optimistic); if the lease expires before reconnection, the job is re-queued | May cause duplicate execution, mitigated by idempotency keys |
| Poison YAML (workflow causes OOM) | Container OOM-killed signal | K8s restarts pod; job marked FAILED with OOM error | Single job in single tenant; K8s resource limits contain the blast radius |

Worker Crash Recovery Flow
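The flow can be simulated end to end: worker A crashes mid-job, the lease expires, and worker B resumes without re-running completed steps. A toy sketch (durable state is a set standing in for the database's recorded idempotency keys):

```python
def run_job(steps, completed_keys, crash_after=None):
    """Execute steps in order, skipping any whose idempotency key is already
    recorded. `completed_keys` is shared durable state."""
    for i, _step in enumerate(steps):
        if crash_after is not None and i == crash_after:
            return "CRASHED"               # worker dies; its heartbeat stops
        key = f"job-1:step-{i}"
        if key in completed_keys:
            continue                       # already done by a previous worker
        completed_keys.add(key)            # "execute", then record the key
    return "COMPLETED"

steps = ["checkout", "build", "test", "deploy"]
done = set()
print(run_job(steps, done, crash_after=2))  # -> CRASHED (after checkout + build)
# Lease expires, job is re-queued; worker B resumes without redoing steps 0-1:
before = len(done)
print(run_job(steps, done), len(done) - before)  # -> COMPLETED 2
```

In the real system, "record the key" and "execute" must be ordered carefully; recording before side effects complete is exactly the "silent success" hazard described below.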

Multi-Tenancy Isolation

At enterprise scale, isolation is the difference between a P1 and a P4 incident:

  1. VM-level isolation: Each GitHub-hosted job runs in a fresh VM destroyed after use. No state leaks between jobs.

  2. Network isolation: Self-hosted runners can be placed in private VPCs with restricted egress (SOX/HIPAA).

  3. Secret scoping: Secrets are scoped to repository, environment, or organization level. Encrypted at rest, injected at runtime, masked in logs.

  4. Runner groups: Organizations create runner groups with access control, restricting which repositories can use which runner pools.

  5. Resource limits: K8s enforces CPU, memory, and disk limits per pod. One tenant's build can't starve another.

Staff+ Signal: The most dangerous failure mode is the "silent success." Worker A deploys to production, then crashes before recording success. Worker B re-deploys. If the deployment isn't idempotent (e.g., it runs database migrations), you get data corruption. The fix isn't in the CI/CD system; it's in requiring deployment targets to support idempotent operations. This is a cross-system design constraint that most candidates miss.


13. Observability & Operations

Key Metrics

  • cicd_jobs_queued_total{tenant, runner_label}: inbound job rate; primary scaling signal

  • cicd_jobs_running_gauge{runner_label}: current concurrent jobs; capacity indicator

  • cicd_job_wait_time_seconds{tenant, quantile}: time from QUEUED to RUNNING; the metric developers care about most

  • cicd_job_duration_seconds{tenant, quantile}: execution time; helps identify slow builds

  • cicd_step_failures_total{tenant, step_name}: step failure rate; catches flaky tests

  • cicd_lease_expirations_total: dead-worker detection; spikes mean infrastructure issues

  • cicd_queue_depth{runner_label}: pending jobs per label; the primary autoscaling signal

  • cicd_estimated_drain_time_seconds{runner_label}: (queue_depth × avg_duration) / workers; the best single metric for system health

  • cicd_scheduler_cdc_lag_seconds: CDC consumer lag; if it's growing, the scheduler is falling behind

Distributed Tracing

A full trace for a single workflow run:

Alerting Strategy

| Alert | Condition | Severity | Action |
|---|---|---|---|
| Queue drain time > 5 min | Sustained for 10 min | P2 | Scale runners; investigate slow builds |
| Lease expirations > 10/min | Spike detection | P1 | Worker fleet issue; check node health, OOM kills |
| Job wait time p99 > 60s | Sustained for 5 min | P2 | Scale runners for the affected label |
| CDC lag > 30s | Growing trend | P1 | Scheduler is falling behind; scale scheduler instances |
| Step failure rate > 20% | Platform-wide for 15 min | P1 | Possible infrastructure issue (Docker registry down, network) |
| Database replication lag > 5s | Any replica | P2 | Investigate replica health; may affect read-after-write consistency |

Staff+ Signal: The most actionable dashboard isn't per-metric; it's a tenant-level heatmap showing job wait time × failure rate. A single red cell means one tenant has a broken pipeline (their problem). A full red column means a platform-wide outage (your problem). This distinction determines whether you page the customer or yourself.


14. Design Trade-offs

| Decision | Option A | Option B | Recommended | Why |
|---|---|---|---|---|
| Runner model | Pull-based (runners poll for work) | Push-based (server dispatches to runners) | Pull-based | Runners self-register; no server-side runner registry needed. Firewall-friendly (outbound connections only). Horizontal scaling is trivial. Trade-off: 0-50s polling latency. |
| Execution environment | Ephemeral (fresh VM/container per job) | Persistent (long-running runners) | Ephemeral for hosted; persistent for self-hosted | Ephemeral gives perfect isolation and security. Persistent gives cache locality and instant startup. Use ephemeral by default; allow persistent for specialized hardware. |
| Job dependencies | Linear (sequential) | DAG (needs: keyword) | Linear first, DAG as extension | Linear is simpler to implement and debug. Most workflows are effectively linear. DAG adds topological sort, cycle detection, and complex failure propagation. Ship linear, iterate to DAG. |
| Step progression | Polling scheduler | CDC-based scheduler | CDC | CDC is event-driven (zero wasted reads) and makes the scheduler stateless. Polling works at small scale but wastes DB read capacity at 58K transitions/sec. |
| Scheduling | Centralized scheduler | Distributed scheduling | Centralized with HA | CI scheduling doesn't need stock-exchange throughput. A well-replicated centralized scheduler handles the load and is simpler to reason about. |
| Workflow definition | Declarative YAML | Imperative code (Go/Python) | YAML | Easy to learn, lint, validate, and visualize. Limited expressiveness is offset by composable actions (the "standard library"). Code-based (like Dagger) offers more power but harder governance. |
| Log storage | PostgreSQL | S3 + streaming layer | S3 + streaming | 100 TB/day of logs rules out PostgreSQL. S3 for persistence, SSE for live streaming. Buffer recent logs in Redis for fast retrieval. |

Staff+ Signal: The pull vs. push decision is the most consequential architectural choice. Pull-based wins for cloud-native, auto-scaled environments because it decouples the control plane from the worker fleet. The server doesn't need to know which workers are online β€” it just puts jobs in queues. Workers that poll get work; workers that don't, don't. This makes horizontal scaling, failure recovery, and multi-tenancy almost free. The only cost is latency (0-50s poll interval), which is acceptable for CI/CD.


15. Common Interview Mistakes

  1. Starting with DAG when the interviewer wants linear: "First, I'll build a topological sort for the job dependency graph..." β†’ The interviewer explicitly said jobs run sequentially. You just spent 5 minutes on something they told you to skip. Staff+ answer: "I'll design for linear execution first. Each job runs after the previous one completes. The state machine extends naturally to DAG by adding a needs: field β€” I can show that if we have time."

  2. Ignoring exactly-once execution: "The worker picks up a job and runs it." β†’ What happens when the worker crashes? What if two workers pick up the same job? This is THE core question. Staff+ answer: lease-based acquisition with heartbeat, optimistic locking on state transitions, idempotency keys on steps.

  3. No multi-tenancy from the start: "All jobs go in one queue." β†’ One tenant's 10,000-job matrix expansion blocks everyone else's single-job PR check. Staff+ answer: queues partitioned by runner label, with per-tenant fair scheduling or priority tiers.

  4. Not mentioning Kubernetes/Docker: "Jobs run on the machine." β†’ How are they isolated? How do you limit CPU/memory? How do you prevent a malicious workflow from reading another tenant's secrets? Staff+ answer: ephemeral K8s pods with resource limits, network policies, and destroyed after completion.

  5. No real-time status UI: "Results are available after the workflow completes." β†’ Developers stare at CI for minutes. They need live log streaming. Staff+ answer: SSE for log streaming, WebSocket for status updates, with backpressure handling.

  6. Waiting to be asked about crash recovery: "If the worker crashes... hmm, I hadn't thought about that." β†’ Bring this up proactively. It's the whole point of the question. Staff+ answer: "The lease mechanism handles worker crashes. Let me walk through the timeline: worker acquires job, starts heartbeat, crashes at step 3, lease expires after 10 minutes, scheduler re-queues, another worker picks it up, idempotency keys prevent double-execution of completed steps."

  7. Designing a monolithic scheduler with in-memory state: "The scheduler keeps track of all running jobs in a HashMap." β†’ When this process crashes, all state is lost. Staff+ answer: stateless scheduler that reads from CDC stream. All state lives in the database. Scheduler can crash and restart without losing anything.
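The lease-based acquisition called out in mistakes 2 and 6 comes down to one conditional UPDATE. Below is a minimal sketch using SQLite for runnability; the `jobs` schema and column names are illustrative assumptions, not GitHub's actual schema.

```python
import sqlite3
import time

# Illustrative schema: state machine + version counter + lease columns.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    id TEXT PRIMARY KEY, state TEXT, version INTEGER,
    lease_owner TEXT, lease_expires_at REAL)""")
db.execute("INSERT INTO jobs VALUES ('job-1', 'queued', 0, NULL, NULL)")

LEASE_TTL_S = 600  # 10-minute lease, matching the timeline above

def try_acquire(worker_id, job_id, version):
    """Conditional UPDATE with optimistic locking: the write succeeds only
    if the job is still 'queued' at the version we read. Two workers racing
    on the same job means exactly one UPDATE matches a row."""
    cur = db.execute(
        """UPDATE jobs SET state = 'running', version = version + 1,
                  lease_owner = ?, lease_expires_at = ?
           WHERE id = ? AND state = 'queued' AND version = ?""",
        (worker_id, time.time() + LEASE_TTL_S, job_id, version))
    return cur.rowcount == 1  # 0 rows updated => another worker won the race

a = try_acquire("worker-A", "job-1", 0)
b = try_acquire("worker-B", "job-1", 0)
print(a, b)  # β†’ True False
```

The heartbeat is the same pattern: an UPDATE that extends `lease_expires_at` guarded by `WHERE lease_owner = ?`, and the reaper re-queues any `running` job whose lease has expired.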


16. Interview Cheat Sheet

Time Allocation (45-minute interview)

| Phase | Time | What to Cover |
|---|---|---|
| Clarify requirements | 3 min | Linear vs DAG, scale (1B jobs/day), isolation model, real-time requirements |
| REST API design | 5 min | Trigger endpoint (202 Accepted), job status, log streaming (SSE), webhook payload |
| High-level design | 10 min | Event bus β†’ workflow engine β†’ stateless scheduler (CDC) β†’ job queues β†’ K8s runner pods |
| Deep dive: exactly-once | 7 min | Lease-based acquisition, heartbeat + TTL, optimistic locking, idempotency keys, state machine |
| Deep dive: database + scaling | 10 min | Schema (workflows/jobs/steps), CDC for step progression, sharding by tenant_id, partial indexes |
| Failure modes + trade-offs | 7 min | Worker crash β†’ lease expiry β†’ re-queue, scheduler crash β†’ stateless restart, pull vs push |
| Wrap-up | 3 min | Scaling journey summary, what I'd build next (DAG, caching, multi-region) |

Step-by-Step Answer Guide

  1. Clarify: "Are jobs linear or DAG-based? I'll start with linear. How many jobs/day? 1 billion = ~12K/sec average, ~40K peak. Do we need real-time log streaming?"

  2. REST API: Show trigger endpoint (POST /workflows/{id}/dispatches β†’ 202 Accepted with run_id), job status (GET /runs/{run_id}/jobs), and log streaming (GET /jobs/{job_id}/logs/stream via SSE).

  3. Key insight: "The hard part isn't running shell commands β€” it's guaranteeing every job executes exactly once even when workers crash."

  4. Single machine: One runner, sequential execution, SQLite. Works under 1K jobs/day.

  5. Prove it fails: "58K step transitions/sec overwhelms a single database. One tenant's matrix expansion blocks everyone. A crashed worker leaves a job in limbo β€” did it complete? Do we re-run?"

  6. Distributed architecture: Event bus (Kafka) β†’ stateless workflow engine β†’ CDC-based scheduler β†’ job queues (partitioned by label) β†’ K8s runner pods. Show the mermaid diagram.

  7. Exactly-once deep dive: "Worker acquires job via conditional UPDATE with optimistic locking. Heartbeat every 60s extends 10-min lease. If worker crashes, lease expires, reaper re-queues. Steps have idempotency keys β€” re-execution of completed steps is a no-op."

  8. Database design: Show the schema with partial indexes. Explain CDC from PostgreSQL WAL via Debezium. Sharding by tenant_id for write distribution.

  9. Failure handling: Walk through worker crash timeline. Scheduler crash is a non-event (stateless, restarts from CDC offset). DB failover via synchronous replication.

  10. Scaling journey: Single runner β†’ runner pool + Redis queue β†’ K8s + Kafka + CDC β†’ sharded multi-region.

  11. Trade-offs: Pull vs push (pull wins for scale), linear vs DAG (linear first), CDC vs polling (CDC at scale), ephemeral vs persistent runners.
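Step 7's claim β€” "re-execution of completed steps is a no-op" β€” is worth being able to demonstrate. A minimal sketch of idempotent step execution, assuming a completed-steps record keyed by `(job_id, step_index)`; in production this would be a DB table written durably alongside each step's results, not an in-memory set:

```python
completed = set()  # stand-in for a durable (job_id, step_index) record

def run_step(job_id, step_index, action, log):
    """Execute a step at most once per job: a re-delivered job skips
    steps that already ran to completion."""
    key = (job_id, step_index)
    if key in completed:
        log.append(f"skip {key}")     # idempotency key hit: no-op
        return
    action()
    completed.add(key)                # record AFTER the step's effects land
    log.append(f"ran {key}")

log = []
steps = [lambda: None, lambda: None]  # placeholder build/test actions
for delivery in range(2):             # simulate at-least-once delivery
    for i, step in enumerate(steps):
        run_step("job-1", i, step, log)
# first delivery runs both steps; the redelivery is entirely skipped
```

This is the second half of the exactly-once answer: the lease gives at-least-once delivery, and the idempotency key makes the duplicate delivery harmless.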

What the Interviewer Wants to Hear

  • At L5/Senior: Pull-based runners, job queue, basic heartbeat, container isolation. Mentions exactly-once as a concern.

  • At L6/Staff: Stateless scheduler with CDC, lease-based acquisition with optimistic locking, idempotency keys for steps, partial indexes, sharding by tenant_id. References how GitHub Actions actually works (Broker API, 10-min TTL, runner binary architecture).

  • At L7/Principal: Organizational boundaries (who owns the workflow engine vs. scheduler vs. runner fleet), multi-region job routing, tenant-level SLA design with blast radius guarantees, migration path from linear to DAG scheduling, cost modeling (spot instances, pre-baked images, cache hit rates).

[Figure: Job dependency DAG execution diagram]

Advanced Extension: DAG Scheduling

Once you've covered linear execution, extend to DAG with the needs: keyword:

The DAG scheduler extends the linear model:

  1. Root jobs (no needs:) are queued immediately

  2. Completion callback: When a job completes, iterate all jobs listing it in needs:. If all parents completed, queue the downstream job.

  3. Failure propagation: If a parent fails, downstream jobs are skipped (not failed). if: always() overrides this.

  4. Matrix expansion: A matrix job like {os: [ubuntu, macos]} expands into 2 jobs. needs: [build] waits for all matrix instances.

The state machine is identical β€” only the "when to queue next" logic changes. This is why designing for linear first is correct: DAG is a natural extension, not a rewrite.
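The completion callback above can be sketched in a dozen lines. This is an illustrative toy (the job names and the `needs` mapping are invented); it models only happy-path dependency resolution, not failure propagation or matrix expansion:

```python
# needs: maps each job to the parents it waits on (roots have an empty list)
needs = {"build": [], "lint": [], "test": ["build"], "deploy": ["test", "lint"]}
done, queued, order = set(), [], []

def queue_ready(finished_job=None):
    """Completion callback: record the finished job, then queue every job
    whose parents have all completed. Roots queue on the first call."""
    if finished_job is not None:
        done.add(finished_job)
    for job, parents in needs.items():
        if job not in done and job not in queued and all(p in done for p in parents):
            queued.append(job)

queue_ready()              # roots (no needs:) are queued immediately
while queued:              # simulate workers completing jobs in queue order
    job = queued.pop(0)
    order.append(job)
    queue_ready(job)       # each completion may unlock downstream jobs
# every job runs exactly once, and parents always precede children
```

Note that `queue_ready` is the only DAG-specific code: the state machine, lease, and idempotency logic from the linear design are untouched, which is the "natural extension, not a rewrite" point in concrete form.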


Written as a reference for staff-level system design interviews. The architectural patterns described here apply beyond CI/CD to any distributed task orchestration system β€” job schedulers, data pipelines, deployment platforms, and batch processing systems.
