System Design: Distributed Crossword Puzzle Solver
From One Machine to a Fleet of Workers: A Staff Engineer's Guide
1. The Problem & Why It's Hard
You're asked to build a service that solves crossword puzzles. You're given:
A board (~50×50 grid) with about 100 slots (each has a position, direction, and length)
A dictionary of ~1 million words
A constraint: wherever two slots cross, the letters at the intersection must match
On the surface, it sounds like "just a search problem." The trap is underestimating the search space.
The interviewer's real question: Can you prove a single computer is too slow, and then design a distributed system that splits the work intelligently, handles dead ends fast, and moves tasks around if workers get stuck?
This is not an algorithm puzzle. It is a distributed systems design challenge.
2. Requirements & Scope
Functional Requirements
Submit a puzzle: Accept a grid layout (slot definitions) and return a solved board
Partial solve support: Return the best partial solution if the puzzle is unsolvable (or cannot be solved within the time budget)
Async solving: For large puzzles, return a job ID and poll/subscribe for results
Dictionary management: CRUD operations on the word dictionary
Cancel solving: Allow clients to cancel in-flight solve requests
Non-Functional Requirements
Solve latency (simple puzzle): < 5s p99 – interactive use cases
Solve latency (hard puzzle): < 60s p99 – batch/asynchronous acceptable
Dictionary size: 1M words – realistic production corpus
Concurrent solve jobs: 1,000+ – multi-tenant SaaS
Availability: 99.9% – ~8.7 hr/year downtime budget
Correctness: 100% – a wrong answer is worse than no answer
Idempotency: yes – same puzzle, same result (given the same dictionary)
Horizontal scalability: yes – workers scale independently of the API
Scale Estimation (Back-of-Envelope)
3. The Algorithm: Constraint Satisfaction Problem
Before designing the system, understand what we're asking computers to do.
3.1 Formalization as CSP
A crossword puzzle is a Constraint Satisfaction Problem (CSP):
Variables: Each slot S_i (e.g., 2-ACROSS, 7-DOWN)
Domain: The set of candidate words that fit S_i's length (initially thousands of words)
Constraints: For every pair of crossing slots (S_i, S_j), the letter at the intersection must match
3.2 Backtracking with Constraint Propagation
The standard solve algorithm is recursive backtracking: assign a word to a slot, propagate constraints to crossing slots, recurse; on contradiction, undo the assignment and try the next word.
Key optimizations that make this tractable:
MRV (Minimum Remaining Values): Always pick the slot with the fewest remaining candidates. Fail fast rather than exploring large branches that will fail later.
Arc Consistency (AC-3): After assigning a word to a slot, propagate constraints to all crossing slots. Remove candidates that are now incompatible. If any slot's candidate list drops to zero, backtrack immediately without exploring further.
Candidate indexing: Pre-build an index: by_length[n] → all words of length n, and by_pattern["A?P?E"] → [APPLE, AMPLE, ...]. Constraint propagation becomes a set intersection, not a linear scan.
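One way to sketch this indexing scheme: rather than materializing every possible pattern, keep a by-length index plus per-(position, letter) sets and intersect them on demand. The function and variable names here are illustrative assumptions, not a fixed API.

```python
from collections import defaultdict

def build_index(words):
    """Index words by length and by (length, position, letter)."""
    by_length = defaultdict(set)
    by_pos_letter = defaultdict(set)
    for w in words:
        by_length[len(w)].add(w)
        for i, ch in enumerate(w):
            by_pos_letter[(len(w), i, ch)].add(w)
    return by_length, by_pos_letter

def match_pattern(pattern, by_length, by_pos_letter):
    """Resolve a pattern like 'A?P?E' by intersecting per-letter sets."""
    candidates = by_length[len(pattern)]
    for i, ch in enumerate(pattern):
        if ch != "?":
            candidates = candidates & by_pos_letter[(len(pattern), i, ch)]
    return candidates

idx = build_index(["APPLE", "AMPLE", "ANGLE", "OTTER"])
print(sorted(match_pattern("A?P?E", *idx)))  # → ['AMPLE', 'APPLE']
```

Each fixed letter cuts the candidate set with one set intersection, which is exactly the "set intersection, not a linear scan" behavior described above.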
4. Phase 1: Single Machine Solver
The simplest possible design – everything on one server.
Implementation sketch (Python pseudocode):
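A minimal backtracking solver along the lines described in section 3, assuming slots are plain ids, candidates is a slot-to-word-set map, and crossings maps each ordered slot pair to its intersection indices (all illustrative structures, not a canonical schema):

```python
def solve(slots, candidates, crossings, assignment=None):
    """Backtracking CSP solver with MRV slot selection and forward checking.
    crossings: (slot_a, slot_b) -> (index_in_a, index_in_b), both directions."""
    assignment = assignment or {}
    if len(assignment) == len(slots):
        return assignment
    # MRV: expand the unassigned slot with the fewest remaining candidates
    slot = min((s for s in slots if s not in assignment),
               key=lambda s: len(candidates[s]))
    for word in sorted(candidates[slot]):
        pruned, ok = {}, True
        for (a, b), (ia, ib) in crossings.items():
            if a == slot and b not in assignment:
                kept = {w for w in candidates[b] if w[ib] == word[ia]}
                if not kept:        # a crossing slot's domain emptied: dead end
                    ok = False
                    break
                pruned[b] = kept
        if ok:
            result = solve(slots, {**candidates, slot: {word}, **pruned},
                           crossings, {**assignment, slot: word})
            if result is not None:
                return result
    return None                     # every branch failed: backtrack

# Tiny example: two slots crossing at their first letters
cross = {("1A", "1D"): (0, 0), ("1D", "1A"): (0, 0)}
solve(["1A", "1D"], {"1A": {"CAT", "DOG"}, "1D": {"COW", "DIG"}}, cross)
```

Full AC-3 would propagate further than this one-level forward check, but the skeleton – assign, prune, recurse, backtrack – is the same.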
When does Phase 1 work?
Small puzzles (< 20 slots)
Strong constraints (few valid candidates per slot)
Offline batch processing with unlimited time budget
When does Phase 1 fail? See next section.
5. Why One Machine Fails (The Math)
Let's quantify the problem honestly.
The Worst Case: A "Flat" Puzzle
Imagine a puzzle where the first slot has 10,000 candidates (a common 5-letter slot), the second has 8,000, and so on – with weak constraints between slots (they barely cross). The raw search tree multiplies these branching factors: the first ten weakly-constrained slots alone yield on the order of 10,000^10 ≈ 10^40 combinations.
Even with aggressive pruning, worst-case estimates for a hard 100-slot puzzle land around 10^15 nodes.
CPU Bound vs. Memory Bound
Search tree too deep: backtracking takes hours on one machine → split the tree across workers
Memory for candidate state: 100 slots × 10K candidates ≈ 50MB → each worker holds only its own subtree
CPU for constraint propagation: one core doing repeated intersection ops → N workers propagate in parallel
No early termination: one machine must explore all branches → workers can signal "solution found" and cancel peers
The Tipping Point
A well-optimized single-threaded solver can handle ~10M node evaluations per second.
A hard 100-slot puzzle might require exploring 10^15 nodes in the worst case (even with pruning).
At 10M nodes/sec, 10^15 nodes takes 10^8 seconds – roughly 3 years on one machine. With 1,000 parallel workers, that's 3 years / 1,000 ≈ 26 hours. Still bad. The key insight: we need smarter decomposition, not just brute-force parallelism.
6. Phase 2: Distributed Architecture
The key architectural insight: turn the recursive search tree into a work queue.
Instead of one worker recursively exploring everything, we:
Start with an initial "empty assignment" work item in the queue
Workers pull items, explore a few levels, then split promising branches back into the queue
When any worker finds a solution, it broadcasts "done" and all other workers for this puzzle stop
High-Level Architecture
The Work Item: The Key Data Structure
Every work item encodes a snapshot of solver state – enough for any stateless worker to resume:
Why encode full constraint state? Workers are stateless. They need everything to resume. Sending just the partial assignment would force each worker to re-run constraint propagation from scratch – expensive and wasteful. Encoding the already-pruned candidate lists means workers can jump straight to exploring, not re-deriving.
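A sketch of such a work item as a serializable record; the field names (puzzle_id, depth, candidates, priority) are illustrative assumptions, not a canonical schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class WorkItem:
    """One resumable snapshot of solver state."""
    puzzle_id: str
    depth: int             # DFS levels already explored from the root
    assignment: dict       # slot id -> chosen word
    candidates: dict       # slot id -> remaining, already-pruned word lists
    priority: float = 0.0  # e.g. negative smallest domain size (MRV-guided)

    def encode(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def decode(cls, raw: str) -> "WorkItem":
        return cls(**json.loads(raw))

item = WorkItem("job-42", 2, {"1A": "APPLE"}, {"7D": ["AMPLE", "ANGLE"]})
assert WorkItem.decode(item.encode()) == item   # lossless round-trip
```

Because the record is self-contained, any worker that pulls it can resume exploration without asking anyone else for state.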
Work Splitting: The Parallelism Engine
The rule: a worker explores at most SPLIT_THRESHOLD levels of DFS, then splits remaining branches back to the queue. This balances:
Worker efficiency (don't split so aggressively that each item is 1 node)
Parallelism (don't let one worker hold all the work)
A good SPLIT_THRESHOLD is 3–5 levels – enough to propagate constraints meaningfully before splitting.
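The splitting rule can be sketched as a bounded local DFS; expand, is_solution, and enqueue are assumed callbacks standing in for the real solver and queue client:

```python
def run_item(item, expand, is_solution, enqueue, split_threshold=4):
    """Explore DFS locally for up to split_threshold levels below the item's
    depth; deeper frontiers are pushed back to the queue for other workers.
    item = (depth, state)."""
    base_depth, root = item
    stack = [(base_depth, root)]
    while stack:
        depth, state = stack.pop()
        if is_solution(state):
            return state                    # real system: broadcast "done"
        if depth - base_depth >= split_threshold:
            enqueue((depth, state))         # split: re-queue this branch
            continue
        stack.extend((depth + 1, c) for c in expand(state))
    return None

# Toy search over binary strings, looking for "101"
expand = lambda s: [s + "0", s + "1"] if len(s) < 3 else []
found = run_item((0, ""), expand, lambda s: s == "101", lambda _: None)
# found == "101"
```

With split_threshold=1 the same function degenerates into pure BFS over the queue; with a very large threshold it degenerates into single-worker DFS, matching the trade-off discussed above.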
7. Core Component Deep Dives
7.1 Dictionary Service
The dictionary is read-heavy and changes infrequently. Design for fast candidate lookup.
Data structures:
Trie (prefix tree): Supports pattern matching. For A??LE, walk the trie, treating each wildcard as "explore all children".
By-length index: words_by_length[n] → list[str]. Used to seed initial candidate lists.
Pattern bitmap index: For each (position, letter) pair, a bitset of matching words. Constraint propagation = bitwise AND.
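The bitmap index can be sketched with Python ints as arbitrary-width bitsets (bit i stands for the i-th word in a per-length word list; the helper names are illustrative):

```python
from collections import defaultdict

def build_bitmaps(words):
    """For each (position, letter), an int bitmask over the word list.
    Bit i set means words[i] has that letter at that position."""
    masks = defaultdict(int)
    for i, w in enumerate(words):
        for pos, ch in enumerate(w):
            masks[(pos, ch)] |= 1 << i
    return masks

def candidates_for(pattern, words, masks):
    """Constraint propagation as bitwise AND across fixed positions."""
    result = (1 << len(words)) - 1       # all words initially viable
    for pos, ch in enumerate(pattern):
        if ch != "?":
            result &= masks[(pos, ch)]
    return [w for i, w in enumerate(words) if result >> i & 1]

words = ["APPLE", "AMPLE", "ANGLE"]
masks = build_bitmaps(words)
print(candidates_for("A?P?E", words, masks))  # → ['APPLE', 'AMPLE']
```

Each AND over a machine-word-packed bitset touches 64 candidates at once, which is why this representation beats per-word scans for propagation-heavy workloads.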
Cache strategy: The entire 1M-word dictionary fits in ~8MB of RAM. Load everything into memory at service startup. No per-request DB hits. Dictionary updates invalidate the in-memory cache and reload asynchronously (a brief window of stale candidates is acceptable – worst case, a word appears that wasn't in the old dictionary, or vice versa).
7.2 Coordinator Service
The coordinator is the brains of the distributed solve. It manages puzzle lifecycle.
Responsibilities:
Create puzzle job, assign jobId
Parse slots, fetch initial candidate lists from Dictionary Service
Push initial work item to queue
Monitor progress via heartbeats from workers
Detect completion (worker published a solution)
Cancel all remaining work items for a completed puzzle
Handle timeout: if no solution is found within MAX_SOLVE_TIME, mark the job as failed/partial
The coordinator is NOT in the critical path of workers – workers pull from the queue and push results directly. The coordinator only listens to result events. This keeps coordinator load low and avoids it becoming a bottleneck.
7.3 Solver Workers
Workers are the stateless computational workhorses.
Worker lifecycle: pull an item → check for cancellation → explore the subtree → split branches or publish a solution → acknowledge the item.
Worker pseudocode:
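A sketch of that lifecycle over an in-memory stand-in for the Redis-backed queue; the class and callback names are illustrative assumptions, not a fixed interface:

```python
import collections

class InMemoryQueue:
    """Stand-in for the Redis-backed work queue, for illustration only."""
    def __init__(self):
        self.items = collections.deque()
        self.in_flight = set()
    def push(self, item):
        self.items.append(item)
    def lease(self):
        if not self.items:
            return None
        item = self.items.popleft()
        self.in_flight.add(id(item))     # real impl: a key with TTL = lease
        return item
    def ack(self, item):
        self.in_flight.discard(id(item))

def worker_loop(queue, publish, is_cancelled, process_item):
    """Stateless worker: lease -> check cancel -> process -> split or publish."""
    while True:
        item = queue.lease()
        if item is None:
            return                        # production: block with backoff instead
        puzzle_id, state = item
        if is_cancelled(puzzle_id):       # puzzle already solved or cancelled
            queue.ack(item)
            continue
        solution, children = process_item(state)
        if solution is not None:
            publish(puzzle_id, solution)  # result pub/sub cancels the peers
        for child in children:
            queue.push((puzzle_id, child))
        queue.ack(item)
```

The cancel check before processing is what lets a "done" broadcast stop orphaned work without the coordinator ever touching individual workers.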
7.4 Work Queue
The work queue is the spine of the distributed system.
At-least-once delivery: workers must not drop items – Redis BRPOPLPUSH or Kafka consumer groups
Leasing: a work item is locked to one worker at a time – Redis key expiry as the lease timeout
Priority: more constrained items (smaller search space) first – Redis sorted set (ZADD + ZPOPMAX)
Puzzle cancellation: stop all items for a puzzle efficiently – Redis pub/sub cancel signal; workers check a "cancelled" key
Visibility timeout: if a worker dies, its item re-appears – SQS visibility timeout, or manual re-queue after TTL
Why Redis over Kafka for work queue? Work items need random deletion (cancel all items for puzzle P), priority ordering, and lease management. These are natural for Redis sorted sets + hashes. Kafka is better for ordered event streams, not work-stealing queues.
8. The Scaling Journey
Stage 1: Monolith (0–100 req/day)
Everything in a single process. Good for prototyping.
Limit: CPU-bound for hard puzzles. One slow solve blocks all others.
Stage 2: Separate Workers (100–1,000 req/day)
Decouple API from compute. Workers run in separate processes.
Limit: Dictionary loaded per-worker (wasteful), no smart work-splitting yet, one Redis is a SPOF.
Stage 3: Smart Splitting + Shared Dictionary (1K–10K req/day)
Add the work-splitting algorithm, dedicated dictionary service, and result streaming.
New capabilities at this stage:
Workers split work items at depth == 3, pushing siblings back to the queue
Result store uses pub/sub – coordinator and API are notified immediately on solve
Auto-scaling: worker count scales with queue depth (Kubernetes HPA)
Dictionary service serves candidates via gRPC; workers cache locally per-slot per-puzzle
Stage 4: Enterprise Scale (10K+ req/day, Multi-Tenant SaaS)
Full enterprise architecture with multi-region, tenant isolation, and observability.
Enterprise additions:
Multi-region workers: Hard puzzles can overflow work to a second region's workers
Tenant isolation: Each tenant's work items carry a
tenantId; high-priority tenants get a dedicated queue partitionSpot VMs: Workers are stateless β perfect for spot/preemptible instances (70% cost savings)
Leader election: Two coordinator replicas, one leader (Zookeeper / etcd-based), zero SPOF
9. Failure Modes & Resilience
Request Flow with Failure Handling
Failure Scenarios and Mitigations
Worker pod crashes – detected when the lease TTL expires (30s); the work item is re-queued automatically
Coordinator crashes – detected by the K8s liveness probe; the leader is re-elected from a replica and jobs are reloaded from PostgreSQL
Redis OOM – detected by memory alerts and eviction policy; mitigated with a bounded queue per puzzle and backpressure to the API (429)
Dictionary service down – detected when worker health checks fail; workers fall back to a local in-process cache (stale but functional)
Runaway puzzle (unsolvable) – detected by a max TTL on the job (e.g., 120s); the coordinator marks it TIMEOUT and returns the best partial solution
Duplicate work items – caught by idempotency keys on result publish; the result store deduplicates with SET NX
Wrong solution published – caught by a solution validator service, which re-validates against all constraints before writing to the result store
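The SET NX deduplication above boils down to first-writer-wins; a sketch over a plain dict standing in for Redis (store and the key format are illustrative):

```python
def publish_result(store, puzzle_id, solution):
    """Idempotent result publish, mirroring Redis SET NX semantics."""
    key = f"result:{puzzle_id}"
    if key in store:          # SET NX fails: a racing worker already published
        return False
    store[key] = solution
    return True

store = {}
publish_result(store, "job-42", {"1A": "APPLE"})   # first writer wins
publish_result(store, "job-42", {"1A": "AMPLE"})   # duplicate is dropped
```

Since both racing workers hold valid (validated) solutions, dropping the second is safe and keeps results deterministic per job.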
Work Item Re-queue Logic (Critical)
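A sketch of the re-queue scan, assuming a lease table mapping item ids to (item, lease-start-time); in the Redis design this would typically be a sorted set scored by lease expiry, but the logic is the same:

```python
import time

def requeue_expired(leases, queue, now=None, lease_ttl=30.0):
    """Re-queue items whose worker went silent past the lease TTL.
    leases: item_id -> (item, leased_at). At-least-once semantics: an item
    may be processed twice, which is why result publish is idempotent."""
    now = time.monotonic() if now is None else now
    expired = [iid for iid, (_, t) in leases.items() if now - t > lease_ttl]
    for iid in expired:
        item, _ = leases.pop(iid)
        queue.append(item)      # the item becomes visible to other workers
    return len(expired)

leases = {"a": ("item-a", 0.0), "b": ("item-b", 95.0)}
queue = []
requeue_expired(leases, queue, now=100.0, lease_ttl=30.0)
# "a" expired (100 - 0 > 30) and is re-queued; "b" is still fresh
```

This is the mechanism that makes worker crashes a non-event: nothing is lost, only delayed by one lease TTL.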
10. Data Model & Storage
Job Store (PostgreSQL)
Work Item (Redis β ephemeral)
Dictionary Store (PostgreSQL + in-memory)
Cache warm-up at Dictionary Service startup:
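A sketch of that warm-up, where load_words stands in for a SELECT over the PostgreSQL word table (an assumed accessor), and reload implements the invalidate-and-swap behavior described in section 7.1:

```python
from collections import defaultdict

class DictionaryService:
    """In-memory dictionary cache, warmed once at startup."""
    def __init__(self, load_words):
        self._load_words = load_words
        self.words_by_length = {}
        self.by_pos_letter = {}
        self.reload()

    def reload(self):
        """Rebuild indexes from scratch, then swap atomically. A dictionary
        update triggers this asynchronously; readers see the old or the new
        index, never a mix."""
        by_len = defaultdict(list)
        by_pos = defaultdict(set)
        for w in self._load_words():
            by_len[len(w)].append(w)
            for i, ch in enumerate(w):
                by_pos[(len(w), i, ch)].add(w)
        self.words_by_length, self.by_pos_letter = dict(by_len), dict(by_pos)

svc = DictionaryService(lambda: ["APPLE", "AMPLE", "CAT"])
```

Building into fresh dicts and swapping the references last is what keeps the brief staleness window safe for concurrent readers.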
11. Observability & Operations
Key Metrics
Work Queue Health:
queue_depth{puzzle_id} – items waiting; alert if growing unbounded
lease_age_p99 – p99 age of in-flight items; high means workers are slow
requeue_rate – items re-queued after lease expiry; high means worker instability
Solver Performance:
solve_latency_seconds{status} – p50/p99 by outcome (SOLVED, UNSOLVABLE, TIMEOUT)
dead_end_rate – fraction of work items ending in dead ends; high means a poor splitting heuristic
nodes_per_second{worker_id} – throughput per worker
Business Metrics:
jobs_per_minute{tenant} – tenant usage tracking
unsolvable_rate – fraction of puzzles with no solution (data quality issue?)
partial_solution_rate – puzzles solved only partially (quality signal)
Distributed Tracing
Every work item carries a traceId (same as the jobId). When workers pull items, they propagate the trace context. This lets you see the full tree of work items for a puzzle in Jaeger/Zipkin.
12. Design Trade-offs
Split Threshold: Depth vs. Breadth
SPLIT_THRESHOLD = 1 (pure BFS): every node becomes a work item – maximum parallelism, but massive queue overhead and no benefit from local constraint propagation
SPLIT_THRESHOLD = 3–5: balanced – the worker does meaningful local work and propagation before splitting
SPLIT_THRESHOLD = ∞ (pure DFS): one worker does everything – no parallelism; a single worker holds all the work
Recommended: SPLIT_THRESHOLD = 4. Allows 3 rounds of AC-3 constraint propagation before splitting, dramatically reducing the size of sub-trees pushed to the queue.
Work Item Size
As partialAssignment grows, work items grow (encoding candidate lists for remaining slots). At depth 50 of 100 slots, a work item could be ~25KB.
Option A: Encode full constraint state (default – no re-computation)
Option B: Encode only the partial assignment; workers re-derive constraints from scratch
Pro: smaller items, simpler code
Con: ~O(n²) wasted AC-3 re-computation per item
Recommended: Option A for puzzles < 100 slots. For larger puzzles, compress with zstd (5–10× ratio on repetitive word-list data).
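A sketch of the compress-on-enqueue path; zlib is used here as a stdlib stand-in for zstd (zstd itself needs the third-party zstandard package), and the high ratio comes from the same property either way – candidate lists are extremely repetitive:

```python
import json
import zlib

def encode_item(item: dict) -> bytes:
    """Serialize then compress a work item before enqueueing."""
    return zlib.compress(json.dumps(item).encode(), level=6)

def decode_item(blob: bytes) -> dict:
    """Inverse: decompress then parse on the worker side."""
    return json.loads(zlib.decompress(blob))

item = {"assignment": {"1A": "APPLE"},
        "candidates": {"7D": ["AMPLE"] * 500}}   # repetitive, like real domains
blob = encode_item(item)
raw = len(json.dumps(item))
assert decode_item(blob) == item
print(f"{raw} bytes raw -> {len(blob)} bytes compressed")
```

The trade is a few microseconds of CPU per item against kilobytes of queue memory per item, which favors compression once items grow past a few KB.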
Queue Backend: Redis vs. Kafka vs. SQS
Redis sorted set: microsecond enqueue/dequeue, priority, custom TTL, pub/sub – but memory-bound with no persistence guarantees. Default choice: fast and flexible.
Kafka: durable, replayable, excellent monitoring – but no built-in priority and complex consumer-group management. Use when durability of in-flight work matters more than speed.
AWS SQS: fully managed, built-in visibility timeout, FIFO available – but no priority, a 256KB message limit, and ~10ms latency. Good for cloud-native teams with small work items.
Recommended: Redis sorted set for latency-sensitive use (< 5ms enqueue), SQS for durability-first teams.
When to Give Up: Timeout Strategy
Some crossword configurations have no solution (dictionary is missing a required word). The system must not run forever.
Summary: Interview Cheat Sheet
If asked to design this in an interview setting, structure your answer:
Clarify: ~50×50 grid, 100 slots, 1M-word dictionary, need first valid solution
Algorithm first: CSP backtracking with MRV + AC-3. This is the engine.
Prove single machine fails: Branching factor Γ depth = exponential. Hard puzzles = hours.
The insight: Turn recursive DFS into a work queue. Each work item = subtree root.
Work splitting: Workers explore N levels then push siblings back to queue.
Components: API → Coordinator → Work Queue (Redis) → Workers → Result Store
Failure handling: Lease TTL re-queues dead worker items. Cancel key stops orphaned work.
Scale levers: Worker pod count, split threshold, priority queue for MRV-guided work.
Trade-off: Statefulness of work items (encode constraint state vs. recompute).
Don't forget: Solution validation, dictionary caching, observability.
The crossword solver is a beautiful example of how a well-understood algorithm (backtracking CSP) becomes a distributed systems challenge at scale. The algorithm itself is not the hard part – decomposing the search tree into independently executable work items, managing their lifecycle, and handling failures gracefully is where the real engineering lives.
Written as a reference for staff-level system design interviews. The patterns here – work-queue-based search, stateless workers with lease management, and priority-driven decomposition – apply broadly to any distributed search or optimization problem.