System Design: Designing Chess.com
Real-Time Online Chess Platform — An Evolutionary Architecture from Two Players to Millions of Concurrent Games
Table of Contents
1. Requirements & Scope
Functional Requirements
Matchmaking: Match players of similar skill (Glicko-2 rating) within seconds
Game Play: Real-time, turn-based chess with full rule enforcement (castling, en passant, promotion, check, checkmate, stalemate, draw conditions)
Time Controls: Bullet (1+0), Blitz (3+0, 5+0), Rapid (10+0, 15+10), Classical — each player has independent countdown clocks with optional increment
Reconnection: Players can rejoin a game after temporary disconnection
Game History: All games persisted in PGN format for replay and analysis
Spectating: Any game can be watched in real-time by thousands of spectators
Leaderboard: Global and time-control-specific rankings
Non-Functional Requirements
Move latency
< 200ms p99
Competitive players notice > 300ms
Clock accuracy
± 50ms server-side
Bullet games have 1-second decisions
Concurrent games
500K+ simultaneous
Chess.com peak during major events
Connected users
1M+ WebSocket connections
Roughly 2× the number of concurrent games (2 players each), plus spectators
Availability
99.95% (< 26 min downtime/month)
Games in progress cannot be interrupted
Consistency
Strong consistency per game
Both players MUST see same board state
Partition tolerance
Graceful degradation
Prefer availability with per-game consistency
Scale Estimation (Back-of-Envelope)
These numbers frame our entire scaling journey. We won't need to handle 500K concurrent games on day one — but every architectural choice we make should have a path toward it. Let's start from the very beginning.
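The arithmetic behind those targets can be sketched directly. The per-game move rate and per-connection memory figures below are illustrative assumptions for the estimate, not measured values:

```javascript
// Back-of-envelope scale estimation (illustrative assumptions).
const concurrentGames = 500_000;          // peak target from requirements
const playersPerGame = 2;
const movesPerGamePerSec = 0.2;           // ~1 move every 5 seconds, averaged over time controls

const playerConnections = concurrentGames * playersPerGame; // 1,000,000 WebSockets
const movesPerSec = concurrentGames * movesPerGamePerSec;   // 100,000 moves/sec platform-wide

// WebSocket memory at ~30KB per connection (TCP buffers + TLS + session state)
const gatewayMemoryGB = (playerConnections * 30 * 1024) / 1024 ** 3; // ~28.6 GB

// Hot game state in Redis at ~5KB per active game (FEN + move list + clocks)
const redisMemoryGB = (concurrentGames * 5 * 1024) / 1024 ** 3; // ~2.4 GB

console.log({ playerConnections, movesPerSec, gatewayMemoryGB, redisMemoryGB });
```

Even at peak, the aggregate move rate (~100K/sec) is modest; the dominant costs are connection memory and fan-out, which is what the later stages optimize for.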
2. Stage 1: Single Server — "Two Friends Playing Chess"
Target: ~10 concurrent games, proof of concept
At this stage, we're building the simplest thing that could possibly work. One server, one process, everything in memory.
Architecture
Stack:
Single Node.js/Express server
socket.io for WebSocket connections
chess.js for move validation
PostgreSQL for persisting completed games
Map<gameId, GameState> in-memory for active games
Game State Data Structure (in-memory)
Why FEN + Move List (not just FEN)?
FEN alone captures current position but not history
Move list enables: threefold repetition detection, 50-move rule, game replay
This is essentially event sourcing — the move list IS the event log, FEN is the materialized view
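A minimal sketch of that in-memory structure, assuming illustrative field names (this is not a real codebase's schema):

```javascript
// Sketch of the Stage 1 in-memory game registry. Field names are illustrative.
const START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1";
const activeGames = new Map(); // gameId -> GameState

function createGame(gameId, whiteId, blackId, baseMs, incrementMs) {
  const game = {
    gameId,
    players: { white: whiteId, black: blackId },
    fen: START_FEN,           // materialized view of the move list
    moves: [],                // event log: [{ san, playedAt }]
    clocks: { whiteMs: baseMs, blackMs: baseMs, incrementMs },
    lastMoveAt: Date.now(),   // snapshot time for lazy clock evaluation
    status: "active",         // active | checkmate | stalemate | draw | timeout
  };
  activeGames.set(gameId, game);
  return game;
}

const g = createGame("g1", "alice", "bob", 180_000, 0); // 3+0 blitz
console.log(g.fen.split(" ")[1]); // side to move: "w"
```

Note that both the event log (`moves`) and the materialized view (`fen`) live side by side, which is the hybrid discussed later in the trade-offs section.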
Chess Move Validation Engine
The server-side chess engine must handle all rules:
Basic moves: Each piece type has specific movement patterns
Captures: Including en passant (requires knowing the previous move)
Castling: Requires tracking whether king/rooks have moved (FEN castling flags)
Pawn promotion: Must specify promotion piece
Check/Checkmate/Stalemate detection: After each move, compute all legal moves for the opponent. Zero legal moves + in check = checkmate. Zero legal moves + not in check = stalemate.
Draw conditions:
Threefold repetition (same position 3 times — requires full move history)
50-move rule (50 moves without pawn move or capture — tracked in FEN halfmove clock)
Insufficient material (K vs K, K+B vs K, K+N vs K)
Draw by agreement (both players agree)
Implementation Choice: Library vs. Custom
Use chess.js (Node) or python-chess
Battle-tested, handles all edge cases
Language-locked, dependency risk
Custom engine in Go/Rust
Full control, optimized for server
Months of work, bug-prone
Recommended: Wrap proven library
Best of both: correctness + performance tuning
Slight abstraction overhead
For production: wrap chess.js or python-chess in a service, with bitboard optimizations for hot paths (legal move generation).
Move Validation Pipeline
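The pipeline can be sketched as follows. In production the `engine` would wrap a proven library (e.g., chess.js); here a trivial stub is injected so the sketch stays self-contained, and all names are illustrative:

```javascript
// Sketch of the server-side move validation pipeline (Stage 1).
function makeMoveHandler(engine, games, broadcast) {
  return (gameId, playerId, move) => {
    const game = games.get(gameId);
    if (!game || game.status !== "active") return { ok: false, error: "no_active_game" };

    // 1. Turn check: the second FEN field is the side to move ("w" or "b")
    const sideToMove = game.fen.split(" ")[1];
    const playerSide = game.players.white === playerId ? "w" : "b";
    if (sideToMove !== playerSide) return { ok: false, error: "not_your_turn" };

    // 2. Full rule validation is delegated to the engine
    const result = engine.tryMove(game.fen, move);
    if (!result.legal) return { ok: false, error: "illegal_move" };

    // 3. Apply: append to the event log, refresh the materialized view
    game.moves.push({ move, playedAt: Date.now() });
    game.fen = result.nextFen;
    if (result.gameOver) game.status = result.outcome;

    // 4. Broadcast the new state to both players (and any spectators)
    broadcast(gameId, { type: "move", move, fen: game.fen });
    return { ok: true };
  };
}

// Stub engine for the sketch: accepts any move, just flips the side to move.
const stubEngine = {
  tryMove: (fen, _m) => ({ legal: true, nextFen: fen.replace(" w ", " b "), gameOver: false }),
};
const sent = [];
const games = new Map([["g1", {
  status: "active", fen: "start w KQkq - 0 1",
  players: { white: "alice", black: "bob" }, moves: [],
}]]);
const handleMove = makeMoveHandler(stubEngine, games, (id, msg) => sent.push(msg));
console.log(handleMove("g1", "alice", "e4").ok);    // true
console.log(handleMove("g1", "alice", "d4").error); // not_your_turn
```

The ordering matters: cheap checks (game exists, correct turn) run before the comparatively expensive engine validation.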
Server-Authoritative Clocks (Even at Stage 1)
Even at this stage, the clock must be server-authoritative. Clients cannot be trusted.
Why Server-Authoritative (not client clocks)?
Clients can be manipulated (hacked clocks, artificial lag)
Network latency varies between players — one player shouldn't be penalized
A tampered client could claim "I still had 5 seconds"
How Server-Authoritative Clocks Work:
Clock is NEVER "running" on the server. It's a "last snapshot + elapsed" calculation. This is the lazy evaluation pattern — and it's critical even at small scale because it means we never need background timers per game.
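A sketch of the "last snapshot + elapsed" calculation, using the illustrative game structure from above:

```javascript
// Lazy clock evaluation: the stored clock is a snapshot taken at the last
// move; remaining time is derived on demand. No per-game background timer.
function remainingClocks(game, now = Date.now()) {
  const sideToMove = game.fen.split(" ")[1]; // only this side is burning time
  const elapsed = now - game.lastMoveAt;
  return {
    whiteMs: game.clocks.whiteMs - (sideToMove === "w" ? elapsed : 0),
    blackMs: game.clocks.blackMs - (sideToMove === "b" ? elapsed : 0),
  };
}

// When a valid move arrives: materialize the charge and advance the snapshot.
function chargeMover(game, now = Date.now()) {
  const clocks = remainingClocks(game, now);
  const mover = game.fen.split(" ")[1];
  if ((mover === "w" ? clocks.whiteMs : clocks.blackMs) <= 0) {
    return { flagged: mover }; // the mover's time expired before the move landed
  }
  game.clocks = { ...game.clocks, ...clocks };
  game.lastMoveAt = now;
  return { flagged: null };
}

const game = { fen: "start w KQkq - 0 1", clocks: { whiteMs: 60_000, blackMs: 60_000 }, lastMoveAt: 0 };
chargeMover(game, 4_000);         // White thought for 4 seconds
console.log(game.clocks.whiteMs); // 56000
```

The server only ever does this computation when a move arrives (or when the timeout sweep described later runs), so idle games cost nothing.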
What Works at Stage 1
Simple, fast to build
In-memory state = lowest possible latency
Single process = no coordination needed
Perfect for prototyping and validating game logic
What Breaks — Why We Need Stage 2
Server crash = all games lost (in-memory state is gone)
Single machine limits: ~10K WebSocket connections
No redundancy: one server goes down, the entire platform is offline
No horizontal scaling: can't add more servers to handle more players
Once you have more than a handful of concurrent players, you need to survive a server restart without losing everyone's games. That leads us to Stage 2.
3. Stage 2: Shared State — "Surviving Server Crashes"
Target: ~1K concurrent games
The Problem
In Stage 1, game state lives in a Node.js process's memory. If that process crashes — or you need to deploy a new version — every active game is destroyed. You also can't run multiple servers because two players in the same game might connect to different servers, and neither server has the other's game state.
The Solution: Redis as External State Store
Move game state out of server memory and into Redis. Servers become stateless processors that read from and write to Redis. Multiple servers can now share the same game state.
What Changed from Stage 1
Game state
In-memory Map
Redis (external)
Server count
1
2-3 behind load balancer
Load balancing
None
Nginx with sticky sessions (IP hash)
Cross-server messaging
N/A
Redis Pub/Sub
Crash recovery
Games lost
Games survive (state in Redis)
Why Redis?
When two players are on different servers, both servers need access to the same game state. Redis gives us:
Sub-millisecond reads: Game state access in < 1ms
Atomic operations: MULTI/EXEC for safe concurrent updates
Pub/Sub: Cross-server message routing for broadcasting moves
Move Validation Pipeline (Updated for Redis)
Event Sourcing: A Natural Fit for Chess
Chess is inherently event-sourced. The sequence of moves IS the complete history. The current board position (FEN) is a derived materialized view of the move list.
Benefits of event sourcing at this stage:
Complete audit trail: Every move with timestamps — essential for anti-cheat
Game replay: Play back any game move by move
Crash recovery: Replay events to reconstruct state on any server
Spectator catch-up: New spectator can receive the event log to sync up
Analysis: Post-game engine analysis per move
Performance Optimization: Snapshots
For long games (60+ moves), replaying from scratch is expensive. Use periodic snapshots:
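The snapshot mechanism can be sketched as follows; the store layout and the `replayMove` function (standing in for the chess engine's move application) are illustrative:

```javascript
// Snapshot + replay: persist a snapshot every N moves; reconstruction loads
// the latest snapshot and replays only the events recorded after it.
const SNAPSHOT_EVERY = 20;

function maybeSnapshot(store, gameId, moveNumber, fen) {
  if (moveNumber % SNAPSHOT_EVERY === 0) {
    store.snapshots.set(gameId, { moveNumber, fen });
  }
}

function reconstruct(store, gameId, startFen, replayMove) {
  const events = store.events.get(gameId) ?? [];
  const snap = store.snapshots.get(gameId); // undefined -> replay from the start
  let state = snap ? snap.fen : startFen;
  for (const ev of events.slice(snap ? snap.moveNumber : 0)) {
    state = replayMove(state, ev);
  }
  return state;
}

// Toy replay function: concatenates moves so the effect is visible.
const store = {
  events: new Map([["g1", ["e4", "e5", "Nf3"]]]),
  snapshots: new Map([["g1", { moveNumber: 2, fen: "SNAP@2" }]]),
};
console.log(reconstruct(store, "g1", "START", (s, m) => `${s}|${m}`)); // SNAP@2|Nf3
```

With a snapshot every 20 moves, recovery of even a 100-move game replays at most 19 events instead of the full log.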
Redis Data Model
Sticky Sessions for WebSocket
WebSocket is a persistent connection. If Player A's WebSocket connects to Server 1, that connection MUST stay on Server 1 for the lifetime of the game. IP hash ensures this:
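The load-balancer side can be sketched in Nginx; the upstream addresses and the /ws path are illustrative, not a production config:

```nginx
upstream game_servers {
    ip_hash;                   # same client IP always maps to the same backend
    server 10.0.0.1:3000;
    server 10.0.0.2:3000;
}

server {
    listen 443 ssl;
    location /ws {
        proxy_pass http://game_servers;
        proxy_http_version 1.1;                  # required for WebSocket upgrade
        proxy_set_header Upgrade $http_upgrade;  # pass the upgrade handshake through
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 3600s;                # keep long-lived sockets open
    }
}
```

IP hash is the simplest stickiness mechanism; cookie-based stickiness is a common alternative when many players share a NAT'd IP.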
Cross-Server Message Routing via Redis Pub/Sub
When White (on Server 1) makes a move, Black (on Server 2) must receive it:
What Works at Stage 2
Crash resilience: Server crashes don't lose game state
Horizontal scaling: 2-3 servers behind a load balancer
~1K concurrent games: Comfortably handled
Event sourcing: Full game history for replay and recovery
What Breaks — Why We Need Stage 3
Single Redis instance: One Redis server is a single point of failure and a scaling bottleneck
Monolith coupling: The game server handles everything — WebSocket connections, move validation, matchmaking, spectating. These have very different scaling characteristics:
WebSocket connections are memory-bound (~20-50KB per connection)
Move validation is CPU-bound
Matchmaking is I/O-bound
Clock fairness at scale: With thousands of games, we need a systematic approach to timeout detection — not per-game timers
Deployment friction: Updating matchmaking logic requires redeploying the entire server
4. Stage 3: Separation of Concerns — "Real-Time at Scale"
Target: ~50K concurrent games
The Problem
The monolith server does everything: WebSocket management, move validation, matchmaking, spectator broadcasting, and timer management. At 50K concurrent games with 100K+ WebSocket connections, these responsibilities have fundamentally different scaling needs:
WebSocket connections
Memory (20-50KB each)
RAM
Move validation
CPU (chess engine)
Compute
Matchmaking
I/O (Redis queries)
Queue throughput
Spectator broadcast
Bandwidth (fan-out)
Network
You can't efficiently scale a monolith when one part needs more memory while another needs more CPU.
The Solution: Microservices Decomposition
Split the monolith into specialized services, each scaled independently.

Layer Breakdown
Edge Layer — CDN (CloudFlare) for static assets (board images, piece sprites, JS bundles). GeoDNS routes players to the nearest regional cluster.
Gateway Layer — Two distinct gateways:
API Gateway: Handles REST requests (auth, matchmaking initiation, game history queries). Stateless, horizontally scalable.
WebSocket Gateway: Maintains persistent connections. Stateful by nature (holds TCP connections), so requires sticky sessions and special scaling patterns.
Core Services Layer — Each service owns a single domain:
Game Service: The brain — validates moves, manages game state machine, enforces rules
Matchmaking Service: Pairs players by rating, time control, and fairness criteria
Timer Service: Server-authoritative clock countdown (critical for fairness)
Spectator Service: Fan-out engine for broadcasting to watchers
Data Layer — Polyglot persistence:
Redis Cluster: Active game state (hot path), WebSocket session registry, matchmaking queues, leaderboard sorted sets
PostgreSQL: Users, ratings, completed game records (cold path, durable)
Kafka: Event bus for decoupled communication — rating updates, analytics, anti-cheat signals
Why This Layering?
The key architectural insight for chess (vs. FPS games) is that each game is independently stateful but low-throughput. A chess game produces ~0.2 moves/second on average. The challenge isn't per-game throughput — it's the sheer number of simultaneous independent games with strong per-game consistency requirements.
This means:
Games can be trivially sharded by gameId — no cross-game state dependencies
Each game server can manage hundreds of concurrent games
Horizontal scaling is straightforward: more game servers = more games
Deep Dive: WebSocket Gateway Separation

Each WebSocket connection consumes ~20-50KB of memory (TCP buffers, TLS state, application-level session data). At 1M connections, that's roughly 20-50GB of RAM for connection state alone, which is why the gateway tier is sized by memory rather than CPU.
Why WebSockets (not HTTP polling / SSE)?
HTTP Polling
Client→Server
High (headers per request)
1-5s (poll interval)
Too slow for chess
Long Polling
Client→Server (held open)
Moderate
~500ms
Acceptable but wasteful
SSE
Server→Client only
Low
Real-time
Can't send moves upstream
WebSocket
Bidirectional
Minimal (2-byte frame header)
Real-time
Perfect for chess
Chess requires bidirectional real-time — both players send moves AND receive opponent's moves. WebSocket is the clear winner.
Key design: Separate WS Gateways from Game Servers
This separation means:
A WS gateway crash only drops connections (clients reconnect)
A game server crash doesn't drop any connections
Can scale each independently
Deep Dive: Cross-Gateway Message Routing (Redis Pub/Sub)
Both players in a game might be on DIFFERENT WS gateways. When White moves, the message must reach Black on a different gateway.
Solution: Redis Pub/Sub per game
At 500K games, that's 500K Redis channels. Redis handles this well — each channel is a linked list of subscribers. Memory cost is ~100 bytes per subscription.
Connection Lifecycle
Deep Dive: Server-Authoritative Clocks at Scale

At 50K concurrent games, the "lazy evaluation" pattern from Stage 1 becomes essential — and we add a timeout sweep for detecting time-outs without running individual timers.
Why Not Individual Timers?
At 500K concurrent games, if each game has an active timer, that's 500K setTimeout() calls. Even if each fires only when time expires:
JavaScript's event loop handles timers in a min-heap, but 500K entries cause GC pressure
Timer resolution in Node.js is ~1ms, but accuracy degrades under load
Each game server might handle 5K-10K games — still manageable with careful design
The "Lazy Evaluation" Pattern:
Instead of running a timer for every game, compute time remaining on-demand:
But how do we detect timeouts if no one moves?
If a player's clock runs out and they simply don't move, the server must still end the game. This requires the Timeout Sweep:
On every move, update the sorted set:
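The sweep can be sketched as follows. A plain Map of absolute deadlines stands in for the Redis sorted set (ZADD on every move, ZRANGEBYSCORE in the sweeper); names are illustrative:

```javascript
// Timeout sweep: one sorted structure of flag deadlines instead of 500K timers.
const deadlines = new Map(); // gameId -> epoch ms at which the side to move flags

function onMoveApplied(gameId, opponentRemainingMs, now = Date.now()) {
  // After a move, the opponent's clock starts; their flag time is the deadline.
  deadlines.set(gameId, now + opponentRemainingMs); // ZADD timeouts <deadline> <gameId>
}

function sweep(now = Date.now()) {
  // Runs every ~1s: collect all games whose deadline has passed.
  const expired = [];                               // ZRANGEBYSCORE timeouts 0 <now>
  for (const [gameId, deadline] of deadlines) {
    if (deadline <= now) {
      expired.push(gameId);
      deadlines.delete(gameId);
    }
  }
  return expired; // each of these games is then ended by timeout
}

onMoveApplied("g1", 3_000, 1_000); // opponent has 3s left at t=1s
console.log(sweep(2_000));         // [] — not expired yet
console.log(sweep(5_000));         // [ 'g1' ] — flagged
```

With a real sorted set, the sweeper reads only the expired prefix rather than scanning every game, so a 1-second sweep interval stays cheap even at 500K games.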
Time Control Variants
Bullet
1-2 min
0-1s
1+0
Extremely time-sensitive, < 50ms accuracy needed
Blitz
3-5 min
0-2s
5+0
Most popular, moderate time pressure
Rapid
10-15 min
0-10s
15+10
Longer, increment adds complexity
Classical
30+ min
varies
30+30
Less latency-sensitive, but longer server resource hold
Increment handling:
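A minimal sketch of Fischer-increment handling: after the mover completes a legal move (and assuming their flag check already passed), the increment is added back before the opponent's clock starts. Names are illustrative:

```javascript
// Increment ("Fischer") time control: after each completed move, the mover
// gets incrementMs added back to their clock.
function applyIncrement(clocks, mover, incrementMs) {
  return mover === "w"
    ? { ...clocks, whiteMs: clocks.whiteMs + incrementMs }
    : { ...clocks, blackMs: clocks.blackMs + incrementMs };
}

// 15+10 rapid: 15 minutes base, 10 seconds back per move
let clocks = { whiteMs: 15 * 60_000, blackMs: 15 * 60_000 };
clocks = applyIncrement(clocks, "w", 10_000); // White just moved
console.log(clocks.whiteMs); // 910000
```

Note the ordering: the flag check uses the pre-increment value, so a player cannot be saved by an increment they never reached.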
Handling Network Latency Compensation
When a player's move arrives at the server, some time was spent in transit. Should this transit time count against the player?
Count transit time
Simple β just use server timestamps
Penalizes players with bad connections
Subtract estimated RTT
Server tracks avg RTT per player, deducts half
Fairer, but gameable
Hybrid (recommended)
Cap compensation at e.g., 200ms, based on measured RTT
Balance between fairness and anti-cheat
Chess.com uses a server-timestamp approach with some compensation — they don't penalize obvious network delays but also don't allow clients to claim arbitrary timestamps.
Client-Side Clock Display
Even though the server is authoritative, the client must show a smooth countdown:
After receiving a server clock update, set local clock to server value
Run a local setInterval (every ~100ms) decrementing the active clock
When the next move arrives from the server, override local values with server values
Visual discrepancy is typically < 100ms — imperceptible
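The steps above can be sketched as a small client-side clock object; the structure and names are illustrative:

```javascript
// Client display clock: server values are authoritative; between server
// updates the client interpolates locally for a smooth countdown.
function makeDisplayClock() {
  let snapshot = null; // { whiteMs, blackMs, sideToMove, receivedAt }

  return {
    onServerUpdate(whiteMs, blackMs, sideToMove, receivedAt = Date.now()) {
      snapshot = { whiteMs, blackMs, sideToMove, receivedAt }; // hard reset to server truth
    },
    render(now = Date.now()) {
      // Called from a ~100ms UI interval; derives display values on the fly.
      const elapsed = now - snapshot.receivedAt;
      return {
        whiteMs: Math.max(0, snapshot.whiteMs - (snapshot.sideToMove === "w" ? elapsed : 0)),
        blackMs: Math.max(0, snapshot.blackMs - (snapshot.sideToMove === "b" ? elapsed : 0)),
      };
    },
  };
}

const clock = makeDisplayClock();
clock.onServerUpdate(60_000, 60_000, "w", 0); // move event from server
console.log(clock.render(2_500)); // { whiteMs: 57500, blackMs: 60000 }
```

Because `render` clamps at zero and every server update fully replaces the snapshot, the display can never drift far from, or contradict, the authoritative clocks.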
What Works at Stage 3
Independent scaling: WS gateways scale by memory, game servers by CPU
~50K concurrent games: Redis Cluster with 3 shards handles the throughput
Fault isolation: A game server crash doesn't drop WebSocket connections
Independent deployment: Update matchmaking without touching game logic
Kafka event bus: Decoupled downstream processing (archival, analytics, ratings)
What Breaks — Why We Need Stage 4
Single matchmaking queue bottleneck: One matchmaker processing all players becomes a bottleneck at peak load
Tight coupling in matchmaking: The matchmaker directly writes to Redis and creates games — no async pipeline
No game archival pipeline: Completed games need to flow from Redis → PostgreSQL → S3 reliably
Matchmaking fairness: Simple rating range ± 25 doesn't account for rating uncertainty (Glicko-2 RD)
5. Stage 4: Distributed Matchmaking & Event-Driven Architecture
Target: ~200K concurrent games
The Problem
At 200K concurrent games, matchmaking is processing thousands of match requests per second. A single matchmaking process scanning a Redis sorted set becomes a bottleneck — especially during peak hours when thousands of players queue simultaneously for popular time controls like blitz 5+0.
Additionally, completed games need a reliable pipeline to flow from Redis (hot storage) → PostgreSQL (warm) → S3 (cold archive), and rating updates need to happen asynchronously without blocking game flow.
The Solution: Kafka Event Bus + Sharded Matchmaking + Glicko-2

Deep Dive: Matchmaking Queue Architecture
Rating System: Glicko-2
Chess.com transitioned from Elo to Glicko-2 (developed by Mark Glickman). Glicko-2 adds two dimensions beyond a single number:
Rating (r): The skill estimate (e.g., 1500)
Rating Deviation (RD): Uncertainty — new/inactive players have high RD (system is unsure about their true rating); active players have low RD
Volatility (σ): How consistently the player performs
Why Glicko-2 over Elo?
Elo treats a player who plays 1000 games the same as someone who played 10 — their rating has equal "confidence"
Glicko-2's RD means inactive players' ratings are "less certain," so matchmaking can account for this
Elo can create rating inflation/deflation pools; Glicko-2 self-corrects
Matchmaking Queue Data Structure:
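A sketch of the queue data structure and its core matching loop. An in-memory array stands in for the per-time-control Redis sorted set (ZADD with rating as score), and the range-widening rate is an illustrative assumption:

```javascript
// One rating-keyed queue per time control.
const queues = new Map(); // timeControl -> [{ playerId, rating, queuedAt }]

function enqueue(timeControl, playerId, rating, now = Date.now()) {
  if (!queues.has(timeControl)) queues.set(timeControl, []);
  queues.get(timeControl).push({ playerId, rating, queuedAt: now });
}

function tryMatch(timeControl, playerId, rating, baseRange = 25, now = Date.now()) {
  const queue = queues.get(timeControl) ?? [];
  for (const [i, seeker] of queue.entries()) {
    // Widen the acceptable rating range the longer a seeker has waited.
    const waitedSec = (now - seeker.queuedAt) / 1000;
    const range = baseRange + waitedSec * 10; // illustrative widening rate
    if (Math.abs(seeker.rating - rating) <= range) {
      queue.splice(i, 1);
      return seeker; // pair them up and create a game
    }
  }
  enqueue(timeControl, playerId, rating, now); // no match yet: wait in queue
  return null;
}

enqueue("rapid_10_0", "p1", 1200, 0);
console.log(tryMatch("rapid_10_0", "p2", 1210, 25, 0)?.playerId); // p1
```

A Glicko-2-aware version would widen `range` by the seekers' RD as well, so uncertain ratings tolerate wider pairings from the start.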
Distributed Matchmakers — Sharding Strategy:
With 500K concurrent seekers, a single matchmaker becomes a bottleneck. Solutions:
Single queue, sharded by time control
Each time control has its own queue + dedicated matchmaker
Simple, no coordination
Popular controls (blitz 5+0) still hot
Partitioned by rating range
Shard queue by rating buckets (0-1000, 1000-1500, etc.)
Parallelizes well
Boundary players might miss cross-shard matches
Regional + global
Match within region first, escalate to cross-region after timeout
Low latency matches
Cross-region adds complexity
Recommended: Hybrid
Shard by time_control × rating_bucket, with a "boundary overlap" of ± 50
Best parallelism
Slightly complex boundary handling
Color Assignment
Chess.com doesn't randomly assign colors. It tracks your recent color history and balances:
If you played White in your last 3 games, you get Black
If both players have same imbalance, random assignment
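That balancing rule can be sketched as a small function; the history representation ("w"/"b" per recent game) is illustrative:

```javascript
// Color assignment: the player with the larger recent White surplus gets Black.
function assignColors(a, b) {
  const surplus = (h) =>
    h.filter((c) => c === "w").length - h.filter((c) => c === "b").length;
  const [sa, sb] = [surplus(a.history), surplus(b.history)];
  if (sa > sb) return { white: b.id, black: a.id }; // a has played White more
  if (sb > sa) return { white: a.id, black: b.id };
  return Math.random() < 0.5 // equal imbalance: random assignment
    ? { white: a.id, black: b.id }
    : { white: b.id, black: a.id };
}

const pairing = assignColors(
  { id: "alice", history: ["w", "w", "w"] }, // White three games in a row
  { id: "bob",   history: ["w", "b", "b"] },
);
console.log(pairing); // { white: 'bob', black: 'alice' }
```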
Deep Dive: Game State Archival Pipeline (Redis → PostgreSQL → S3)
What Works at Stage 4
Parallel matchmaking: Sharded by time_control × rating_bucket
Reliable event pipeline: Kafka ensures no game data is lost between Redis and PostgreSQL
Decoupled services: Rating updates, analytics, anti-cheat all consume events independently
~200K concurrent games: Infrastructure scales horizontally
What Breaks — Why We Need Stage 5
Single region: All servers are in one data center. Players in Asia connecting to US servers experience 200ms+ latency
Global latency: For competitive bullet chess, even 150ms cross-continent latency is noticeable
Regional failure: If the one data center goes down, the entire platform is offline
6. Stage 5: Global Scale — "Chess.com Level"
Target: ~500K+ concurrent games
The Problem
With a single-region deployment, players on the other side of the world suffer from high latency. A bullet chess player in Tokyo connecting to US-East servers faces ~150ms round-trip time — that's 150ms of their 60-second clock consumed just by network transit on every move. And if that single region goes down, the entire platform disappears.
The Solution: Multi-Region Deployment

Infrastructure at Global Scale
Deep Dive: Where Does the Game Live?
This is the most critical design decision for a multi-region chess platform:
Why not replicate game state across regions? Because chess requires strong per-game consistency — both players must see the exact same board. Cross-region replication introduces latency that's worse than just having one player with slightly higher latency to a single region.
Deep Dive: Database Sharding Strategy
Deep Dive: Observability at Scale
Key Metrics to Monitor
Game Health:
Active games count (should match expected from matchmaking rate)
Average move processing time (target: < 50ms p99)
Clock accuracy (server time vs. measured elapsed)
Games ended by timeout vs. checkmate vs. resignation (shifts in ratio = possible issues)
Infrastructure Health:
WebSocket connection count per gateway
Redis memory usage and command latency
Kafka consumer lag (a growing lag means archival is falling behind)
Game server CPU utilization
Player Experience:
Matchmaking wait time (p50, p95, p99 by rating tier)
Move round-trip latency (client → server → opponent client)
Reconnection rate (high rate = network issues or bugs)
Game abandonment rate
Alerting Thresholds
Move processing p99
> 100ms
> 500ms
WS connections per pod
> 40K
> 55K
Redis memory
> 70%
> 85%
Kafka consumer lag
> 10K messages
> 100K messages
Matchmaking wait p95
> 15s
> 30s
Game server error rate
> 0.1%
> 1%
7. Failure Modes & Resilience
These failure modes apply primarily to Stage 3+ architectures, where the system is distributed enough for partial failures to be meaningful.

Failure Matrix
Client network blip
1 player disconnected
WS heartbeat timeout (30s)
Auto-reconnect with backoff
< 5s typical
Client crash
1 player gone
WS close event
Grace period → forfeit
60s grace
WS Gateway crash
~50K connections dropped
K8s liveness probe
Clients reconnect to other gateways
< 10s
Game Server crash
Games on that server affected
Health checks
New server loads state from Redis
< 5s
Redis primary failure
Active games at risk
Redis Sentinel
Auto-failover to replica (< 5s)
< 5s
Redis cluster partition
Some shards unavailable
Cluster health monitoring
Repair partition or failover
Variable
PostgreSQL failure
No new game archival, no auth
PG replication
Failover to replica
< 30s
Kafka broker failure
Event processing delayed
Consumer lag monitoring
Kafka replication handles it
< 10s
Full region failure
All games in region lost
Cross-region health checks
Players reconnect to other region
Minutes
Game Server Crash Recovery (Why Stateless Servers + Redis Works)
This is the most interesting failure mode. When a game server crashes mid-game:
What if a move was "in-flight" during the crash?
Redis Failure + Kafka as Backup Event Log
Redis Sentinel (for single-node Redis):
Sentinel monitors Redis primary
On failure, promotes a replica to primary
Clients get notified of the new primary endpoint
Data loss risk: Async replication means the last few writes might be lost
Mitigation:
WAIT 1 0 command can force sync replication (adds latency)
Redis Cluster (for sharded deployment):
Data sharded across multiple primaries by hash slot
Each primary has 1+ replicas
If a primary fails, its replica takes over automatically
Partial failure: Only games on the failed shard are affected
What if Redis loses a game's state?
Client Reconnection Protocol
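A sketch of the server side of the protocol: the reconnecting client reports the last move number it saw, and the server replies with just the missed suffix of the event log plus the authoritative snapshot. Message shapes and names are illustrative:

```javascript
// Resume handler: catch a reconnecting client up from its last seen move.
function handleResume(games, gameId, playerId, lastSeenMoveNumber) {
  const game = games.get(gameId);
  if (!game) return { type: "game_gone" }; // archived, or never existed
  if (game.players.white !== playerId && game.players.black !== playerId) {
    return { type: "rejected" };           // not a participant (spectators use a different path)
  }
  return {
    type: "resume",
    missedMoves: game.moves.slice(lastSeenMoveNumber), // catch-up suffix only
    fen: game.fen,                                     // authoritative snapshot
    clocks: game.clocks,                               // server clocks always win
  };
}

const games = new Map([["g1", {
  players: { white: "alice", black: "bob" },
  moves: ["e4", "e5", "Nf3"],
  fen: "CURRENT_FEN",
  clocks: { whiteMs: 55_000, blackMs: 58_000 },
}]]);
console.log(handleResume(games, "g1", "alice", 1).missedMoves); // [ 'e5', 'Nf3' ]
```

Sending only the suffix keeps reconnection cheap; a client that saw everything gets an empty `missedMoves` plus fresh clocks, which corrects any drift accumulated while offline.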
8. Design Trade-offs & Alternative Choices
Trade-off 1: Game State Storage
In-memory (per game server)
Each server holds game state in a HashMap
Fastest reads/writes
Lost on crash, can't share across servers
Stage 1 only
Redis (external)
Game state in Redis, servers are stateless
Survives crashes, shareable, fast
Network hop per operation (~0.5ms)
Recommended (Stage 2+)
Database (PostgreSQL)
Game state directly in relational DB
Durable, transactional
Too slow for hot path (~5-20ms per query)
For archival only
Actor model (Akka/Orleans)
Each game is an actor with persistent state
Natural model, built-in recovery
Complex infrastructure, vendor lock-in
Valid alternative
Trade-off 2: Event Sourcing vs. State Snapshotting
Pure Event Sourcing
Store only move events, derive state by replay
Perfect audit trail, replay, crash recovery
Slow state reconstruction for long games
Pure State Snapshot
Store only current FEN + clocks
Fastest reads
No history, no replay, no audit
Hybrid (recommended)
Store both: event log (moves) + latest state snapshot (FEN+clocks)
Fast reads + full history
Slightly more storage (~3KB extra per game)
The hybrid approach is overwhelmingly the right choice. Storage is cheap. The move list IS the event log; it's what chess players have always recorded.
Trade-off 3: WebSocket vs. gRPC Streaming
WebSocket
Yes
Native
Very low
Sticky sessions (L4)
gRPC streaming
Yes
Needs grpc-web proxy
Low
Complex (HTTP/2 streams)
Socket.IO
Yes (wraps WS)
Yes
Moderate (extra protocol layer)
Adapter needed (Redis)
Verdict: Raw WebSocket for the game protocol (minimal overhead), with Socket.IO as a fallback for environments where WebSocket doesn't work. gRPC is better for server-to-server communication (e.g., game server → timer service).
Trade-off 4: Centralized vs. Distributed Game Servers
Centralized game server
Single service handles all game logic
Simple, consistent
Single point of failure, scaling limit
Distributed, partitioned by gameId
Any server can handle any game (state in Redis)
Horizontally scalable, fault-tolerant
Need Redis as coordination layer
Distributed with locality
Assign games to servers, with failover
Better cache locality
Complex routing, rebalancing needed
Recommended: Distributed, partitioned by gameId. Since game state lives in Redis, any game server can process any game's moves. This gives maximum flexibility and simplest failover.
Trade-off 5: Push vs. Pull Clock Synchronization
Push on every move
Server sends authoritative clocks with each move event
Simple, syncs on natural cadence
Between moves, client clock drifts
Periodic push
Server broadcasts clock sync every 1s
More accurate between moves
500K games × 1 msg/sec = high bandwidth
Pull (client requests)
Client polls server for clock values
Client-controlled
Adds load, introduces latency
Hybrid (recommended)
Push with moves + client local countdown
Accurate at move boundaries, smooth display
Minor drift between moves (acceptable)
9. Data Model & Storage Architecture
PostgreSQL Schema (Simplified)
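A sketch of what the simplified schema might look like; table and column names are illustrative, not Chess.com's actual schema:

```sql
-- Illustrative schema sketch; real column sets would be larger.
CREATE TABLE users (
    id           BIGSERIAL PRIMARY KEY,
    username     TEXT UNIQUE NOT NULL,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE ratings (
    user_id      BIGINT REFERENCES users(id),
    time_control TEXT NOT NULL,                            -- 'bullet', 'blitz', 'rapid', 'classical'
    rating       DOUBLE PRECISION NOT NULL DEFAULT 1500,
    rd           DOUBLE PRECISION NOT NULL DEFAULT 350,    -- Glicko-2 rating deviation
    volatility   DOUBLE PRECISION NOT NULL DEFAULT 0.06,   -- Glicko-2 sigma
    PRIMARY KEY (user_id, time_control)
);

CREATE TABLE games (
    id           UUID PRIMARY KEY,
    white_id     BIGINT REFERENCES users(id),
    black_id     BIGINT REFERENCES users(id),
    time_control TEXT NOT NULL,
    result       TEXT NOT NULL,         -- '1-0', '0-1', '1/2-1/2'
    termination  TEXT NOT NULL,         -- 'checkmate', 'timeout', 'resignation', ...
    pgn          TEXT NOT NULL,         -- full move list for replay/analysis
    ended_at     TIMESTAMPTZ NOT NULL
);
CREATE INDEX games_white_idx ON games (white_id, ended_at DESC);
CREATE INDEX games_black_idx ON games (black_id, ended_at DESC);
```

The two per-player indexes support the "my game history" query pattern; older rows eventually migrate to the S3 cold archive per the pipeline in Stage 4.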
Sharding Strategy (at Chess.com Scale)
10. Key Design Principles
Server is the single source of truth β Clients are untrusted. All game state, move validation, and clock management is server-authoritative.
Stateless compute, external state β Game servers hold no in-memory state. Redis is the state store. Any server can handle any game.
Event sourcing is natural β The move list is the event log. FEN is the materialized view. This gives you replay, recovery, and audit for free.
Games are independent β No cross-game state means trivial horizontal scaling via sharding by gameId.
Lazy clock evaluation β Don't run 500K timers. Compute remaining time on-demand when moves arrive, with a sweep for timeout detection.
Separate WebSocket from game logic β WS gateways are memory-bound; game servers are CPU-bound. Scale independently.
Design for failure β Every component can crash. Redis survives game server crashes. Kafka survives Redis failures. Clients reconnect gracefully.
Start simple, scale when needed β Chess.com itself ran on a single MySQL database for years. Don't over-engineer day one. Add Redis, then sharding, then multi-region as traffic demands.