System Design: In-Memory Database
From Hash Tables to Tiered Storage: A Staff Engineer's Guide to the Four-Part Interview
1. The Problem & Why It's Hard
This interview question typically comes in four escalating parts, each one raising the bar:
| Part | Challenge | Key concepts |
|---|---|---|
| 1 | Implement an in-memory database | Data structures, API design, concurrency |
| 2 | Support efficient querying with inverted indexes | Search fundamentals, index maintenance cost |
| 3 | Handle datasets too large for memory | Tiered storage, buffer management |
| 4 | Keep hot data in memory, preload frequent data | Caching algorithms, access pattern analysis |
Most candidates finish Parts 1 and 2. Parts 3 and 4 are where staff+ candidates separate themselves.
The interviewer's real question: Can you design a system that delivers microsecond reads when data fits in memory, millisecond reads when it doesn't, and degrades gracefully as the dataset grows, all while maintaining correctness under concurrent access?
The trap is treating this as "just put a hash map behind an API." The real challenge is threefold:
Concurrency without destroying latency: The whole point of in-memory is speed. A global lock turns your microsecond database into a millisecond one.
Querying beyond key-value: Adding secondary indexes (inverted indexes) means every write now updates multiple data structures atomically.
The memory cliff: When data exceeds memory, you don't just "add disk." You need a tiered architecture that's transparent to the caller and doesn't sacrifice the in-memory performance for hot data.
Staff+ Signal: The real interview insight is that Parts 1-4 aren't four separate systems; they're one system at four stages of maturity. A staff engineer designs Part 1 with the foresight that Parts 3 and 4 are coming. The storage engine abstraction you choose in Part 1 determines whether Part 3 is a refactor or a rewrite.
2. Requirements & Scope
Functional Requirements
CRUD operations: Create table, insert row, select by primary key, update, delete
SQL-like interface: CREATE TABLE, INSERT INTO, SELECT ... WHERE
Secondary indexes: Inverted index for efficient querying on non-primary-key columns
Multi-column queries: Support WHERE col1 = x AND col2 = y
Persistence: Survive process restarts (optional but expected at staff+ level)
Transactions: At minimum, atomic single-row operations; ideally multi-row ACID
Non-Functional Requirements
| Metric | Target | Rationale |
|---|---|---|
| Point read latency (in-memory) | < 10 µs | Hash table lookup + minimal overhead |
| Point read latency (on-disk) | < 1 ms | SSD random read + deserialization |
| Write latency | < 50 µs | Append to WAL + update in-memory structures |
| Query latency (indexed) | < 1 ms | Inverted index lookup + row fetch |
| Throughput | 100K-1M ops/sec | Single node, depends on concurrency model |
| Availability | 99.99% | Data loss = user trust loss |
Scale Estimation (Back-of-Envelope)
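The estimate itself did not survive in this copy; below is a sketch of the kind of back-of-envelope math this step calls for. All workload numbers are assumptions for illustration, not figures from the guide.

```python
# Back-of-envelope sketch -- every number here is an assumed input, not a source figure.
rows        = 100_000_000      # 100M rows
row_bytes   = 200              # average row payload
index_ratio = 0.15             # indexes ~15% of data size
overhead    = 1.25             # allocator + fragmentation headroom (20-30%)

data_bytes = rows * row_bytes                            # 20 GB of raw rows
resident   = data_bytes * (1 + index_ratio) * overhead
print(f"resident set ~ {resident / 2**30:.0f} GiB")      # ~27 GiB -> near the Stage 1/2 boundary

target_qps = 500_000           # read-heavy target throughput
hit_ratio  = 0.95              # buffer pool hit-ratio SLO (see Observability section)
print(f"disk reads/sec ~ {target_qps * (1 - hit_ratio):,.0f}")   # 25,000 IOPS on the SSD tier
```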
3. Phase 1: The In-Memory Data Store
This is LeetCode "Design SQL": the foundation. But how you build it determines everything that follows.
Core Data Structures
Primary storage: Hash map (PK → Row)
Why hash map over B-tree for primary index?
| Structure | Point lookup | Range scan | Insert | Memory per entry |
|---|---|---|---|---|
| Hash map | O(1) | O(n) (must scan all) | O(1) amortized | ~64 B |
| B-tree | O(log n) | O(log n + k) | O(log n) | ~40 B |
| Skip list | O(log n) | O(log n + k) | O(log n) | ~33 B |
For an in-memory database where most access is by primary key, the hash map wins. If range queries on PK are needed, a B-tree or skip list is better. SingleStore uses lock-free skip lists for this reason: they support both point and range queries without lock contention.
Pseudocode: Core Operations
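The pseudocode block is missing from this copy; here is a minimal Python sketch of what Phase 1 amounts to. Class and method names are illustrative assumptions.

```python
class Table:
    def __init__(self, name, columns, pk):
        self.name, self.columns, self.pk = name, columns, pk
        self.rows = {}                        # primary index: PK -> row dict

class InMemoryDB:
    def __init__(self):
        self.tables = {}                      # catalog: table name -> Table

    def create_table(self, name, columns, pk):
        self.tables[name] = Table(name, columns, pk)

    def insert(self, table, row):
        t = self.tables[table]
        t.rows[row[t.pk]] = dict(row)         # O(1) amortized

    def select(self, table, pk_value):
        return self.tables[table].rows.get(pk_value)   # O(1) point lookup

    def update(self, table, pk_value, changes):
        self.tables[table].rows[pk_value].update(changes)

    def delete(self, table, pk_value):
        self.tables[table].rows.pop(pk_value, None)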
When Phase 1 works: Small datasets (< RAM), single-threaded access, simple key-value lookups.
When Phase 1 fails: The moment you need SELECT * WHERE status = 'active', every query is a full table scan.
4. Why Naive Fails (The Math)
Problem 1: Query Performance Without Indexes
Problem 2: Concurrency Bottleneck
Problem 3: The Memory Wall
| Problem | Naive approach | Production approach |
|---|---|---|
| Query on non-PK column | Full table scan, O(n) | Inverted index: O(1) lookup + O(k) fetch |
| Concurrent access | Global lock (~40K ops/sec) | Partition-serial or MVCC (millions of ops/sec) |
| Data exceeds memory | Crash or OOM-kill | Tiered storage with buffer pool |
| Persistence | None (data lost on restart) | WAL + periodic snapshots |
The tipping point: A single hash map works for a LeetCode exercise. Production needs indexes for queries, a concurrency strategy for throughput, and a tiered storage layer for growth.
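A rough version of the math behind the three problems above, using assumed per-operation costs (the specific nanosecond numbers are illustrative, not from the source):

```python
# Problem 1: query without an index (assume ~100 ns per row for predicate check + pointer chase)
rows = 10_000_000
print(rows * 100 / 1e6, "ms per full scan")     # ~1,000 ms for a single non-PK query

# Problem 2: one global lock serializes everything.
# If the critical section (WAL append + index updates) takes ~25 us,
# throughput caps at 1e6 / 25 ~= 40K ops/sec regardless of how many cores you add.
print(1e6 / 25, "ops/sec ceiling")

# Problem 3: the memory wall. 500 GB of data on a 64 GB box does not fit;
# without tiering, the process is OOM-killed at roughly 13% of the dataset.
print(64 / 500)
```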
5. Phase 2+: Distributed Architecture
The key architectural insight: Separate the storage engine (how data is physically stored and accessed) from the query engine (how queries are parsed, planned, and executed), and design the storage engine interface so it can transparently tier between memory and disk.
How Real Companies Built This
Redis: The Single-Threaded Champion
Redis sidesteps concurrency entirely: one thread processes all commands sequentially. This sounds limiting, but a single Redis process handles 100K-200K ops/sec because each operation is a simple in-memory data structure manipulation taking microseconds. For more throughput, you run multiple Redis processes (Redis Cluster) with 16,384 hash slots distributed across shards.
Redis's persistence model offers two options:
RDB snapshots: Fork the process, child writes memory to disk via copy-on-write. Fast recovery but loses data since last snapshot.
AOF (Append-Only File): Log every write command. Durable but slower recovery (replay all commands).
Redis Enterprise recommends max 25GB per process and multiple processes per VM, carefully aligned to NUMA boundaries to avoid cross-socket memory access penalties.
War story: Coinbase experienced a 5-hour Redis outage during a period of 50x traffic growth. The root cause was connection pool exhaustion: their application opened connections faster than Redis could process them, creating a stampede that took down the entire cluster. The lesson: connection pooling and backpressure are as important as the database itself.
VoltDB: Partition-Serial, Zero Locking
VoltDB takes the most radical approach to concurrency: eliminate it entirely. Data is partitioned into shards, and each shard is processed by a single thread. There are no locks, no latches, no MVCC overhead. A single partition runs stored procedures sequentially, achieving millions of transactions per second per node because every operation is a direct memory access with zero coordination.
The trade-off: cross-partition transactions require two-phase coordination and are significantly more expensive. VoltDB's design forces you to think about your data partitioning upfront β the partition key is the most important schema decision.
SingleStore (MemSQL): Lock-Free Skip Lists + Code Generation
SingleStore compiles SQL queries to machine code via LLVM at CREATE TABLE time. Their rowstore uses lock-free skip lists as the primary index, allowing concurrent readers and writers without blocking. For memory-exceeding workloads, their columnstore compresses data with dictionary and run-length encoding, achieving 7x+ compression ratios.
H-Store Anti-Caching: Inverting the Buffer Pool
Carnegie Mellon's H-Store project pioneered "anti-caching", the inverse of traditional databases. Instead of "data lives on disk, cache hot data in memory," anti-caching says "data lives in memory, evict cold data to disk." When a transaction accesses evicted data, it aborts, the data is fetched from disk asynchronously, and the transaction retries. This achieved 9-10x throughput over MySQL for larger-than-memory workloads with skewed access patterns.
Staff+ Signal: Redis's single-threaded model and VoltDB's partition-serial model are both solutions to the same insight: in-memory operations are so fast (microseconds) that the overhead of lock acquisition (even uncontended mutexes cost ~25ns) and cache-line bouncing between CPU cores dominates actual computation time. The staff+ answer isn't "add more threads"; it's "eliminate coordination overhead entirely, then scale out horizontally." This is why Redis at 100K ops/sec per thread often beats multi-threaded databases at aggregate throughput.
Key Data Structure: The Inverted Index
The inverted index is the answer to Part 2 of the interview: efficient querying on non-primary-key columns.
Posting list storage strategies:
| Structure | Memory footprint | Intersection cost | Best for |
|---|---|---|---|
| Sorted array | Compact | O(n + m) merge | Static data, low update frequency |
| Roaring bitmap | Very compact for dense sets | O(min(n, m)) bitwise AND | Low-cardinality columns (status, country) |
| Skip list | Moderate | O(n log m) | High update frequency |
| Hash set | Largest | O(min(n, m)) lookups | Small posting lists |
RediSearch uses compressed inverted indexes with delta encoding and roaring bitmaps, achieving sub-millisecond full-text search across millions of documents.
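A minimal sketch of the index plus the sorted-array merge intersection described above; names and layout are illustrative assumptions, not a specific library's API.

```python
from bisect import insort
from collections import defaultdict

class InvertedIndex:
    """Column value -> sorted posting list of row IDs (sorted-array strategy)."""
    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, value, row_id):
        insort(self.postings[value], row_id)   # this insert is the per-index write amplification

    def lookup(self, value):
        return self.postings.get(value, [])

def intersect(a, b):
    """O(n + m) merge intersection of two sorted posting lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# WHERE city = 'SF' AND status = 'active':
# matching_ids = intersect(city_idx.lookup('SF'), status_idx.lookup('active'))
```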
6. Core Component Deep Dives
6.1 Buffer Pool Manager (The Heart of Part 3)
When data exceeds memory, the buffer pool is the gatekeeper between fast RAM and slow disk. This is the answer to Part 3 of the interview.
Responsibilities:
Manage a fixed-size pool of memory pages (frames)
Decide which pages stay in memory and which get evicted
Handle page faults: fetch from disk when a requested page isn't in memory
Track dirty pages and flush them to disk
Page layout:
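The page-layout figure is missing from this copy. The sketch below shows a representative page header plus the buffer-pool fetch path it feeds; field names and the disk-manager API are assumptions, and the LRU here is deliberately naive (a real pool would also skip pinned pages and use a smarter eviction policy, as discussed in 6.2).

```python
from collections import OrderedDict

PAGE_SIZE = 4096                      # assumed; real engines use 4-16 KB pages

class Page:
    def __init__(self, page_id):
        self.page_id = page_id
        self.lsn = 0                  # WAL sequence number of the last modification
        self.pin_count = 0            # pinned pages cannot be evicted
        self.is_dirty = False         # dirty pages must be flushed before eviction
        self.data = bytearray(PAGE_SIZE)

class BufferPool:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk                          # assumed disk manager with read_page/write_page
        self.frames = OrderedDict()               # page_id -> Page, in LRU order for this sketch

    def fetch_page(self, page_id):
        if page_id in self.frames:                # hit: serve from memory
            self.frames.move_to_end(page_id)
            return self.frames[page_id]
        if len(self.frames) >= self.capacity:     # miss with a full pool: evict a victim
            victim_id, victim = next(iter(self.frames.items()))
            if victim.is_dirty:
                self.disk.write_page(victim)      # flush before dropping the frame
            del self.frames[victim_id]
        page = self.disk.read_page(page_id)       # page fault: fetch from SSD
        self.frames[page_id] = page
        return page
```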
Staff+ Signal: MySQL InnoDB's buffer pool uses a midpoint insertion LRU: new pages enter at a midpoint roughly 3/8 of the way from the tail of the list rather than at the head. This prevents full-table scans from flushing the entire buffer pool. A sequential scan reads every page once, and without midpoint insertion, those one-time-access pages would push out genuinely hot data. PostgreSQL solves the same problem differently: a 256KB "ring buffer" isolates sequential scans from polluting shared buffers entirely. The choice between these strategies depends on your workload mix: midpoint insertion is simpler, ring buffers are more effective but require the query planner to classify scan types.
6.2 Eviction Policy Engine (The Heart of Part 4)
Part 4 asks for two things: (1) keep recently-accessed data in memory (eviction policy), and (2) preload frequently-accessed data (prefetching/warming).
Eviction Policy Comparison:
| Policy | Hit ratio quality | Overhead | Adaptivity | Used by |
|---|---|---|---|---|
| LRU | Good for recency | O(1) with doubly-linked list + hash map | None (one access pattern) | Memcached |
| LFU | Good for frequency | O(log n) naive, O(1) with bucket sort | None (can't forget old popular items) | Redis (with decay) |
| ARC | Near-optimal | O(1) but 2x metadata (ghost entries) | Self-tuning recency/frequency balance | ZFS, IBM DS8000 |
| W-TinyLFU | Near-optimal | O(1), 8 bytes/entry for sketch | Hill climbing for window size | Caffeine (Java) |
| Clock | Good (LRU approximation) | O(1), no linked-list overhead | None | PostgreSQL, Linux page cache |
W-TinyLFU: the state of the art
The key insight: W-TinyLFU combines recency (window LRU catches burst popularity) with frequency (main cache retains consistently popular items) and uses a probabilistic frequency sketch (Count-Min Sketch) that costs only 8 bytes per tracked item, far cheaper than ARC's ghost entries.
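A stripped-down sketch of the frequency sketch and the TinyLFU admission test; the table sizes and hashing scheme are illustrative assumptions, not Caffeine's packed-counter implementation.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=4096, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        for i in range(self.depth):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8)
            yield i, int.from_bytes(h.digest(), "big") % self.width

    def add(self, key):
        for row, col in self._cells(key):
            self.table[row][col] += 1

    def estimate(self, key):
        # Over-estimates are possible, never under-estimates -- good enough for admission.
        return min(self.table[row][col] for row, col in self._cells(key))

def admit(sketch, candidate_key, victim_key):
    """TinyLFU admission: only evict the victim if the candidate has been seen more often."""
    return sketch.estimate(candidate_key) > sketch.estimate(victim_key)
```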
Preloading Strategy (Part 4 continued):
Staff+ Signal: The preloading strategy matters most during cold starts, and cold starts happen more often than people think (deploys, failovers, scaling events). An unthrottled cache warming strategy is effectively a self-inflicted DDoS against your backing store. Production systems like InnoDB solve this by dumping the buffer pool page list (not the data) to disk on shutdown, then loading those specific pages on startup at a controlled rate. Redis does the same with RDB snapshots. The staff+ insight: cache warming latency directly determines your failover recovery time, which directly determines your availability SLO.
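A sketch of that warm-up flow. The buffer-pool methods resident_page_ids() and fetch_page() are hypothetical names standing in for whatever your pool exposes, and the pacing is deliberately crude.

```python
import json
import time

def dump_page_list(buffer_pool, path="pagelist.json"):
    # On shutdown, persist only the IDs of resident pages, not their contents (InnoDB-style).
    with open(path, "w") as f:
        json.dump(list(buffer_pool.resident_page_ids()), f)

def warm(buffer_pool, path="pagelist.json", pages_per_sec=1000):
    # On startup, reload at a throttled rate so warming doesn't saturate the backing store.
    with open(path) as f:
        page_ids = json.load(f)
    for i, pid in enumerate(page_ids):
        buffer_pool.fetch_page(pid)          # pulls the page from SSD back into RAM
        if (i + 1) % pages_per_sec == 0:
            time.sleep(1)                    # crude one-second pacing window
```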
6.3 Concurrency Control
Three production-proven approaches:
Approach 1: Single-threaded event loop (Redis)
Approach 2: Partition-serial (VoltDB)
Approach 3: MVCC (Hekaton / most traditional DBs)
| Model | Throughput | Complexity | Multi-key / cross-partition transactions | Best for |
|---|---|---|---|---|
| Single-threaded | 100-200K ops/sec | Lowest | Free (all serial) | Cache, simple KV |
| Partition-serial | Millions of ops/sec | Medium | Expensive (2PC) | OLTP with partition key |
| MVCC | High (readers free) | Highest | Natural (snapshot isolation) | General-purpose RDBMS |
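A toy sketch of partition-serial execution: one thread owns each partition's data, so no locks are needed on the data itself and operations within a partition run strictly in order. Routing and the API names are illustrative assumptions.

```python
import queue
import threading

class Partition:
    """One partition = one thread = zero locks on its data (VoltDB-style sketch)."""
    def __init__(self):
        self.data = {}                        # only this partition's thread ever touches it
        self.inbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn, args, reply = self.inbox.get()
            reply.put(fn(self.data, *args))   # executed serially, no latches needed

    def submit(self, fn, *args):
        reply = queue.Queue(maxsize=1)
        self.inbox.put((fn, args, reply))
        return reply.get()

class PartitionedStore:
    def __init__(self, n=8):
        self.partitions = [Partition() for _ in range(n)]

    def _route(self, key):
        return self.partitions[hash(key) % len(self.partitions)]

    def put(self, key, value):
        return self._route(key).submit(lambda d, k, v: d.__setitem__(k, v), key, value)

    def get(self, key):
        return self._route(key).submit(lambda d, k: d.get(k), key)
```

Cross-partition transactions would have to coordinate two inboxes (the 2PC cost the table calls "expensive"), which is why the partition key is the schema decision that matters most.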
7. The Scaling Journey
Stage 1: Single Process, All In-Memory (0-25GB)
Limit: Single machine RAM (64-512GB). No high availability. Process crash = downtime until WAL replay.
Stage 2: Tiered Storage, Single Node (25GB-500GB)
New capabilities:
Buffer pool manages hot/cold data transparently
W-TinyLFU or ARC eviction keeps working set in memory
Indexes remain in memory (they're small relative to data)
Background flush writes dirty pages to SSD
Limit: Single-node CPU and I/O bandwidth. No horizontal scaling. SSD wear from write amplification.
Stage 3: Sharded Cluster (500GB-10TB+)
New capabilities:
Hash-slot-based sharding (Redis: CRC16 mod 16384)
Each shard has tiered storage independently
Replicas for read scaling and high availability
Online resharding: migrate slot ranges between nodes
Staff+ Signal: At the sharded stage, the inverted index becomes a distributed problem. Option A: each shard maintains its own local inverted index (queries on non-partition-key columns scatter to all shards). Option B: a separate global index service. Option A is operationally simpler but every non-PK query hits every shard. Option B reduces scatter but adds a service to operate and keep consistent. Twitter chose Option A for their 10,000+ Redis instances because the operational burden of a global index service outweighed the scatter-gather cost. The right choice depends on your query patterns: if 90% of queries include the partition key, local indexes win.
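A sketch of the two routing paths in a sharded cluster: Redis Cluster-style CRC16 slot routing for partition-key lookups, and scatter-gather over per-shard local indexes (Option A) for everything else. The shard objects and local_index_lookup are assumed interfaces, not a real client API.

```python
import binascii

HASH_SLOTS = 16384

def key_slot(key: str) -> int:
    # Redis Cluster routing: CRC16 (XModem) of the key, modulo 16,384 slots.
    return binascii.crc_hqx(key.encode(), 0) % HASH_SLOTS

def get_by_pk(shards, slot_to_shard, key):
    # Partition-key lookup touches exactly one shard.
    return shards[slot_to_shard[key_slot(key)]].get(key)

def query_by_column(shards, column, value):
    # Non-partition-key query with local indexes (Option A): scatter to every shard, merge results.
    results = []
    for shard in shards:
        results.extend(shard.local_index_lookup(column, value))
    return results
```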
8. Failure Modes & Resilience
Request Flow with Failure Handling
Failure Scenarios
| Failure | Detection | Mitigation | Blast radius | Data loss |
|---|---|---|---|---|
| Process crash | Heartbeat timeout (3s) | Restart + WAL replay | Single shard downtime | Lose last ~1s of async-replicated writes |
| OOM kill | Kernel OOM event | Restart with lower maxmemory; eviction policy kicks in | Single shard | No data loss if WAL enabled |
| Disk failure | I/O errors | Promote replica; rebuild from snapshot | Single shard write path | Lose data since last snapshot if no replica |
| Network partition | Split-brain detection | Quorum-based leader election; minority side rejects writes | Partitioned shards | Risk of divergent writes without quorum |
| Hot key thundering herd | Latency spike on single shard | Local caching at client; key splitting | Single shard, all queries for that key | No data loss, but availability degraded |
| Memory fragmentation | Increasing RSS with stable key count | Restart with jemalloc defrag, or activedefrag in Redis | Gradual throughput degradation | No data loss |
Staff+ Signal: The most insidious failure in an in-memory database is memory fragmentation. After millions of allocations and deallocations of varying sizes, physical memory becomes fragmented even though there's enough total free space. RSS grows while useful data size stays flat. Redis addresses this with activedefrag (background jemalloc defragmentation), but it competes with command processing for CPU. The staff+ insight: your eviction policy math (e.g., "we have 64GB for 50GB of data, so 14GB headroom") must account for fragmentation overhead; real headroom is more like 20-30% of useful data size, not the naive difference.
Persistence and Durability
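The body of this subsection did not survive extraction. Below is a minimal WAL sketch consistent with the persistence story elsewhere in the guide (append every write before acknowledging it; fsync policy is the durability/latency knob; recovery loads the latest snapshot and replays the log from there). The record format and method names are assumptions.

```python
import os
import pickle

class WriteAheadLog:
    def __init__(self, path="db.wal", fsync_every_write=False):
        self.path = path
        self.f = open(path, "ab")
        self.fsync_every_write = fsync_every_write   # True: durable per write; False: flush ~once/sec elsewhere

    def append(self, op, key, value=None):
        record = pickle.dumps((op, key, value))
        self.f.write(len(record).to_bytes(4, "big") + record)
        if self.fsync_every_write:
            self.f.flush()
            os.fsync(self.f.fileno())                # the extra latency you pay for zero data loss

    def replay(self, apply):
        # Recovery: load the latest snapshot first, then replay records written after it.
        with open(self.path, "rb") as f:
            while True:
                header = f.read(4)
                if len(header) < 4:
                    break
                op, key, value = pickle.loads(f.read(int.from_bytes(header, "big")))
                apply(op, key, value)
```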
9. Data Model & Storage
Core Tables (Metadata)
In-Memory Page Layout
Storage Engine Interface
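The interface listing is missing from this copy; here is a sketch of the shape it would take: a tier-agnostic contract the query layer codes against, so swapping the Phase 1 all-in-memory engine for the Phase 3 tiered engine is a refactor rather than a rewrite. Method names are assumptions.

```python
from abc import ABC, abstractmethod
from typing import Optional

class StorageEngine(ABC):
    """Tier-agnostic contract: callers never know whether a row came from RAM or SSD."""

    @abstractmethod
    def get(self, table: str, pk) -> Optional[dict]: ...

    @abstractmethod
    def put(self, table: str, pk, row: dict) -> None: ...

    @abstractmethod
    def delete(self, table: str, pk) -> None: ...

    @abstractmethod
    def find(self, table: str, column: str, value) -> list:
        """Row IDs from the always-in-memory inverted index."""

class InMemoryEngine(StorageEngine):
    """Phase 1: everything lives in dicts."""
    ...

class TieredEngine(StorageEngine):
    """Phase 3: same interface, but get/put go through the buffer pool."""
    ...
```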
Storage Engine Choice
| Data | Where it lives | Why |
|---|---|---|
| Hot data (working set) | In-memory pages in buffer pool | Microsecond access, the whole point |
| Cold data | NVMe SSD pages | ~100 µs access, ~100x cheaper than RAM per GB |
| WAL | Append-only file on SSD | Sequential writes, fsync for durability |
| Snapshots | Object storage (S3) | Cheap, durable, survives host failure |
| Metadata catalog | In-memory + WAL | Small, must be always available |
| Indexes | Always in memory | Indexes are small relative to data; evicting them destroys query performance |
10. Observability & Operations
Key Metrics
Memory:
buffer_pool_hit_ratio: percentage of page requests served from memory. SLO: > 95%. Below 90% means your working set exceeds the buffer pool. This is the single most important metric.
buffer_pool_dirty_pages_ratio: percentage of dirty pages. If > 80%, flush I/O may bottleneck.
memory_fragmentation_ratio: RSS / used_memory. Alert if > 1.5 (50% waste from fragmentation).
eviction_rate_per_sec: pages evicted per second. Sustained high rate = working set doesn't fit.
Latency:
read_latency_us{tier="memory|disk"}: separating memory hits from disk fetches reveals true performance.
write_latency_us: includes WAL fsync time. Alert if p99 > 1 ms.
index_lookup_latency_us{type="primary|inverted"}: ensure indexes stay fast.
Throughput:
operations_per_second{type="read|write"}: overall throughput by operation type.
page_faults_per_second: disk fetches. Correlates directly with disk I/O saturation.
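For illustration, the headline gauges above can be derived from raw counters roughly like this; the counter names are assumptions, not a specific exporter's schema.

```python
def derive_metrics(counters):
    """Turn raw counters into the gauges described above (illustrative sketch)."""
    hits, misses = counters["page_hits"], counters["page_misses"]
    return {
        "buffer_pool_hit_ratio": hits / max(hits + misses, 1),
        "buffer_pool_dirty_pages_ratio": counters["dirty_pages"] / max(counters["total_pages"], 1),
        "memory_fragmentation_ratio": counters["rss_bytes"] / max(counters["used_bytes"], 1),
    }
```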
Distributed Tracing: Indexed Query
Alerting Strategy
| Alert | Condition | Severity | Action |
|---|---|---|---|
| Buffer pool hit ratio drop | < 90% for 10 min | P2 | Working set grew: increase buffer pool size or add a shard |
| Memory usage critical | > 90% of maxmemory | P1 | Eviction kicking in: scale up RAM or start evicting aggressively |
| WAL disk near full | < 10% disk space | P1 | WAL compaction stuck or snapshots not cleaning up; risk of write halt |
| Replication lag | > 5 seconds | P2 | Replica falling behind: failover would lose data |
| Hot key detected | Single key > 1% of total ops | P3 | Client-side caching or key splitting needed |
| OOM kill | Kernel event | P1 | Immediate: promote replica. Investigate: maxmemory misconfigured or memory leak |
11. Design Trade-offs
| Decision | Option A | Option B | Recommendation | Reasoning |
|---|---|---|---|---|
| Concurrency model | Single-threaded (Redis) | Partition-serial (VoltDB) | Partition-serial for DB workloads | Single-threaded is simpler but caps at ~200K ops/sec. Partition-serial scales linearly with cores while keeping zero-lock simplicity per partition. |
| Eviction policy | LRU | W-TinyLFU | W-TinyLFU | Near-optimal hit ratios with minimal memory overhead. LRU is vulnerable to scan pollution. ARC is patented by IBM. |
| Inverted index storage | Sorted arrays | Roaring bitmaps | Hybrid | Roaring bitmaps for low-cardinality columns (status, country). Sorted arrays for high-cardinality (email, user_id). |
| Persistence | WAL only | Snapshot + WAL | Snapshot + WAL | Snapshot for fast recovery; WAL for minimizing data loss since last snapshot. Redis's hybrid approach is battle-tested. |
| Index placement (distributed) | Local per-shard | Global index service | Local per-shard | Operationally simpler. Scatter-gather on non-PK queries is acceptable if most queries include the partition key. |
| When data exceeds memory | Shard to more nodes | Tier to SSD | Tier first, shard second | Tiering is a one-machine change. Sharding adds operational complexity (split-brain, rebalancing, cross-shard queries). Tier when you can; shard when you must. Two-way door. |
| Snapshot mechanism | fork() + COW | Versioned no-fork | fork() for simplicity | fork() is well-understood and works. No-fork (Dragonfly) avoids memory doubling but adds implementation complexity. Start with fork(); switch if memory pressure justifies it. |
Staff+ Signal: "Tier first, shard second" is a key architectural principle. Adding tiered storage to a single node is a local change β your API, query engine, and index layer don't change. Sharding is a distributed systems change β you now have cross-shard queries, distributed transactions, rebalancing during growth, and split-brain during network partitions. A single node with 64GB RAM + 1TB NVMe SSD can handle far more data than people expect when the access pattern is skewed (and most real-world access patterns are). Instagram stored 300M mappings in <5GB by optimizing Redis data structures. Twitter runs 105TB across 10,000+ Redis instances β but most of those are sharded for throughput, not data size.
12. Common Interview Mistakes
Jumping to distributed systems before solving single-node: "I'd use a Redis cluster with consistent hashing." Stop. First design the data structures. Show you understand what's inside each node before you distribute it. What staff+ candidates say: "Let me start with a single-node in-memory store, prove where it breaks with math, then distribute."
Using a global lock for concurrency: "I'd add a mutex around the hash map." This destroys the entire latency advantage of being in-memory. A mutex acquisition (~25ns uncontended) is the same order of magnitude as a hash map lookup (~50ns). What staff+ candidates say: "For an in-memory DB, lock overhead dominates compute time. I'd use partition-serial execution: each partition is single-threaded, eliminating locks entirely."
Forgetting index maintenance cost on writes: "I'd add an inverted index on every column." Each index means every write does an additional insert into the index data structure. 5 indexes = 6x write amplification. What staff+ candidates say: "Indexes trade write performance for read performance. I'd index selectively based on query patterns, and I'd account for the write amplification in my capacity planning."
Saying 'just add more RAM' when data exceeds memory: This isn't a strategy, it's a credit card. RAM costs ~$5/GB/month; SSD costs ~$0.10/GB/month. At 1TB, that's $5,000/month vs. $100/month. What staff+ candidates say: "I'd implement tiered storage with a buffer pool. Indexes stay in memory, hot data pages stay in memory, and cold pages are evicted to SSD. The key is the eviction policy: W-TinyLFU gives near-optimal hit ratios."
Designing the inverted index without considering intersection: Building an inverted index is the easy part. The hard part is multi-column queries that require intersecting posting lists efficiently. What staff+ candidates say: "For WHERE city = X AND status = Y, I intersect the posting lists. I'd use sorted arrays for this: merge intersection is O(n+m). For low-cardinality columns like status, roaring bitmaps give me O(1) bitwise AND."
No persistence story: "It's in-memory, so no disk." Then your entire database disappears on restart. What staff+ candidates say: "I'd use a write-ahead log for durability: every write is appended to disk before being acknowledged. Periodic snapshots reduce recovery time. The trade-off is WAL fsync frequency: every write (durable but slower) vs. every second (faster but up to 1s data loss)."
Ignoring memory fragmentation: Calculating capacity as "64GB RAM / row size = max rows." Real memory usage is 20-30% higher due to allocator overhead and fragmentation. What staff+ candidates say: "I'd budget 70% of RAM for data, 10% for indexes, and 20% for fragmentation headroom and OS overhead. Redis uses jemalloc with active defragmentation to manage this."
13. Interview Cheat Sheet
Time Allocation (45-minute interview)
| Segment | Time | Focus |
|---|---|---|
| Clarify requirements | 3 min | Scope: KV store vs. SQL-like DB? Read vs. write heavy? Data size vs. RAM? |
| Part 1: Core data structures | 8 min | Hash map for primary index, row storage, API (CRUD + query) |
| Part 2: Inverted index | 8 min | Index structure, posting lists, multi-column intersection, write amplification |
| Part 3: Exceeding memory | 10 min | Buffer pool, page-based storage, eviction policies, tiered architecture |
| Part 4: Hot/frequent data | 8 min | W-TinyLFU, preloading strategies, cache warming on cold start |
| Concurrency + persistence | 5 min | Concurrency model choice, WAL + snapshots |
| Trade-offs + wrap-up | 3 min | Key decisions, failure modes, what you'd do differently at 10x scale |
Step-by-Step Answer Guide
Clarify: "Is this a key-value store like Redis or a relational store with SQL-like queries? What's the expected data size relative to available memory? Read-heavy or write-heavy?"
Part 1 - Data structures: "I'd use a hash map for primary key access (O(1) lookup). Rows stored as column-order arrays. A table catalog maps table names to their schema and data. For the LeetCode version, this is sufficient."
Part 2 - Inverted index: "For querying non-PK columns, I'd add an inverted index: a hash map from column value → posting list of row IDs. For multi-column queries, I intersect posting lists. I'd use roaring bitmaps for low-cardinality columns and sorted arrays for high-cardinality. Write amplification: each indexed column adds one index update per write."
Part 3 - Exceeding memory: "I'd introduce a buffer pool that manages fixed-size pages. Hot pages stay in RAM; cold pages are evicted to SSD. The storage engine interface is transparent: the query layer doesn't know if a page is in memory or on disk. Indexes stay in memory always (they're small relative to data)."
Part 4 - Hot data + preloading: "For eviction, I'd use W-TinyLFU: it combines recency (window LRU) with frequency (Count-Min Sketch) for near-optimal hit ratios. For preloading, I'd dump the buffer pool page list on shutdown and reload those pages on startup. For predictive preloading, I'd track access patterns and prefetch correlated pages."
Concurrency: "In-memory operations are so fast that lock overhead dominates. I'd use partition-serial execution: each data partition has a single thread, zero locks. Scale by adding partitions. Cross-partition transactions use 2PC but should be rare with good partitioning."
Persistence: "Hybrid: WAL for durability (append every write, fsync per second) + periodic snapshots for fast recovery. Recovery: load latest snapshot, replay WAL from that point."
Failure modes: "OOM: eviction policy as backpressure, degrade to disk reads. Process crash: promote replica, replay WAL. Memory fragmentation: jemalloc with active defrag, budget 20% headroom."
Trade-offs: "I optimized for read latency over write throughput (indexes add write cost). I chose tiered storage over sharding first because it's a local change. I'd shard when single-node I/O becomes the bottleneck."
What the Interviewer Wants to Hear
At L5/Senior: Parts 1-2 done well. Hash map for primary index, inverted index for queries, basic concurrency awareness.
At L6/Staff: Parts 1-3 done well, Part 4 started. Buffer pool design, eviction policy choice with reasoning, persistence story, concurrency model justification citing real systems (Redis, VoltDB).
At L7/Principal: All 4 parts with depth. Quantified trade-offs (W-TinyLFU vs. ARC vs. LRU with hit-ratio reasoning), production war stories (Coinbase connection stampede, Instagram memory optimization), organizational insight ("tier first, shard second" as an operational complexity argument), and migration path from each stage to the next.
Written by Michi Meow as a reference for staff-level system design interviews. The in-memory database is the rare interview question that starts at LeetCode difficulty and ends at distributed systems difficulty, all in 45 minutes. The candidates who ace it are the ones who see all four parts as one evolving system, not four separate questions.