PagedAttention: How Virtual Memory Revolutionized LLM Inference
A deep dive into PagedAttention, the breakthrough memory management technique that enables efficient LLM serving. Learn how borrowing ideas from OS virtual memory solved the KV cache memory problem and unlocked 2-4x higher serving throughput.
TL;DR
| Aspect | Traditional KV Cache | PagedAttention |
|---|---|---|
| Memory waste | 60-80% | <4% |
| Allocation | Pre-allocated, contiguous | On-demand, non-contiguous |
| Fragmentation | High (internal + external) | Near-zero |
| Throughput | Baseline | 2-4x improvement |
| Memory sharing | Not supported | Copy-on-write enabled |
The Problem: Why LLM Serving Is Memory-Hungry
When you ask an LLM to generate text, something interesting happens behind the scenes. The model doesn't just process your prompt once; it needs to remember what it has already seen to generate each new token coherently.
What Is the KV Cache?
In transformer models, the self-attention mechanism computes Key and Value vectors for every token. During text generation, these vectors are cached to avoid redundant computation:
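To make the mechanism concrete, here is a minimal single-head decoding step in PyTorch (the names and shapes are illustrative, not vLLM internals): each new token appends its key/value vectors to the cache and attends over everything cached so far, instead of recomputing K and V for the whole prefix.

```python
import torch

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    """One autoregressive decoding step with a KV cache (single head).

    q_new, k_new, v_new: (1, d) tensors for the newest token.
    k_cache, v_cache:    (t, d) tensors holding all previous tokens.
    """
    # Append the new token's K/V instead of recomputing the whole prefix
    k_cache = torch.cat([k_cache, k_new], dim=0)   # (t+1, d)
    v_cache = torch.cat([v_cache, v_new], dim=0)   # (t+1, d)

    # Attend over every cached position
    scores = (q_new @ k_cache.T) / k_cache.shape[-1] ** 0.5   # (1, t+1)
    weights = torch.softmax(scores, dim=-1)
    output = weights @ v_cache                                # (1, d)
    return output, k_cache, v_cache
```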
The Memory Math
For a model like LLaMA-13B (40 transformer layers, hidden size 5120), the KV cache for a single token requires roughly 2 (key and value) × 40 layers × 5120 dimensions × 2 bytes (FP16) ≈ 800 KB.
For a 2048-token sequence, that adds up to ~1.6 GB of KV cache per request!
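The same arithmetic as a quick script (the layer count, hidden size, and FP16 assumption are for a LLaMA-13B-style model; adjust for yours):

```python
# Back-of-the-envelope KV cache size for a LLaMA-13B-style model.
num_layers = 40          # transformer layers
hidden_size = 5120       # model dimension (num_heads * head_dim)
dtype_bytes = 2          # FP16
seq_len = 2048

bytes_per_token = 2 * num_layers * hidden_size * dtype_bytes   # K and V
print(f"Per token:   {bytes_per_token / 1024:.0f} KB")               # ~800 KB
print(f"Per request: {bytes_per_token * seq_len / 1024**3:.1f} GB")  # ~1.6 GB
```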
The Allocation Nightmare
Traditional systems pre-allocate memory for the maximum possible sequence length:
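A toy illustration with made-up numbers: even a short completion reserves the full context window up front.

```python
# Hypothetical request: short prompt, short completion, but the serving
# system reserves KV cache slots for the full context window up front.
max_seq_len = 2048                       # reserved per request
prompt_len, generated_len = 20, 60       # what the request actually uses

used = prompt_len + generated_len
reserved = max_seq_len
print(f"Used {used}/{reserved} slots "
      f"({1 - used / reserved:.0%} of the reservation wasted)")
```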
Two Types of Fragmentation
1. Internal Fragmentation: wasted space within allocated blocks (slots reserved for tokens that are never generated)
2. External Fragmentation: unusable gaps between allocated blocks (the sketch below simulates both effects)
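A small, purely illustrative simulation of a contiguous allocator: live requests waste the unused tail of their reservations, and the gaps left by finished requests are individually too small for a new request even when the total free space would suffice.

```python
# Toy contiguous allocator (illustrative numbers): each request reserves
# a contiguous region sized for the maximum sequence length.
live = [(2048, 300), (2048, 1900), (2048, 150)]   # (reserved, actually used)

# Internal fragmentation: reserved-but-unused slots inside live allocations
internal = sum(reserved - used for reserved, used in live)

# External fragmentation: gaps left behind by finished requests
free_gaps = [1500, 1800]
new_request = 2048
fits = any(gap >= new_request for gap in free_gaps)

print(f"Internal fragmentation: {internal} wasted slots")
print(f"Free slots: {sum(free_gaps)}, "
      f"but a {new_request}-slot request fits: {fits}")   # False
```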
> [!WARNING]
> Studies show that existing LLM serving systems waste 60-80% of KV cache memory due to fragmentation. This directly limits how many requests can be batched together, reducing throughput.
The Solution: PagedAttention
PagedAttention is a memory management technique introduced in the vLLM paper (SOSP 2023) that borrows ideas from operating system virtual memory to solve the KV cache problem.
The Key Insight: OS Virtual Memory
In operating systems, programs don't directly access physical memory. Instead:
- Programs use virtual addresses (logical)
- The OS maps these to physical addresses using a page table
- Physical memory is divided into fixed-size pages (typically 4 KB)
- Pages can be non-contiguous in physical memory (the sketch after this list shows the address translation)
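As a minimal sketch of that translation (the page size is standard, the table contents are made up), a virtual address splits into a page number and an offset, and the page table supplies the physical frame:

```python
PAGE_SIZE = 4096  # bytes, i.e. 4 KB pages

# Hypothetical page table: virtual page number -> physical frame number.
# Note that the frames are not contiguous.
page_table = {0: 7, 1: 3, 2: 12}

def translate(virtual_address: int) -> int:
    """Map a virtual address to a physical address via the page table."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table[page_number]
    return frame * PAGE_SIZE + offset

print(hex(translate(0x1234)))  # virtual page 1, offset 0x234 -> 0x3234
```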
PagedAttention: The Same Idea for KV Cache
PagedAttention applies this concept to KV cache memory:
- KV Cache = Virtual Memory: each request has a "logical" view of its KV cache
- Blocks = Pages: the KV cache is divided into fixed-size blocks (e.g., 16 tokens each)
- Block table = Page table: a per-request block table maps logical blocks to physical blocks, which can sit anywhere in GPU memory (a toy version is sketched after this list)
- Memory sharing amplifies the benefits: shared prefixes and copy-on-write make parallel sampling highly efficient
- Real-world impact is significant: 2-4x throughput improvement with near-zero memory waste
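Here is a toy sketch of those pieces, not vLLM's actual implementation: a fixed block size, reference-counted physical blocks, and a per-sequence block table, so shared prefixes can be forked cheaply and copied only when written.

```python
BLOCK_SIZE = 16  # tokens per KV cache block

class BlockAllocator:
    """Hands out fixed-size physical blocks and tracks sharing."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.ref_count = [0] * num_blocks

    def allocate(self) -> int:
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def fork(self, block: int) -> int:
        """Share an existing block, e.g. a common prompt prefix."""
        self.ref_count[block] += 1
        return block

    def copy_on_write(self, block: int) -> int:
        """Called before writing to a block that might be shared."""
        if self.ref_count[block] == 1:
            return block                  # sole owner: write in place
        self.ref_count[block] -= 1
        return self.allocate()            # give the writer a private copy

class Sequence:
    """Per-request logical view: a block table maps logical -> physical."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:        # need a new physical block
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

# Two sampling branches share the prompt's blocks until one of them writes
allocator = BlockAllocator(num_blocks=64)
prompt = Sequence(allocator)
for _ in range(40):                                  # 40-token prompt -> 3 blocks
    prompt.append_token()

branch = Sequence(allocator)
branch.block_table = [allocator.fork(b) for b in prompt.block_table]
branch.num_tokens = prompt.num_tokens
# Before the branch writes into its last block, it gets a private copy:
branch.block_table[-1] = allocator.copy_on_write(branch.block_table[-1])
```

The attention kernel then gathers keys and values through the block table, which is why the physical blocks never need to be contiguous.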
> [!IMPORTANT]
> PagedAttention is now the industry standard for LLM serving. If you're deploying LLMs in production, use a serving framework that implements it (vLLM, TensorRT-LLM, etc.).
```text
Traditional System (can only batch 2 requests):

┌─────────────────────────────────────────────────────┐
│ GPU Memory                                          │
│ ┌───────────────────────┐ ┌───────────────────────┐ │
│ │ Request A (pre-alloc) │ │ Request B (pre-alloc) │ │
│ │ 50% capacity each     │ │ 50% capacity each     │ │
│ └───────────────────────┘ └───────────────────────┘ │
│                                                     │
│  ✗ No room for Request C!                           │
└─────────────────────────────────────────────────────┘

PagedAttention (can batch 8 requests):

┌─────────────────────────────────────────────────────┐
│ GPU Memory                                          │
│ ┌───────────────────────────────────┐               │
│ │A0│B0│C0│D0│A1│B1│E0│F0│C1│G0│H0│A2│               │
│ └───────────────────────────────────┘               │
│                                                     │
│  ✓ 8 concurrent requests with dynamic allocation!   │
└─────────────────────────────────────────────────────┘
```
```python
from vllm import LLM, SamplingParams

# vLLM uses PagedAttention by default
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",
    # PagedAttention-related settings
    block_size=16,                  # Tokens per block (default: 16)
    gpu_memory_utilization=0.90,    # Use 90% of GPU memory
    max_num_seqs=256,               # Max concurrent sequences
    max_num_batched_tokens=4096,    # Max tokens per iteration
)

# Multiple prompts automatically benefit from PagedAttention
prompts = [
    "Explain quantum computing in simple terms",
    "Write a haiku about artificial intelligence",
    "What are the benefits of renewable energy?",
    "How does machine learning differ from AI?",
]

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256,
)

# Efficient batch processing with memory sharing
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Response: {output.outputs[0].text[:100]}...")
    print()
```
```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-13b-hf")

# Same prompt, multiple responses:
# PagedAttention shares the prompt's KV cache across all samples
prompt = "Write a creative story about a robot learning to paint:"

sampling_params = SamplingParams(
    n=5,               # Generate 5 different responses
    temperature=0.9,   # More creative
    max_tokens=200,
)

# Memory efficient! Prompt KV cache is shared, not duplicated 5x
outputs = llm.generate([prompt], sampling_params)

for i, output in enumerate(outputs[0].outputs):
    print(f"=== Response {i+1} ===")
    print(output.text[:200])
    print()
```
```bash
# Start vLLM server with optimal PagedAttention settings
vllm serve meta-llama/Llama-2-13b-hf \
    --block-size 16 \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 256 \
    --enable-prefix-caching  # Share KV cache for common prefixes
```

```python
# Client code (OpenAI-compatible API)
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token",
)

# All requests automatically benefit from PagedAttention
response = client.chat.completions.create(
    model="meta-llama/Llama-2-13b-hf",
    messages=[
        {"role": "user", "content": "Explain PagedAttention in one sentence"}
    ],
    max_tokens=100,
)

print(response.choices[0].message.content)
```
References
Kwon, W., et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP '23.