PagedAttention: How Virtual Memory Revolutionized LLM Inference

A deep dive into PagedAttention, the breakthrough memory management technique that enables efficient LLM serving. Learn how borrowing ideas from OS virtual memory solved the KV cache memory problem and unlocked 2-4x higher serving throughput.

TL;DR

| Aspect | Traditional KV Cache | PagedAttention |
|---|---|---|
| Memory waste | 60-80% | <4% |
| Allocation | Pre-allocated, contiguous | On-demand, non-contiguous |
| Fragmentation | High (internal + external) | Near-zero |
| Throughput | Baseline | 2-4x improvement |
| Memory sharing | Not supported | Copy-on-write enabled |


The Problem: Why LLM Serving Is Memory-Hungry

When you ask an LLM to generate text, something interesting happens behind the scenes. The model doesn't just process your prompt once; it needs to remember what it has already seen to generate each new token coherently.

What Is the KV Cache?

In transformer models, the self-attention mechanism computes Key and Value vectors for every token. During text generation, these vectors are cached to avoid redundant computation:
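Here is a minimal sketch of the idea for a single layer and head, with illustrative shapes (real implementations keep one such cache per layer and per attention head):

```python
import torch

# Hypothetical KV cache for one request, one layer, one attention head.
HEAD_DIM = 128
k_cache = torch.empty(0, HEAD_DIM)   # (num_cached_tokens, head_dim)
v_cache = torch.empty(0, HEAD_DIM)

def decode_step(q_new, k_new, v_new):
    """Append the new token's K/V to the cache, then attend over every cached token."""
    global k_cache, v_cache
    k_cache = torch.cat([k_cache, k_new])        # grows by one row per generated token
    v_cache = torch.cat([v_cache, v_new])
    scores = (q_new @ k_cache.T) / HEAD_DIM ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ v_cache                     # attention output for the new token
```

Without the cache, every decode step would recompute K and V for the entire prefix; with it, each step only computes them for the newest token.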

The Memory Math

For a model like LLaMA-13B, the KV cache for a single token requires roughly 800 KB at FP16 (2 vectors per layer × 40 layers × 5120 hidden dimensions × 2 bytes).

For a 2048-token sequence: ~1.6 GB just for KV cache per request (roughly 3.2 GB at FP32)!
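As a back-of-the-envelope check (assuming 40 layers, hidden size 5120, and 2 bytes per value for FP16; the exact figure depends on precision and model configuration):

```python
# Rough KV cache sizing for a LLaMA-13B-class model (illustrative numbers).
num_layers    = 40        # transformer layers
hidden_size   = 5120      # num_heads * head_dim
bytes_per_val = 2         # FP16

# 2x because both a Key and a Value vector are stored per token, per layer.
bytes_per_token = 2 * num_layers * hidden_size * bytes_per_val
print(bytes_per_token / 1024)              # ~800 KB per token
print(bytes_per_token * 2048 / 1024**3)    # ~1.6 GB for a 2048-token sequence (FP16)
# At FP32 (4 bytes per value) the same sequence needs roughly 3.2 GB.
```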

The Allocation Nightmare

Traditional systems pre-allocate memory for the maximum possible sequence length:
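A sketch of what that looks like; the tensor shape and names are illustrative, not any particular framework's API:

```python
import torch

MAX_SEQ_LEN = 2048
NUM_LAYERS, HIDDEN = 40, 5120

# Reserve the worst case up front: contiguous K and V buffers per layer,
# sized for MAX_SEQ_LEN tokens (about 1.6 GB in FP16 for these dimensions).
kv_cache = torch.empty(NUM_LAYERS, 2, MAX_SEQ_LEN, HIDDEN, dtype=torch.float16)

actual_len = 50                          # tokens this request actually used
used_fraction = actual_len / MAX_SEQ_LEN
print(f"{100 * (1 - used_fraction):.1f}% of the reservation is never touched")
```

If the request finishes after 50 tokens, over 97% of that reservation was never used, and no other request could touch it in the meantime.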

Two Types of Fragmentation

1. Internal Fragmentation: wasted space within allocated blocks. For example, a slot reserved for 2048 tokens but used for only 50 leaves the rest idle.

2. External Fragmentation: unusable gaps between allocated blocks. Freed regions of mismatched sizes leave holes too small to fit the next request's contiguous allocation.

> [!WARNING]
> Studies show that existing LLM serving systems waste 60-80% of KV cache memory due to fragmentation. This directly limits how many requests can be batched together, reducing throughput.



The Solution: PagedAttention

PagedAttention is a memory management technique introduced in the vLLM paper (SOSP 2023) that borrows ideas from operating system virtual memory to solve the KV cache problem.

The Key Insight: OS Virtual Memory

In operating systems, programs don't directly access physical memory. Instead:

  1. Programs use virtual addresses (logical)

  2. The OS maps these to physical addresses using a page table

  3. Physical memory is divided into fixed-size pages (typically 4KB)

  4. Pages can be non-contiguous in physical memory

PagedAttention: The Same Idea for KV Cache

PagedAttention applies this concept to KV cache memory (a minimal sketch of the bookkeeping follows the list):

  1. KV Cache = Virtual Memory: Each request has a "logical" view of its KV cache

  2. Blocks = Pages: KV cache is divided into fixed-size blocks (e.g., 16 tokens each)

  3. Block Table = Page Table: Maps logical blocks to physical GPU memory locations

  4. On-Demand Allocation: Blocks are allocated only when needed
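
Here is a toy sketch of that mapping in Python. The class and method names are hypothetical, purely to illustrate the block-table idea; they are not vLLM's internal API:

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class BlockManager:
    """Toy block-table bookkeeping: maps each request's logical blocks to
    physical block IDs drawn from one shared, GPU-wide pool."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks: List[int] = list(range(num_physical_blocks))
        self.block_tables: Dict[str, List[int]] = {}   # request_id -> physical block IDs

    def allocate_block(self, request_id: str) -> int:
        """Grab one free physical block and append it to the request's block table."""
        physical_id = self.free_blocks.pop()
        self.block_tables.setdefault(request_id, []).append(physical_id)
        return physical_id

    def lookup(self, request_id: str, token_pos: int) -> Tuple[int, int]:
        """Translate a token position into (physical_block, offset), like a page table."""
        logical_block, offset = divmod(token_pos, BLOCK_SIZE)
        return self.block_tables[request_id][logical_block], offset
```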


How It Works: Step-by-Step

Step 1: Initial Prompt Processing

When a request arrives, PagedAttention allocates blocks on-demand as prompts are processed:
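Continuing the toy BlockManager above, prefill only needs enough blocks to cover the prompt, rounded up to the block size:

```python
import math

manager = BlockManager(num_physical_blocks=1024)

prompt_len = 35                                      # tokens in the incoming prompt
blocks_needed = math.ceil(prompt_len / BLOCK_SIZE)   # 35 tokens -> 3 blocks of 16

for _ in range(blocks_needed):
    manager.allocate_block("request-A")

# Only the last block has empty slots (3 * 16 - 35 = 13 unused positions),
# so waste is bounded by one block rather than by the model's max sequence length.
print(manager.block_tables["request-A"])
```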

Step 2: Token Generation

As new tokens are generated, blocks are allocated only when the current block fills up:
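Still using the same sketch, the decode loop requests a fresh block only when every previously allocated slot is occupied:

```python
# Decode: a new physical block is requested only when the existing ones are full.
num_tokens = prompt_len
for _ in range(100):                       # generate 100 tokens
    if num_tokens % BLOCK_SIZE == 0:       # all allocated slots are occupied
        manager.allocate_block("request-A")
    num_tokens += 1                        # the new token's K/V lands in the last block
```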

Step 3: Multiple Requests with Non-Contiguous Allocation

The magic happens when multiple requests share GPU memory:
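With the same toy manager and a small pool, two requests draw blocks from the shared free list as they grow, so each one ends up with physically scattered blocks:

```python
manager = BlockManager(num_physical_blocks=8)   # fresh, small pool for clarity

for _ in range(3):                              # both requests grow in lockstep
    manager.allocate_block("request-A")
    manager.allocate_block("request-B")

print(manager.block_tables["request-A"])        # [7, 5, 3] -- non-contiguous
print(manager.block_tables["request-B"])        # [6, 4, 2] -- interleaved with A
```

Because the attention kernel reads through the block table, neither request cares that its blocks are not adjacent in GPU memory.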


Memory Sharing: The Copy-on-Write Advantage

PagedAttention enables efficient memory sharing for advanced use cases like parallel sampling and beam search.

Scenario: Parallel Sampling

When generating multiple responses from the same prompt:
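Conceptually, every sample's block table points at the same physical prompt blocks, tracked with reference counts (again a toy illustration, not vLLM internals):

```python
# Two samples generated from one prompt share the prompt's physical KV blocks.
prefix_blocks = [7, 5, 3]                            # physical blocks holding the prompt's KV
ref_counts = {block: 2 for block in prefix_blocks}   # referenced by both samples

block_table = {
    "sample-1": list(prefix_blocks),                 # both tables map to the same memory
    "sample-2": list(prefix_blocks),
}
# The prompt's KV cache is stored once instead of once per sample.
```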

Copy-on-Write Mechanism

When a shared block needs modification, it's copied first:
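Continuing the sketch above, a write into a block that is still shared triggers a copy first, exactly like copy-on-write pages in an OS:

```python
def write_token(sample_id: str, logical_block: int, free_blocks: list) -> None:
    """Copy a shared block before this sample writes its own K/V into it."""
    physical = block_table[sample_id][logical_block]
    if ref_counts.get(physical, 1) > 1:          # someone else still references it
        new_block = free_blocks.pop()            # claim a private copy
        # (a real system would copy the block's KV data on the GPU here)
        ref_counts[physical] -= 1
        ref_counts[new_block] = 1
        block_table[sample_id][logical_block] = new_block
    # the block is now exclusively owned, so the write is safe

write_token("sample-2", logical_block=2, free_blocks=[1, 0])
```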


Performance Benefits

Memory Efficiency Comparison

Because blocks are allocated on demand, per-request waste is bounded by at most one partially filled block, compared with the 60-80% lost to max-length pre-allocation and fragmentation in traditional systems.

Throughput Improvement

Real-world benchmarks from the vLLM paper show significant improvements:

| Model | Sequence Length | Improvement vs FasterTransformer |
|---|---|---|
| OPT-13B | 512 | 2.2x |
| OPT-13B | 2048 | 4.3x |
| LLaMA-13B | 512 | 2.4x |
| LLaMA-13B | 2048 | 3.8x |

> [!TIP]
> The improvement is more pronounced with longer sequences because traditional systems waste more memory with larger max-length allocations.

Why Higher Throughput?

The memory reclaimed from fragmentation goes straight into batching: with near-zero waste, the scheduler can keep far more sequences resident at once, so each forward pass produces more tokens and the GPU is limited by compute rather than by memory.


Practical Examples

Using vLLM (PagedAttention Built-in)
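vLLM uses PagedAttention automatically; there is nothing to enable. A minimal offline-inference example (the model name and sampling settings are just illustrative):

```python
from vllm import LLM, SamplingParams

# PagedAttention runs under the hood; you only set high-level knobs.
llm = LLM(model="meta-llama/Llama-2-13b-hf", gpu_memory_utilization=0.90)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Explain PagedAttention in one sentence."], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```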

Parallel Sampling with Shared Prefixes
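Requesting several samples per prompt is where block sharing pays off: all samples reuse the prompt's KV blocks and only diverge through copy-on-write. A sketch with illustrative settings:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-13b-hf")

# n=4 draws four completions from the same prompt; the prompt's KV blocks
# are shared across all four sequences rather than duplicated.
params = SamplingParams(n=4, temperature=1.0, max_tokens=64)
outputs = llm.generate(["Write a haiku about virtual memory."], params)

for completion in outputs[0].outputs:
    print(completion.text)
```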

API Server Configuration
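A typical way to stand up vLLM's OpenAI-compatible server; the flag names below reflect vLLM at the time of writing and may differ across versions, so check `--help` for your install:

```bash
# Serve a model with PagedAttention-backed memory management.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-13b-hf \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 256 \
    --block-size 16 \
    --port 8000
```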


Comparison with Traditional Approaches

| Feature | Static Allocation | Chunked Attention | PagedAttention |
|---|---|---|---|
| Memory allocation | Pre-allocated max | Chunked prefill | On-demand blocks |
| Fragmentation | High | Medium | Near-zero |
| Memory sharing | None | Limited | Full (CoW) |
| Throughput | Baseline | 1.2-1.5x | 2-4x |
| Long sequences | Poor | Better | Excellent |
| Implementation | Simple | Medium | Complex |

Key Takeaways

  1. The KV cache is memory-hungry: Each token requires ~1MB+ of memory for large models

  2. Traditional allocation wastes 60-80% of memory: Pre-allocation and fragmentation severely limit batch sizes

  3. PagedAttention borrows from OS concepts: Virtual memory, paging, and copy-on-write solve the memory efficiency problem

  4. Block tables enable flexible allocation: Logical-to-physical mapping allows non-contiguous, on-demand memory usage

  5. Memory sharing amplifies benefits: Shared prefixes and copy-on-write make parallel sampling highly efficient

  6. Real-world impact is significant: 2-4x throughput improvement with near-zero memory waste

> [!IMPORTANT]
> PagedAttention is now the industry standard for LLM serving. If you're deploying LLMs in production, use a serving framework that implements it (vLLM, TensorRT-LLM, etc.).


References

  1. Kwon, W., et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP '23.

  2. Yu, G., et al. (2022). "Orca: A Distributed Serving System for Transformer-Based Generative Models." OSDI '22.


Last updated: February 2026
