Speculative Decoding: How to Make LLMs 2-3x Faster Without Losing Quality
A comprehensive guide to speculative decoding, the technique that accelerates LLM inference by 2-3x while maintaining identical output quality. Learn how draft-then-verify works, the math behind acceptance, and how to use it in practice with Hugging Face and vLLM.
The Problem: Why LLMs Are Slow
Large Language Models (LLMs) generate text one token at a time through a process called autoregressive decoding. Each token requires a full forward pass through billions of parameters, and each pass must wait for the previous one to complete.
Input: "The cat sat on the"
    ↓ (forward pass 1)
"mat"
    ↓ (forward pass 2)
"and"
    ↓ (forward pass 3)
"purred"
...
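To make the cost concrete, here is a minimal sketch of the standard autoregressive loop. The model object and its next_token_logits method are hypothetical stand-ins for one full forward pass; the point is simply that every new token requires its own pass.

# Minimal sketch of standard autoregressive (greedy) decoding.
# `model.next_token_logits` is a hypothetical stand-in for one full forward pass.
def generate_autoregressive(model, prompt_tokens, num_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(num_new_tokens):  # one full forward pass per generated token
        logits = model.next_token_logits(tokens)
        next_token = max(range(len(logits)), key=lambda t: logits[t])  # greedy argmax
        tokens.append(next_token)
    return tokens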
This sequential nature creates a fundamental bottleneck: generating 100 tokens requires 100 forward passes, regardless of how powerful your hardware is. The model is often "memory-bound" rather than "compute-bound": it spends more time loading weights than doing actual computation.
Generation speed: ~30-50 tokens/second on high-end hardware
User experience: Noticeable lag, especially for longer responses
What if we could generate multiple tokens in the time of one forward pass?
The Solution: Speculative Decoding
Speculative decoding (also called "speculative sampling") is a technique that uses a smaller, faster "draft" model to predict what the larger "target" model would say, then verifies those predictions in parallel.
The key insight: verification is cheaper than generation.
The Core Idea
Instead of asking the large model for one token at a time, a small draft model proposes several tokens ahead, and the large target model checks all of them in a single forward pass, keeping the longest correct prefix.
Why This Works
Parallel verification: The target model can compute probabilities for ALL draft tokens simultaneously in a single forward pass (see the sketch after this list)
KV cache reuse: Once tokens are verified, their KV cache entries are kept for the next iteration
Lossless quality: The final output distribution is mathematically identical to standard decoding
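To see the parallel-verification point concretely, the sketch below uses Hugging Face Transformers with the small gpt2 checkpoint purely as a stand-in target model: a single forward pass over the prompt plus the draft tokens returns a next-token distribution at every position, so all drafts can be scored at once.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small model used only to illustrate the shapes; any causal LM behaves the same way.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = tok("The cat sat on the", return_tensors="pt").input_ids  # shape [1, n]
draft = tok(" mat and purred", return_tensors="pt").input_ids      # pretend these are the K draft tokens
sequence = torch.cat([prompt, draft], dim=1)                        # shape [1, n + K]

with torch.no_grad():
    logits = model(sequence).logits  # shape [1, n + K, vocab_size]

# One forward pass yields the target model's next-token distribution at every position,
# including all the positions needed to check the draft tokens.
print(logits.shape)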
A Simple Analogy: The Editor and the Intern
Imagine a busy editor (the target model) who must review every word in a document. Instead of writing each word themselves:
The intern (draft model) writes a rough draft quickly
The editor reviews the entire draft at once
If the first 3 sentences are good, the editor accepts them
The editor corrects the first wrong sentence and discards everything after it
The intern writes a new draft starting from the correction
The editor saves time because reviewing is faster than writing, and most of the intern's work is good enough to keep.
How It Works: Step by Step
Step 1: Draft Token Generation
The smaller draft model (e.g., 7B parameters) generates K candidate tokens autoregressively:
# Pseudocode: the draft model proposes K tokens autoregressively
draft_tokens = []
for i in range(K):  # K = 4-8 typically
    next_token = draft_model.generate(input + draft_tokens)
    draft_tokens.append(next_token)
Step 2: Parallel Verification
The target model (e.g., 70B parameters) processes all draft tokens in a single forward pass:
# Pseudocode: a single forward pass computes probabilities for all positions
target_probs = target_model.forward(input + draft_tokens)
draft_probs = [p1, p2, p3, p4, ...]  # saved from the draft phase
Step 3: Accept or Reject (Rejection Sampling)
For each draft token, compare probabilities:
For the token at position i with value x:
    If P_target(x) >= P_draft(x):
        → Always ACCEPT (the draft was conservative)
    Else:
        → Accept with probability P_target(x) / P_draft(x)
        → If rejected: stop, and sample a new token from the adjusted (residual) distribution
This is the mathematical magic that makes speculative decoding lossless.
Step 4: Continue
Keep the longest accepted prefix of draft tokens, then:
If all K tokens are accepted: sample 1 bonus token from the target model, for K+1 new tokens in total
If only n tokens are accepted (n < K): use the target model's corrected token (drawn from the residual distribution) at position n+1, for n+1 new tokens in total
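Putting the four steps together, here is a minimal, illustrative sketch of one draft-then-verify round. It assumes hypothetical draft_model and target_model objects exposing a probs(tokens) method that returns the next-token probability vector after a given prefix; it is a sketch of the algorithm above, not any particular library's API.

import numpy as np

def speculative_decode_step(target_model, draft_model, tokens, K, rng):
    # Step 1: draft K candidate tokens autoregressively with the small model.
    draft_tokens, draft_probs = [], []
    for _ in range(K):
        q = draft_model.probs(tokens + draft_tokens)
        draft_tokens.append(int(rng.choice(len(q), p=q)))
        draft_probs.append(q)

    # Step 2: in a real system this is ONE forward pass of the target model over
    # tokens + draft_tokens; here we simply collect each position's distribution.
    target_probs = [target_model.probs(tokens + draft_tokens[:i]) for i in range(K + 1)]

    # Step 3: accept the longest prefix via rejection sampling.
    accepted = []
    for i, x in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
        else:
            # Rejected: resample from the residual distribution max(0, p - q), normalized.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return tokens + accepted  # stop after the correction token

    # Step 4: all K drafts accepted, so take a free "bonus" token from the target model.
    bonus = int(rng.choice(len(target_probs[K]), p=target_probs[K]))
    return tokens + accepted + [bonus]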
The Math: Why It's Lossless
Speculative decoding uses rejection sampling to ensure the output distribution matches what the target model would have produced alone.
Rejection Sampling Explained
For each draft token x with:
P_q(x) = draft model probability
P_p(x) = target model probability
The acceptance probability is:
α(x) = min(1, P_p(x) / P_q(x))
Case 1: P_p(x) ≥ P_q(x)
Accept with probability 1
The draft model was "pessimistic" about this token
Case 2: P_p(x) < P_q(x)
Accept with probability P_p(x) / P_q(x)
If rejected: Sample from the "residual" distribution:
P_residual(x) ∝ max(0, P_p(x) - P_q(x))
Why This Preserves the Distribution
The beauty of rejection sampling is that:
Accepted tokens come from the target distribution
Rejected tokens are resampled from a correction distribution
The combination exactly matches P_target
This guarantees the same output distribution as standard decoding: same quality, just faster.
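The claim is easy to check empirically for a single position. The sketch below uses NumPy with made-up toy distributions p (target) and q (draft): it runs the accept/resample rule many times and shows that the resulting token frequencies match p, not q.

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])  # toy target distribution P_p
q = np.array([0.3, 0.5, 0.2])  # toy draft distribution P_q

def speculative_sample_one(p, q, rng):
    x = rng.choice(len(q), p=q)               # draft proposes a token from q
    if rng.random() < min(1.0, p[x] / q[x]):  # accept with probability min(1, p/q)
        return x
    residual = np.maximum(p - q, 0.0)         # otherwise resample from the residual
    return rng.choice(len(p), p=residual / residual.sum())

samples = [speculative_sample_one(p, q, rng) for _ in range(200_000)]
freqs = np.bincount(samples, minlength=len(p)) / len(samples)
print(freqs)  # approximately [0.6, 0.3, 0.1]: matches p, not q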
Acceptance Rate and Speedup
The acceptance rate (α) determines the speedup:
Expected accepted tokens per round: τ = (1 - α^(K+1)) / (1 - α)
Where:
α = average acceptance probability
K = number of draft tokens
Acceptance Rate | Expected Tokens per Round | Effective Speedup
50%             | ~2 tokens                 | ~1.5x
70%             | ~3 tokens                 | ~2.0x
80%             | ~4 tokens                 | ~2.5x
90%             | ~5+ tokens                | ~3.0x
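As a quick sanity check on these numbers, the short script below evaluates the expected-tokens formula for the table's acceptance rates, assuming K = 6 draft tokens (an assumption; the table does not state K). The effective speedup is lower than τ because each round also pays for the K draft-model passes and one target-model pass.

# tau = (1 - alpha^(K+1)) / (1 - alpha): expected accepted tokens per round
def expected_tokens(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.8, 0.9):
    print(f"alpha={alpha:.0%}  K=6  expected tokens per round ~ {expected_tokens(alpha, 6):.1f}")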
Real-World Implementations
Hugging Face Transformers (Assisted Generation)
In the transformers library, speculative decoding is exposed as "assisted generation": you pass a smaller assistant model to generate():
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load target model
target_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

# Load draft (assistant) model
assistant_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Generate with speculative decoding
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = target_model.generate(
    **inputs,
    assistant_model=assistant_model,  # Enable speculative (assisted) decoding
    max_new_tokens=100,
    do_sample=False,  # Greedy decoding
)
print(tokenizer.decode(outputs[0]))
Choose a draft model from the same family, ~1/10 the size
Set K = 4-6 draft tokens as a starting point
Use greedy decoding (temperature=0) for maximum speedup
Monitor acceptance rate: aim for >70%
Profile memory to ensure both models fit in VRAM
vLLM
vLLM supports speculative decoding with either a separate draft model or prompt n-gram lookup:

from vllm import LLM, SamplingParams

# Method 1: Separate draft model
llm = LLM(
    model="facebook/opt-6.7b",
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
)

# Method 2: N-gram prompt lookup (no extra model needed!)
llm = LLM(
    model="facebook/opt-6.7b",
    speculative_model="[ngram]",  # Use prompt n-grams as the draft
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
)

sampling_params = SamplingParams(temperature=0, max_tokens=100)
output = llm.generate("The future of AI is", sampling_params)
Conclusion
Speculative decoding represents a fundamental shift in how we think about LLM inference. By exploiting the asymmetry between generation (expensive) and verification (cheap), it achieves 2-3x speedups while maintaining identical output quality.
Key Takeaways
Draft-then-verify: Small model drafts, large model verifies in parallel
Lossless: Rejection sampling ensures identical output distribution
Practical: Supported by vLLM, Hugging Face, and production systems
Trade-offs: Memory overhead vs. latency reduction
As LLMs grow larger and inference costs dominate, speculative decoding will become an essential tool in every ML engineer's toolkit.
References
Leviathan, Y. et al. (2022). "Fast Inference from Transformers via Speculative Decoding." arXiv:2211.17192
Chen, C. et al. (2023). "Accelerating Large Language Model Decoding with Speculative Sampling." arXiv:2302.01318