Speculative Decoding: How to Make LLMs 2-3x Faster Without Losing Quality
A comprehensive guide to speculative decoding, the technique that accelerates LLM inference by 2-3x while maintaining identical output quality. Learn how draft-then-verify works, the math behind accep
The Problem: Why LLMs Are Slow
Input: "The cat sat on the"
↓ (forward pass 1)
"mat"
↓ (forward pass 2)
"and"
↓ (forward pass 3)
"purred"
...
The Latency Problem
The Solution: Speculative Decoding
The Core Idea
Why This Works
A Simple Analogy: The Editor and the Intern
How It Works: Step by Step
Step 1: Draft Token Generation
Step 2: Parallel Verification
Step 3: Accept or Reject (Rejection Sampling)
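The accept-or-reject rule can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: `p` and `q` are assumed to be the target and draft next-token distributions over the same toy vocabulary, and the function names `residual_distribution` and `accept_or_resample` are hypothetical.

```python
import random

def residual_distribution(p, q):
    """Normalized max(0, p - q): the distribution to resample from after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p and q are identical, so a rejection cannot happen; fall back to p.
        return list(p)
    return [d / total for d in diff]

def accept_or_resample(token, p, q, rng=random):
    """Return (accepted, token) for a draft token that was sampled from q.

    The draft token is accepted with probability min(1, p[token] / q[token]).
    On rejection, a replacement is drawn from the residual distribution.
    """
    accept_prob = min(1.0, p[token] / q[token])
    if rng.random() < accept_prob:
        return True, token
    residual = residual_distribution(p, q)
    replacement = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return False, replacement
```

A draft token that the target likes at least as much as the draft does (`p[token] >= q[token]`) is always kept. Tokens the draft over-proposes are sometimes rejected, and the replacement is drawn only from tokens the draft under-proposes.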
Step 4: Continue
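Putting the four steps together, the loop can be sketched with toy models. This is a simplified greedy variant under stated assumptions: `target_next` and `draft_next` are hypothetical callables that map a token list to the next token, and verification uses exact match instead of rejection sampling. A real system would score all draft positions in one batched forward pass of the target model; the sketch calls the target once per position for clarity.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens

def speculative_greedy_decode(target_next, draft_next, prompt,
                              max_new_tokens, gamma=4):
    """Draft-then-verify loop with greedy (exact-match) verification."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # Step 1: the draft model proposes gamma tokens autoregressively.
        draft = []
        for _ in range(gamma):
            draft.append(draft_next(tokens + draft))
        # Step 2: the target picks its own token after every draft prefix.
        targets = [target_next(tokens + draft[:i]) for i in range(gamma + 1)]
        # Step 3: keep draft tokens until the first disagreement.
        n_accepted = 0
        while n_accepted < gamma and draft[n_accepted] == targets[n_accepted]:
            n_accepted += 1
        # Step 4: append the accepted tokens plus one target token
        # (the correction, or a bonus token if every draft was accepted).
        tokens.extend(draft[:n_accepted])
        tokens.append(targets[n_accepted])
    return tokens[:limit]
```

Each iteration adds at least one token chosen by the target, so the loop always makes progress, and the output matches what `greedy_decode` would produce with the target alone.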
The Math: Why It's Lossless
Rejection Sampling Explained
Why This Preserves the Distribution
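The argument can be written out for a single token. The notation here is introduced for illustration: q is the draft distribution, p is the target distribution, \beta is the overall probability that a draft token is accepted, and p' is the residual distribution used after a rejection.

```latex
\begin{align*}
\beta &= \sum_{x'} q(x') \min\!\left(1, \frac{p(x')}{q(x')}\right)
       = \sum_{x'} \min\bigl(p(x'), q(x')\bigr) \\
p'(x) &= \frac{\max\bigl(0,\, p(x) - q(x)\bigr)}{\sum_{x'} \max\bigl(0,\, p(x') - q(x')\bigr)}
       = \frac{\max\bigl(0,\, p(x) - q(x)\bigr)}{1 - \beta} \\
P(\text{output} = x)
      &= \underbrace{q(x) \min\!\left(1, \frac{p(x)}{q(x)}\right)}_{\text{drafted and accepted}}
       + \underbrace{(1 - \beta)\, p'(x)}_{\text{rejected, then resampled}} \\
      &= \min\bigl(p(x), q(x)\bigr) + \max\bigl(0,\, p(x) - q(x)\bigr) \\
      &= p(x)
\end{align*}
```

The second line uses the identity \max(0, p - q) = p - \min(p, q), which sums to 1 - \beta over the vocabulary. The final step holds in both cases: if p(x) \ge q(x) the two terms are q(x) and p(x) - q(x), and otherwise they are p(x) and 0.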
Acceptance Rate and Speedup
Acceptance Rate
Expected Tokens
Effective Speedup
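The expected-token and speedup formulas from the original speculative decoding analysis can be evaluated with a short helper. This is a sketch under simplifying assumptions: `alpha` is an average per-token acceptance rate treated as independent across positions, `gamma` is the number of draft tokens per step, and `c` is the cost of one draft forward pass relative to one target forward pass. The function names are hypothetical.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per target forward pass.

    Computes (1 - alpha**(gamma + 1)) / (1 - alpha), which tends to
    gamma + 1 as alpha approaches 1.
    """
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def expected_speedup(alpha, gamma, c):
    """Expected wall-clock speedup over plain autoregressive decoding.

    Each step costs gamma draft passes (gamma * c) plus one target pass (1).
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * c + 1.0)

if __name__ == "__main__":
    print(expected_tokens_per_step(0.8, 4))
    print(expected_speedup(0.8, 4, 0.05))
```

With `alpha = 0.8`, `gamma = 4`, and `c = 0.05`, this gives roughly 3.36 tokens per step and a speedup of about 2.8x. Raising `gamma` helps only while the extra accepted tokens outweigh the extra draft cost, which is why the acceptance rate matters more than the draft length.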
Real-World Implementations
Hugging Face Transformers (Assisted Generation)
vLLM Implementation
N-gram Prompt Lookup: A Clever Trick
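The idea behind prompt lookup is that no draft model is needed when the output is likely to copy spans from the input: the most recent tokens are matched against earlier text, and whatever followed the match is proposed as the draft. A minimal sketch, assuming tokens are plain Python list items and using a hypothetical function name and parameters (`ngram_size`, `num_draft`):

```python
def prompt_lookup_draft(tokens, ngram_size=3, num_draft=5):
    """Propose draft tokens by matching the trailing n-gram earlier in `tokens`.

    Returns up to `num_draft` tokens that followed the most recent earlier
    occurrence of the last `ngram_size` tokens, or [] if there is no match.
    """
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan right to left so the most recent earlier match wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return tokens[follow:follow + num_draft]
    return []

if __name__ == "__main__":
    text = "the quick brown fox jumps over the quick brown".split()
    print(prompt_lookup_draft(text, ngram_size=3, num_draft=2))
```

The proposed tokens still go through the normal verification step, so a bad match costs only wasted draft positions and never changes the output. This tends to pay off on tasks such as summarization, code editing, or question answering over a provided document, where long spans are repeated verbatim.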
Performance Benchmarks
Model Pair | Task | Speedup
Factors Affecting Speedup
Trade-offs and Limitations
Memory Overhead
When Speculative Decoding Doesn't Help
Draft Model Selection Tips
Target Model | Good Draft Model | Notes
Advanced Techniques
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)
Medusa: Multiple Heads
Self-Speculative Decoding
The Research Papers
Google: "Fast Inference from Transformers via Speculative Decoding"
DeepMind: "Accelerating LLM Decoding with Speculative Sampling"
Practical Recommendations
When to Use Speculative Decoding
Quick Start Checklist
Conclusion
Key Takeaways
References