Speculative decoding
Speculative decoding is an inference acceleration technique for autoregressive generation models. It reduces latency by having a smaller draft model propose several tokens, which a larger target model then verifies in parallel.
Core Mechanism
- Drafting: A computationally cheap model generates a sequence of candidate tokens.
- Parallel Verification: The target model processes all candidates in a single forward pass, evaluating likelihoods simultaneously.
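- Acceptance/Rejection: Each draft token x is accepted with probability min(1, p(x)/q(x)), where p and q are the target and draft probabilities of x. At the first rejection, the remaining draft tokens are discarded and a replacement token is sampled from the normalized residual max(0, p - q). Under greedy decoding this reduces to keeping draft tokens up to the first one that differs from the target's top choice.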
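- Compute Trade-off: Each expensive target forward pass can yield several tokens instead of one, which reduces the number of sequential target calls. With exact rejection sampling the output distribution matches that of the target model alone, so quality is preserved; the speedup depends on the acceptance rate and on how cheap the drafter is relative to the target.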
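
The steps above can be sketched as a single verification step. This is a minimal illustration of the standard speculative sampling rule, not code from any particular engine: the function name `speculative_step` and its argument layout are hypothetical, and the probability tables are assumed to have already been produced by the draft and target models.

```python
import random

def speculative_step(p_target, q_draft, draft_tokens, rng):
    """Verify k draft tokens against the target model's distributions.

    p_target: k + 1 target distributions (lists of probabilities), one per
              draft position plus one for the position after the last draft.
    q_draft:  k draft distributions the draft tokens were sampled from.
    Returns the tokens to emit: between 1 and k + 1 of them.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_target[i], q_draft[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q)
        # (rng.choices normalizes the weights) and drop later drafts.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        emitted.append(rng.choices(range(len(p)), weights=residual)[0])
        return emitted
    # Every draft token was accepted: sample one bonus token from the
    # target distribution at the next position.
    bonus = p_target[len(draft_tokens)]
    emitted.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return emitted
```

When the draft and target distributions agree exactly, every draft token is accepted and the step emits k + 1 tokens from one target pass; when they disagree, the residual resampling is what keeps the output distributed as if the target model had sampled alone.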
Key Variants
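- Multi-Token Prediction (MTP): The drafter predicts several future tokens per step, typically through additional prediction heads or layers, so each verification pass has a longer candidate window to accept and a higher potential speedup.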
- Self-speculative decoding: Uses the target model's own internal states or cached predictions for drafting, so no separate draft model is needed.
- Early-exit speculative decoding: Uses the output of intermediate layers of the target model as the draft prediction, which the full model then verifies.
- Stacked/N-gram Hybrid: Combines MTP with n-gram lookups to maximize draft acceptance rates; particularly effective in local inference engines such as llama.cpp.
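
The n-gram lookup used in the hybrid variant can be illustrated with a small drafter that needs no model at all. This is a simplified sketch under stated assumptions, not llama.cpp's implementation: the function name `ngram_draft` and the parameters `n` and `max_draft` are hypothetical. It looks for the most recent earlier occurrence of the last `n` tokens and proposes whatever followed it.

```python
def ngram_draft(tokens, n=2, max_draft=4):
    """Propose up to max_draft tokens by matching the trailing n-gram.

    tokens: token ids generated so far (prompt plus output).
    Returns a possibly empty list of draft tokens.
    """
    if len(tokens) < n:
        return []
    key = tokens[-n:]
    # Scan backwards, skipping the trailing occurrence of the key itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            # Draft the tokens that followed the earlier occurrence.
            return tokens[start + n:start + n + max_draft]
    return []
```

An empty result means no earlier match was found, in which case a stacked setup could fall back to another drafter such as MTP. Drafts from either source still go through the same target-model verification.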
Implementations & Resources
- Google Gemma-4 MTP Drafters
- MTP + Ngram Stacked Speculative Decoding in Llama.cpp for LLM Inference: Demonstrates stacking MTP with n-gram predictions in llama.cpp to improve tokens-per-second throughput (e.g., 56 tok/s on Qwen3.6 27B).