Multi-Token Prediction (MTP)
Multi-Token Prediction (MTP) is a technique in large language model (LLM) inference in which the model proposes several future tokens at once instead of generating them strictly one at a time. Because the proposed tokens can be checked in parallel, MTP can reduce latency. It is often paired with speculative decoding, which verifies the proposals efficiently.
Key Characteristics
- Parallelism: Predicts a sequence of tokens in a single forward pass.
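- Verification: The proposed tokens are checked against the target model's own next-token predictions. The longest agreeing prefix is kept and the rest is discarded. Under greedy decoding the test is an exact match; under sampling it is a probabilistic acceptance test.
- Efficiency: Raises throughput (tokens/second) when the acceptance rate is high. When most proposals are rejected, the drafting and verification work is wasted and the gain shrinks or disappears.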
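The verify-and-accept step can be illustrated with a minimal sketch. It assumes greedy decoding and a toy stand-in for the target model. The names `verify_draft`, `toy_target_next`, and `expected_tokens_per_step` are hypothetical and not taken from any library. Real implementations score all drafted positions in one batched forward pass; the loop below calls the target once per position only to keep the logic readable. The throughput helper uses the standard simplification that each drafted token is accepted independently with the same probability, which real workloads only approximate.

```python
from typing import Callable, List, Sequence


def verify_draft(
    context: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens emitted by one draft-and-verify step (greedy case).

    Draft tokens are accepted while they equal the target's greedy choice.
    At the first mismatch the target's token is emitted instead and the
    rest of the draft is dropped. If every draft token is accepted, one
    extra token from the target is appended.
    """
    ctx = list(context)
    emitted: List[int] = []
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            emitted.append(expected)  # correction from the target
            return emitted
        emitted.append(tok)
        ctx.append(tok)
    emitted.append(target_next(ctx))  # bonus token after full acceptance
    return emitted


def toy_target_next(ctx: List[int]) -> int:
    """Hypothetical target model: always predicts (last token + 1) mod 10."""
    return (ctx[-1] + 1) % 10


def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens per verification step for k drafted tokens.

    Assumes each draft token is accepted independently with probability
    alpha (a simplification): sum of alpha**i for i = 0..k.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

With the toy target, `verify_draft([1, 2, 3], [4, 5, 9], toy_target_next)` returns `[4, 5, 6]`: two draft tokens are accepted and the third is replaced by the target's choice. A fully correct draft `[4, 5, 6]` yields `[4, 5, 6, 7]`. Under the independence assumption, an acceptance rate of 0.8 with 3 drafted tokens gives about 2.95 tokens per verification step.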
Integration with Speculative Decoding
MTP is frequently combined with other speculative decoding strategies to maximize inference speed:
- N-gram Stacking: Combines MTP with simple N-gram lookups, which cheaply propose token sequences that repeat patterns already seen in the context.
- Implementation: Tools like llama.cpp support stacking these methods to leverage both neural prediction and heuristic speedups.
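The stacking idea can be sketched as follows. This is an illustration under stated assumptions, not llama.cpp's actual logic: `ngram_propose` and `stacked_propose` are hypothetical names, the MTP proposer is passed in as a plain callable, and the "try the N-gram lookup first, fall back to MTP" policy is only one possible way to combine the two.

```python
from typing import Callable, List, Sequence


def ngram_propose(history: Sequence[int], n: int = 2, max_draft: int = 4) -> List[int]:
    """Propose a draft by copying what followed an earlier match.

    Looks for the most recent earlier occurrence of the last n tokens
    and returns up to max_draft tokens that came right after it.
    Returns an empty list when there is no earlier occurrence.
    """
    if len(history) <= n:
        return []
    key = tuple(history[-n:])
    # Scan backwards, skipping the suffix itself.
    for start in range(len(history) - n - 1, -1, -1):
        if tuple(history[start:start + n]) == key:
            return list(history[start + n:start + n + max_draft])
    return []


def stacked_propose(
    history: Sequence[int],
    mtp_propose: Callable[[Sequence[int]], List[int]],
    n: int = 2,
    max_draft: int = 4,
) -> List[int]:
    """Hypothetical stacking policy: cheap N-gram lookup first, MTP otherwise."""
    draft = ngram_propose(history, n=n, max_draft=max_draft)
    return draft if draft else mtp_propose(history)
```

For the history `[5, 6, 7, 8, 5, 6]` with `n=2` and `max_draft=2`, the lookup finds the earlier `5, 6` and proposes `[7, 8]`. When the context contains no repeat, `stacked_propose` returns whatever the supplied MTP callable proposes. Either way, the draft still has to pass verification by the target model before any token is emitted.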
Related Resources
- MTP + Ngram Stacked Speculative Decoding in Llama.cpp for LLM Inference: Case study demonstrating Qwen3.6 27B achieving 56 tok/s locally via stacked MTP and N-gram decoding.