Token Generation Speed
Token generation speed, usually measured in tokens per second (TPS), describes how quickly a large language model (LLM) produces output during autoregressive inference. Slow generation is often the main bottleneck for both user experience and system throughput.
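As a rough illustration of how the metric can be computed, the sketch below derives two TPS figures from request timestamps. The `GenerationTiming` record and its field names are hypothetical, not part of any particular inference server's API.

```python
from dataclasses import dataclass


@dataclass
class GenerationTiming:
    """Hypothetical timing record for one request (all times in seconds)."""
    request_start: float    # when the prompt was submitted
    first_token: float      # when the first output token arrived
    last_token: float       # when the final output token arrived
    n_output_tokens: int    # number of generated tokens


def decode_tps(t: GenerationTiming) -> float:
    """Decode-only rate: tokens after the first, over the time they took."""
    if t.n_output_tokens < 2 or t.last_token <= t.first_token:
        raise ValueError("need at least two tokens and a positive decode time")
    return (t.n_output_tokens - 1) / (t.last_token - t.first_token)


def end_to_end_tps(t: GenerationTiming) -> float:
    """Overall rate, including the wait for the first token."""
    if t.last_token <= t.request_start:
        raise ValueError("total time must be positive")
    return t.n_output_tokens / (t.last_token - t.request_start)


# Illustrative numbers only: 0.5 s to first token, then 40 more tokens in 2.0 s.
timing = GenerationTiming(request_start=0.0, first_token=0.5,
                          last_token=2.5, n_output_tokens=41)
decode_rate = decode_tps(timing)        # 40 tokens / 2.0 s
overall_rate = end_to_end_tps(timing)   # 41 tokens / 2.5 s
```

Reporting the two figures separately avoids mixing the wait for the first token into the steady-state generation rate.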
Key Determinants
- Hardware Acceleration: GPU memory (VRAM) bandwidth and compute units (e.g., Tensor Cores) largely determine speed.
- Model Architecture: Parameter count, context length, and the attention mechanism all affect latency (see inference optimization).
- Quantization: Lower-precision weight formats (e.g., Q4_0 in GGUF) reduce VRAM usage and can increase throughput, at the cost of some potential accuracy loss.
- Prompt Processing: Generation has two phases with different speeds: the “prefill” phase, which processes the whole prompt in parallel, and the “decoding” phase, which produces output tokens one at a time.
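The link between memory bandwidth, model size, and quantization can be sketched with a back-of-the-envelope bound. It assumes single-stream decoding in which every generated token requires reading all model weights once, and it ignores KV-cache traffic, compute time, and runtime overhead, so real speeds will be lower. The 7B parameter count and 1000 GB/s bandwidth are hypothetical example values; 4.5 bits per weight is the approximate cost of Q4_0 once its per-block scales are counted.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in bytes."""
    return n_params * bits_per_weight / 8.0


def decode_tps_upper_bound(n_params: float,
                           bits_per_weight: float,
                           bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-bound ceiling on decode speed at batch size 1.

    Assumes each token reads every weight exactly once; KV-cache reads,
    compute, and framework overhead are ignored.
    """
    return bandwidth_bytes_per_s / weight_bytes(n_params, bits_per_weight)


# Hypothetical example values, not measurements.
N_PARAMS = 7e9        # 7B-parameter model
BANDWIDTH = 1000e9    # 1000 GB/s of memory bandwidth

fp16_bound = decode_tps_upper_bound(N_PARAMS, 16.0, BANDWIDTH)
q4_0_bound = decode_tps_upper_bound(N_PARAMS, 4.5, BANDWIDTH)
speedup = q4_0_bound / fp16_bound   # equals 16 / 4.5 under this model
```

Under these assumptions the ceiling rises from about 71 to about 254 tokens per second, which is why quantization tends to help decode speed on bandwidth-limited hardware.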
Optimization Techniques
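- Speculative Decoding: In speculative decoding, a smaller draft model proposes several tokens ahead, and the main model verifies them together in a single forward pass. Each accepted draft token saves a full forward pass of the main model, which increases throughput.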
- Multi-Token Prediction (MTP): Recent advancements in llama.cpp include native support for Multi-Token Prediction, in which the model is trained to predict several upcoming tokens at once rather than strictly one at a time. This reduces the number of forward passes required for a given output length.
- See: “Llama.cpp Multi-Token Prediction: Faster Local LLM Inference Explained” for details on the reported 2x speedup potential and implementation specifics.
- Batching: Increasing batch size for non-interactive workloads improves GPU utilization.
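The draft-then-verify idea behind speculative decoding can be shown with a toy greedy loop. This is a minimal sketch, not how llama.cpp or any other runtime implements it: the "models" are plain Python functions invented for illustration, and the single batched verification pass of a real system is simulated by counting one pass per round.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_greedy(target_next: NextToken, draft_next: NextToken,
                       prompt: List[Token], n_new: int,
                       k: int) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (sequence, target pass count)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. One (simulated) target pass checks every drafted position.
        target_passes += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first wrong guess
                break
        else:
            # Every draft token matched: the same pass yields a bonus token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], target_passes


def toy_target(seq: List[Token]) -> Token:
    """Toy 'large' model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[Token]) -> Token:
    """Toy 'small' model: agrees with the target except after a 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


def plain_greedy(next_token: NextToken, prompt: List[Token],
                 n_new: int) -> List[Token]:
    """Baseline: one target pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq


spec_out, passes = speculative_greedy(toy_target, toy_draft, [0], n_new=12, k=3)
baseline_out = plain_greedy(toy_target, [0], n_new=12)
```

With greedy verification the output matches target-only decoding exactly. The gain comes from needing fewer target passes than generated tokens, and it shrinks as the draft model's guesses are rejected more often.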
References
- llama.cpp documentation on the MTP implementation.
- Tim Carambat’s analysis of MTP integration in 2026.