Token Generation Speed

Token generation speed, usually measured in tokens per second (TPS), is the rate at which a large language model (LLM) produces output during autoregressive inference. Because each new token depends on the tokens generated before it, decoding is inherently sequential, so low generation speed is a primary bottleneck for both user experience and system throughput.
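As a rough illustration of how TPS can be measured, the sketch below times any iterable that yields tokens one at a time. The function name `measure_generation`, the injectable `clock` parameter, and the split between time to first token and steady-state decode rate are assumptions made for this example, not part of any particular inference library.

```python
import time
from typing import Callable, Dict, Iterable


def measure_generation(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time a token stream and report simple speed metrics.

    `stream` is a hypothetical stand-in for a model's streaming output:
    any iterable that yields one token per iteration.
    """
    start = clock()
    first = None  # timestamp of the first token
    last = start  # timestamp of the most recent token
    count = 0

    for _ in stream:
        last = clock()
        if first is None:
            first = last
        count += 1

    total = last - start
    # Decode rate excludes the wait for the first token, which also
    # covers prompt processing and would otherwise skew the figure.
    decode_window = (last - first) if first is not None else 0.0

    return {
        "tokens": count,
        "time_to_first_token_s": (first - start) if first is not None else 0.0,
        "overall_tps": count / total if total > 0 else 0.0,
        "decode_tps": (count - 1) / decode_window if decode_window > 0 else 0.0,
    }
```

Passing a fake `clock` makes the measurement deterministic for testing; in real use the default `time.perf_counter` applies and `stream` would be whatever token iterator the serving stack exposes.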

Key Determinants

Hardware Acceleration: GPU memory (VRAM) bandwidth and compute units (e.g., Tensor Cores) largely determine how fast each token can be produced.
Model Architecture: Parameter count, attention mechanism, and context length set the compute and memory traffic required per token.
Speculative Decoding: Techniques in which several candidate tokens are proposed cheaply and then verified by the full model in a single pass, reducing the number of sequential decoding steps.
- Multi-Token Prediction (MTP): Predicts multiple future tokens simultaneously, boosting throughput in models like Qwopus Coder.
- DeepSeek DSpark: A specialized speed layer introduced by DeepSeek that uses enhanced speculative decoding to accelerate inference. As demonstrated with Qwen3 models, DSpark can roughly double generation speed by optimizing the proposal and verification phases of speculative decoding. See DeepSeek DSpark: LLM Inference Acceleration via Enhanced Speculative Decoding for a detailed analysis.
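The determinants above can be turned into back-of-envelope estimates. The sketch below is a simplified model, not a description of how MTP or DSpark work internally: it assumes single-stream decoding is limited by reading all model weights once per token, and that each drafted token is accepted independently with the same probability. All function names and the example numbers are hypothetical.

```python
def bandwidth_bound_tps(param_count: float, bytes_per_param: float,
                        bandwidth_gb_s: float) -> float:
    """Rough ceiling on single-stream decode speed.

    Assumes every weight is read from memory once per generated token
    and ignores KV-cache traffic, batching, and compute limits.
    """
    bytes_per_token = param_count * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token


def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per full-model verification pass.

    Assumes each drafted token is accepted independently with
    probability `acceptance_rate`; the verifier always contributes
    one token of its own, so the result is at least 1.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


def estimated_speedup(acceptance_rate: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    `draft_cost_ratio` is the cost of proposing one token relative to
    one full-model pass (an assumed, workload-specific number).
    """
    step_cost = draft_len * draft_cost_ratio + 1.0
    return expected_tokens_per_step(acceptance_rate, draft_len) / step_cost


if __name__ == "__main__":
    # Illustrative inputs only: 7e9 parameters at 2 bytes each, 1000 GB/s.
    print(bandwidth_bound_tps(7e9, 2, 1000))
    # Illustrative inputs only: 80% acceptance, 4 drafted tokens, 5% draft cost.
    print(estimated_speedup(0.8, 4, 0.05))
```

Under these assumptions, a speedup near 2x requires a reasonably high acceptance rate combined with a cheap proposal step; measured results depend on the model, prompt distribution, and hardware.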

References

DeepSeek DSpark: LLM Inference Acceleration via Enhanced Speculative Decoding
