Generated: 2026-05-20 · API: Gemini 2.5 Flash · Modes: Summary


MTP + Ngram Stacked Speculative Decoding in llama.cpp for LLM Inference

Clip title: MTP + Ngram Stacked in llama.cpp - Qwen3.6 27B at 56 tok/s Locally
Author / channel: Fahd Mirza
URL: https://www.youtube.com/watch?v=71T4aQiwZ_I

Summary

This video demonstrates how to speed up Large Language Model (LLM) inference in llama.cpp by combining two speculative decoding methods: Multi-Token Prediction (MTP) and Ngram-mod. The problem it addresses is that standard LLM inference generates one token at a time, and each token requires a full forward pass through billions of parameters. Speculative decoding reduces that cost by using a cheaper "draft" mechanism to guess several tokens ahead, which the full model then verifies in a single pass. Drafted tokens that match what the full model would have produced are kept; the first mismatch is discarded along with everything after it.
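
The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not llama.cpp's implementation: it assumes greedy decoding, the names `speculative_step`, `generate`, `toy_target` and `toy_draft` are hypothetical, and the per-position loop in the verify step stands in for what a real engine does in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(context: List[Token], target_next: NextTokenFn,
                     draft_next: NextTokenFn, k: int = 4) -> List[Token]:
    """Run one draft-then-verify step and return the tokens it commits."""
    # 1. Draft k tokens with the cheap mechanism.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    # 2. Verify against the target's greedy choice at each drafted position.
    #    (A real engine scores all positions in one batched pass.)
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # keep the target's token, drop the rest
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the same pass also yields one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(prompt: List[Token], target_next: NextTokenFn,
             draft_next: NextTokenFn, n_tokens: int, k: int = 4) -> List[Token]:
    """Generate n_tokens by repeating speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, target_next, draft_next, k))
    return out[len(prompt):len(prompt) + n_tokens]


def greedy(prompt: List[Token], target_next: NextTokenFn, n_tokens: int) -> List[Token]:
    """Plain one-token-at-a-time decoding, for comparison."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(prompt):]


# Toy "models": the target counts upward modulo 10; the draft agrees
# except right after a 4, where it guesses wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Because every committed token is either a verified draft or the target's own choice, `generate` returns the same sequence as `greedy` in this greedy setting; the draft only changes how many target passes are needed.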

The first method, Multi-Token Prediction (MTP), builds prediction "heads" directly into the main model's weights. No separate, smaller draft model needs to be downloaded or loaded, and the extra VRAM cost is minimal. In a previous video, MTP alone raised the Qwen3.6-27B model's throughput from 22 to 42 tokens per second (TPS), roughly a 1.9x speedup, with output that the video describes as mathematically identical to standard decoding.
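
As a rough way to reason about speedups like the one above, the standard back-of-the-envelope model for speculative decoding can be written down directly. It assumes each drafted token is accepted independently with the same probability, which real workloads only approximate; the function name `expected_tokens_per_pass` is hypothetical, and the video does not report the acceptance rate or draft length used with MTP.

```python
def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens committed per verification pass.

    Assumes each of the draft_len drafted tokens is accepted independently
    with probability accept_prob. The pass always commits one token from the
    full model (a correction or a bonus token), plus the accepted prefix:
        1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a)
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_prob == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


# Example: measured speedup from the figures quoted in the text.
mtp_speedup = 42 / 22
```

The value returned is an upper bound on the throughput gain, since drafting and verifying extra positions are not free; measured speedups such as 42 / 22 will sit below it.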

The second method, Ngram-mod, works on an entirely different principle. Unlike MTP, it uses no neural network for drafting. Instead, it scans the text already present in the current conversation for repeating patterns, such as common code snippets, variable names, or function signatures. When the most recent tokens match an earlier sequence, Ngram-mod proposes the tokens that followed that earlier occurrence as the draft. Because this is a text lookup rather than a model evaluation, it costs almost nothing compared with a forward pass, and according to the video it can reach acceptance rates above 90% on repetitive tasks like code editing.
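
The lookup idea can be illustrated with a deliberately simple sketch. This is not llama.cpp's ngram-mod implementation, whose internals the video does not detail; `ngram_draft`, its parameters `n` and `max_draft`, and the linear backward scan are all assumptions chosen for readability.

```python
from typing import List, Sequence


def ngram_draft(history: Sequence[int], n: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by matching the last n tokens against earlier text.

    Finds the most recent earlier occurrence of the final n-gram and returns
    up to max_draft tokens that followed it. Returns [] when nothing matches.
    """
    if len(history) <= n:
        return []
    key = tuple(history[-n:])
    # Scan backwards over earlier start positions, skipping the suffix itself.
    for start in range(len(history) - n - 1, -1, -1):
        if tuple(history[start:start + n]) == key:
            follow = start + n
            return list(history[follow:follow + max_draft])
    return []


# Token IDs standing in for repeated code: the sequence 1, 2, 3 appeared
# earlier and was followed by 4, 5, so that continuation is proposed again.
draft = ngram_draft([1, 2, 3, 4, 5, 1, 2, 3], n=3, max_draft=2)
```

A production version would typically index n-grams in a hash table instead of rescanning the history on every step, but the proposal logic is the same: no model call is needed to produce the draft.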

The main point of the video is that MTP and Ngram-mod can be stacked. MTP covers the parts of the generation where the model is producing novel content and there is no recognizable pattern. Ngram-mod takes over when it finds a pattern in the ongoing text and proposes drafts from that earlier context.

The demonstration runs both methods on a Qwen3.6-27B model locally on an NVIDIA RTX A6000. With MTP and Ngram-mod enabled through llama.cpp server flags, throughput reached 56.6 tokens per second. That is about 2.6 times the 22 TPS baseline without either method, and roughly 35% faster than the 42 TPS achieved with MTP alone. The setup uses mainline llama.cpp, with no custom forks and no additional model downloads.
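
One way to picture the stacking is as a priority list of draft sources: try the pattern lookup first and fall back to the model's own heads when it finds nothing. The sketch below makes that assumption explicit; how llama.cpp actually arbitrates between the two is not described in the summary, and `stacked_draft`, `lookup_drafter` and `fake_mtp_drafter` are hypothetical names, with the last one a stand-in that does not model real MTP heads.

```python
from typing import Callable, List, Sequence, Tuple

Drafter = Callable[[Sequence[int]], List[int]]


def lookup_drafter(history: Sequence[int], n: int = 2, max_draft: int = 4) -> List[int]:
    """Pattern lookup: reuse what followed the last n tokens earlier on."""
    if len(history) <= n:
        return []
    key = list(history[-n:])
    for start in range(len(history) - n - 1, -1, -1):
        if list(history[start:start + n]) == key:
            return list(history[start + n:start + n + max_draft])
    return []


def fake_mtp_drafter(history: Sequence[int]) -> List[int]:
    """Stand-in for model-based drafting: always guesses 'last token + 1'."""
    return [history[-1] + 1] if history else []


def stacked_draft(history: Sequence[int],
                  drafters: List[Tuple[str, Drafter]]) -> Tuple[str, List[int]]:
    """Return (source name, draft) from the first drafter that proposes tokens."""
    for name, drafter in drafters:
        draft = drafter(history)
        if draft:
            return name, draft
    return "none", []


stack = [("ngram", lookup_drafter), ("mtp", fake_mtp_drafter)]

# Repeated text: the lookup finds "7, 8" earlier and proposes what followed.
repeated = stacked_draft([7, 8, 9, 7, 8], stack)

# Novel text: no earlier match, so the model-based drafter is used instead.
novel = stacked_draft([1, 2, 3], stack)
```

Whichever source produces the draft, the full model still verifies it, so the choice of drafter affects speed but not which tokens are ultimately committed.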

Description

Stack MTP and ngram-mod together in mainline llama.cpp and Qwen3.6-27B jumps from 22 to 56 tokens per second with no extra models and no custom builds.

#llamacpp #mtp #multitokenprediction #speculativedecoding #ngrammod

PLEASE FOLLOW ME:
▶ LinkedIn: https://www.linkedin.com/in/fahdmirza/
▶ YouTube: https://www.youtube.com/@fahdmirza
▶ Blog: https://www.fahdmirza.com

RESOURCES:

https://github.com/ggml-org/llama.cpp/pull/22673

All rights reserved © Fahd Mirza

URLs