MTP + Ngram Stacked Speculative Decoding in Llama.cpp for LLM Inference

Generated: 2026-05-20 · API: Gemini 2.5 Flash · Modes: Summary

Clip title: MTP + Ngram Stacked in llama.cpp - Qwen3.6 27B at 56 tok/s Locally
Author / channel: Fahd Mirza
URL: https://www.youtube.com/watch?v=71T4aQiwZ_I

Summary

This video shows how to speed up Large Language Model (LLM) inference in llama.cpp by combining two speculative decoding methods: Multi-Token Prediction (MTP) and Ngram-mod. Standard inference is slow because the model generates one token at a time, and each token requires a full forward pass through billions of parameters. Speculative decoding reduces that cost: a faster “draft” mechanism guesses several tokens ahead, and the larger, more accurate model then verifies them in a single pass.
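The draft-then-verify loop can be sketched with toy stand-ins for the two models. This is a minimal illustration under stated assumptions, not llama.cpp's implementation: `target_next`, `draft_next`, `speculative_step`, and `generate` are hypothetical names, decoding is greedy, and the verification is written as a loop where a real engine would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Return the tokens accepted in one draft-then-verify step (greedy)."""
    # 1. The cheap drafter guesses k tokens ahead.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    # 2. The target checks each guess in order. A real engine would do this
    #    for all k positions in a single batched pass; this toy loops instead.
    accepted: List[Token] = []
    for guess in draft:
        expected = target_next(context + accepted)
        if guess != expected:
            accepted.append(expected)  # keep the target's token, drop the rest
            return accepted
        accepted.append(guess)

    # 3. Every guess matched, so the target's next token comes as a bonus.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(target_next: NextToken, draft_next: NextToken,
             prompt: List[Token], n_tokens: int, k: int = 3) -> List[Token]:
    """Generate n_tokens after the prompt using repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target_next, draft_next, out, k))
    return out[:len(prompt) + n_tokens]


# Toy "models": the target counts upward mod 10; the drafter agrees with it
# except right after a 4, where it guesses wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

In this greedy toy, a wrong guess only costs the discarded draft tokens: the sequence produced is the one the target would have generated on its own.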

The first method, Multi-Token Prediction (MTP), builds prediction “heads” directly into the main model’s weights. No separate, smaller draft model has to be downloaded or loaded, and the extra VRAM cost is minimal. In a previous video, MTP alone raised the Qwen3.6-27B model’s throughput from 22 to 42 tokens per second (TPS), with output that the video describes as mathematically identical in quality.

The second method, Ngram-mod, works on an entirely different principle. It uses no neural network for drafting. Instead, it scans the text already generated in the current conversation for repeating patterns such as common code snippets, variable names, or function signatures. When it finds one, it proposes a sequence of draft tokens based on that pattern. Because this is a plain text lookup, drafting costs almost nothing, and acceptance rates can exceed 90% for repetitive tasks like code editing.
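The lookup idea can be illustrated with a short sketch. This is a generic n-gram lookup drafter written for illustration, assuming token IDs are plain integers; `ngram_draft` and its parameters are hypothetical, and the actual Ngram-mod code in llama.cpp may be organized quite differently.

```python
from typing import List, Sequence


def ngram_draft(history: Sequence[int], n: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by finding the last n tokens earlier in history."""
    if len(history) <= n:
        return []  # not enough text to contain an earlier occurrence
    key = tuple(history[-n:])
    # Walk backwards so the most recent earlier occurrence wins. Starting at
    # len - n - 1 skips the trailing n-gram itself.
    for start in range(len(history) - n - 1, -1, -1):
        if tuple(history[start:start + n]) == key:
            follow = start + n
            # Whatever followed the pattern last time becomes the draft.
            return list(history[follow:follow + max_draft])
    return []  # no repeat found, so there is nothing to propose
```

For example, with the history `[1, 2, 3, 4, 5, 9, 1, 2, 3]` and `n=3`, the trailing `1, 2, 3` also appears at the start, so the tokens that followed it there are proposed as the draft. No model is evaluated to produce the draft, which is why this kind of drafting is so cheap.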

The key idea in this video is stacking MTP and Ngram-mod so that each covers the other’s gap. MTP handles the parts of the generation where the model is producing novel content and there is no recognizable pattern. Ngram-mod takes over whenever it finds a pattern in the ongoing text and proposes drafts from that earlier context.

The demonstration runs Qwen3.6-27B locally on an NVIDIA RTX A6000. With MTP and Ngram-mod enabled through llama.cpp server flags, inference speed reached 56.6 tokens per second. That is about 2.6 times the 22 TPS baseline without either method, and about 35% faster than the 42 TPS achieved with MTP alone. The setup uses mainline llama.cpp, with no custom forks or additional model downloads.
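One way to picture the hand-off between the two drafters is a simple priority list: try the cheap lookup first, and fall back to the model-based drafter when no pattern is found. The sketch below is an assumption about how such stacking could be arranged, not a description of how llama.cpp arbitrates between them. `lookup_draft`, `mtp_stub`, and `stacked_draft` are hypothetical names, and `mtp_stub` merely stands in for real MTP heads, which live inside the model.

```python
from typing import Callable, List, Sequence, Tuple

Drafter = Callable[[Sequence[int]], List[int]]


def lookup_draft(history: Sequence[int], n: int = 2, max_draft: int = 4) -> List[int]:
    """Hypothetical n-gram drafter: reuse what followed the last n tokens before."""
    key = tuple(history[-n:])
    for start in range(len(history) - n - 1, -1, -1):
        if tuple(history[start:start + n]) == key:
            return list(history[start + n:start + n + max_draft])
    return []


def mtp_stub(history: Sequence[int]) -> List[int]:
    """Stand-in for model-based drafting; it just guesses the next two integers."""
    return [history[-1] + 1, history[-1] + 2] if history else []


def stacked_draft(history: Sequence[int],
                  drafters: List[Tuple[str, Drafter]]) -> Tuple[str, List[int]]:
    """Return (drafter name, draft) from the first drafter that proposes tokens."""
    for name, drafter in drafters:
        draft = drafter(history)
        if draft:
            return name, draft
    return "none", []


# The lookup goes first because it needs no model evaluation.
DRAFTERS: List[Tuple[str, Drafter]] = [("ngram", lookup_draft), ("mtp", mtp_stub)]
```

Calling `stacked_draft([5, 6, 7, 5, 6], DRAFTERS)` uses the lookup, because `5, 6` has appeared before, while `stacked_draft([1, 2, 3], DRAFTERS)` falls through to the stub, because nothing repeats. Either way, the main model still verifies whatever draft is proposed.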

Video Description & Links

Description

Stack MTP and ngram-mod together in mainline llama.cpp and Qwen3.6-27B jumps from 22 to 56 tokens per second with no extra models and no custom builds.

llamacpp mtp multitokenprediction speculativedecoding ngrammod

PLEASE FOLLOW ME:
▶ LinkedIn: https://www.linkedin.com/in/fahdmirza/
▶ YouTube: https://www.youtube.com/@fahdmirza
▶ Blog: https://www.fahdmirza.com

RESOURCES:

▶ https://github.com/ggml-org/llama.cpp/pull/22673
