MoE AI Model

Mixture of Experts (MoE) is a conditional computation architecture for [[concepts/large-language-model]]s that combines multiple Expert Networks with a Gating Network that routes tokens dynamically. Only a sparse subset of experts processes each input token, so total parameter count can grow with the number of experts while compute cost per token stays bounded.

Architecture & Mechanics

Sparse Activation: Top-k routing selects k experts per token; experts that are not selected consume no compute for that token, though their weights must still be stored.
Parameter/Compute Decoupling: Total parameter count grows linearly with the number of experts (and with each expert's depth/width), while per-token FLOPs remain proportional to active parameters.
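Routing Strategy: A learnable gate scores the experts for each token. It is typically trained with an auxiliary load-balancing objective to prevent “expert collapse” (most tokens routed to a few experts) and to ensure coverage of diverse feature spaces.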
Shared Weights: Dense backbone layers, shared by every token, handle general representation; expert layers capture specialized sub-patterns.
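
The routing mechanics above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the function names (`top_k_route`, `moe_forward`), the toy experts, and the choice to renormalize the selected experts' softmax weights so they sum to 1 are all assumptions made for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are the softmax probabilities of the chosen experts,
    renormalized to sum to 1 (one common convention, assumed here).
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; other experts are never called."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Hypothetical toy experts: each one scales the input by a different constant.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
logits = [0.1, 2.0, 0.3, 2.0]  # experts 1 and 3 share the highest score

routes = top_k_route(logits, k=2)
output = moe_forward(token, logits, experts, k=2)
print(routes)
print(output)
```

With these inputs only experts 1 and 3 run, each with weight 0.5, so the output is the token scaled by 3. In a real model the router logits come from a learned linear layer applied to the token's hidden state rather than being supplied by hand.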

Advantages

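Scalability: Sparsity supports massive model capacity on limited hardware; only the active experts' weights are needed for a given token, so the active footprint can fit within constrained VRAM, although the full expert set must still be stored somewhere (for example in system RAM).
Inference Efficiency: Each token reads only the active experts' weights, so memory bandwidth pressure is lower than for a dense model of equivalent total parameter count.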
Specialization: Experts can diverge to optimize for distinct domains, modalities, or reasoning tasks.

Challenges

Routing Overhead: Latency is sensitive to the efficiency of the routing logic and to load imbalance across experts.
Communication Costs: Distributed training with experts placed on different devices requires frequent cross-device exchange of tokens; throughput is bottlenecked by interconnect bandwidth.
Quantization Sensitivity: Routing logits often require higher precision than weights to maintain top-k selection accuracy under aggressive [[concepts/model-quantization]].
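
Load imbalance can be quantified with an auxiliary loss of the kind used in Switch Transformer, N · Σ f_i · P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. The sketch below assumes top-1 routing and omits the scaling coefficient that is normally applied before adding the term to the training loss; the function name and the sample inputs are hypothetical.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-Transformer-style auxiliary loss: N * sum_i f_i * P_i.

    router_probs: per-token list of probabilities over experts.
    assignments:  per-token index of the expert the token was sent to (top-1).
    """
    num_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i.
    f = [0.0] * num_experts
    for a in assignments:
        f[a] += 1.0 / num_tokens
    # P_i: mean router probability assigned to expert i.
    p = [sum(tok[i] for tok in router_probs) / num_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly balanced routing: uniform probabilities, one token per expert.
balanced = load_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)

# Collapsed routing: every token goes to expert 0 with high probability.
collapsed = load_balance_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0],
                              num_experts=4)
print(balanced, collapsed)
```

The loss is 1.0 when routing is uniform and approaches the number of experts as routing collapses onto a single expert, which is what makes it usable as a penalty during training.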

Recent Implementations & Performance

Qwen 3.6 35B-A3B: 35B total parameters with approximately 3B active parameters per token, so under a tenth of the weights participate in any single token's forward pass.
Llama.cpp Optimization: Kernel-level optimizations enable fast inference of large MoE models on consumer-grade hardware.
Low-VRAM Deployment: Successful execution on 6GB VRAM using 8-year-old GPUs, validating MoE viability for edge and legacy hardware [[lab-notes/2026-05-10-Achieving-Fast-35B-MoE-AI-Model-Performance-on-6GB-VRAM|Achieving Fast 35B MoE AI Model Performance on 6GB VRAM with Llama.cpp]].
Active Parameter Ratio: A high total-to-active parameter ratio lets resource-constrained endpoints run models with foundation-scale capacity, because per-token inference latency tracks active rather than total parameters.
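
The arithmetic behind these figures can be made concrete with a rough weight-only estimate. The 35B and 3B counts come from the entry above; the 4 bits per weight is purely an assumption for illustration, since the quantization level and offload strategy used in the linked lab note are not stated here. The helper name `moe_footprint` is hypothetical.

```python
def moe_footprint(total_params, active_params, bits_per_weight):
    """Rough weight-only estimates.

    Ignores KV cache, activations, and quantization metadata,
    so real memory use will be higher.
    """
    bytes_per_weight = bits_per_weight / 8
    return {
        "active_fraction": active_params / total_params,
        "total_weight_gb": total_params * bytes_per_weight / 1e9,
        "active_weight_gb": active_params * bytes_per_weight / 1e9,
    }

# 35B total / 3B active, at an assumed 4 bits per weight.
est = moe_footprint(35e9, 3e9, bits_per_weight=4)
for name, value in est.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the full weights come to about 17.5 GB while the weights touched per token come to about 1.5 GB. The full set would not fit in 6 GB of VRAM, so part of it would have to live elsewhere, such as system RAM. Because the selected experts change from token to token, the active weights are not a fixed slice that can simply be pinned on the GPU.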

Related Concepts

Sparse Mixture of Experts
Switch Transformer
Token Routing
Inference Acceleration
Expert Balancing
