MoE AI Model
Mixture of Experts (MoE) is a conditional computation architecture for [[concepts/large-language-model]]s that combines multiple Expert Networks with a learned Gating Network that routes each input token to a sparse subset of experts. Because only that subset runs for a given token, total parameter count can grow with the number of experts while compute cost per token stays roughly bounded by the active parameters.
Architecture & Mechanics
- Sparse Activation: Top-k routing selects k experts per token; experts that are not selected consume no compute for that token, although their weights must still be stored.
- Parameter/Compute Decoupling: Total parameter count grows roughly linearly with the number of experts (and with each expert's width), while per-token FLOPs remain proportional to the active parameters.
- Routing Strategy: The learnable gate is typically trained with a load-balancing objective so that tokens spread across experts, preventing “expert collapse” (a few experts receiving most tokens) and ensuring coverage of diverse feature spaces.
- Shared Weights: Dense backbone layers handle general representation; expert layers capture specialized sub-patterns.
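The routing step described in the list above can be illustrated with a minimal pure-Python sketch. The function names (`top_k_route`, `moe_forward`) and the toy experts are hypothetical, and renormalizing the gate weights over the selected experts is only one common convention; real implementations differ in detail.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_indices, gate_weights) for a single token."""
    probs = softmax(router_logits)
    # Indices of the k largest probabilities, highest first.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the selected gate weights sum to 1 (an assumed convention).
    norm = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / norm for i in chosen]

def moe_forward(token, router_logits, experts, k=2):
    """Gate-weighted sum of the outputs of the k selected experts only."""
    chosen, gates = top_k_route(router_logits, k)
    out = [0.0] * len(token)
    for idx, gate in zip(chosen, gates):
        expert_out = experts[idx](token)  # unselected experts are never called
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, chosen

# Toy experts (hypothetical): each one scales the input by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
output, used = moe_forward([1.0, -1.0], [0.1, 2.0, -1.0, 1.5], experts, k=2)
print(used)  # [1, 3]
```

Only the two experts with the highest router scores are evaluated; the other two contribute no compute for this token.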
Advantages
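- Scalability: Sparsity supports large model capacity on limited hardware; the weights active for any one token are small enough to fit within constrained VRAM, even though the full set of expert weights must still be held in memory somewhere.
- Inference Efficiency: Each token reads only the active experts' weights, so memory bandwidth pressure is lower than for a dense model of equivalent total parameter count.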
- Specialization: Experts can diverge to optimize for distinct domains, modalities, or reasoning tasks.
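To make the gap between total and active parameters concrete, the sketch below counts weights in a single MoE feed-forward layer. The layer shape is an assumption for illustration (each expert is a two-matrix FFN, the router is one matrix, biases are ignored), and the dimensions are arbitrary rather than taken from any particular model.

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k):
    """Count weights in one MoE feed-forward layer (biases ignored).

    Assumes each expert is a two-matrix FFN (d_model x d_ff, d_ff x d_model)
    and the router is a single d_model x num_experts matrix.
    """
    per_expert = 2 * d_model * d_ff
    router = d_model * num_experts
    total = num_experts * per_expert + router   # what must be stored
    active = top_k * per_expert + router        # what one token actually uses
    return total, active

# Arbitrary illustrative dimensions; only num_experts changes between runs.
for n in (8, 64):
    total, active = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=n, top_k=2)
    print(f"experts={n:3d} total={total:,} active={active:,}")
```

Going from 8 to 64 experts multiplies the stored weights by about eight, while the per-token active count changes only by the slightly larger router.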
Challenges
- Routing Overhead: Latency is sensitive to the efficiency of the routing logic and to load imbalance across experts.
- Communication Costs: When experts are sharded across devices, distributed training requires frequent cross-device exchange of tokens between experts, which is bottlenecked by interconnect bandwidth.
- Quantization Sensitivity: Routing logits often require higher precision than weights to maintain top-k selection accuracy under aggressive [[concepts/model-quantization]].
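Load imbalance, mentioned above as a latency concern, is commonly discouraged during training with an auxiliary loss. The sketch below follows the form used in the Switch Transformer (N times the sum over experts of f_i times P_i, with the scaling coefficient omitted); the function name and the toy inputs are hypothetical, and top-1 assignment is assumed for simplicity.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-Transformer-style auxiliary loss: N * sum_i f_i * P_i.

    router_probs: one list of per-expert probabilities per token
    assignments:  the expert index chosen for each token (top-1 assumed)
    """
    n_tokens = len(assignments)
    loss = 0.0
    for i in range(num_experts):
        # f_i: fraction of tokens dispatched to expert i
        f_i = sum(1 for a in assignments if a == i) / n_tokens
        # P_i: mean router probability assigned to expert i
        p_i = sum(p[i] for p in router_probs) / n_tokens
        loss += f_i * p_i
    return num_experts * loss

# Evenly spread tokens versus every token sent to expert 0.
balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], 2)
collapsed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], 2)
print(balanced, collapsed)
```

The loss takes the value 1.0 under perfectly uniform routing and rises as tokens concentrate on fewer experts, which is what gives the gate a gradient signal toward balance.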
Recent Implementations & Performance
- Qwen 3.6 35B-A3B: 35B total parameters with approximately 3B active parameters per token, so under a tenth of the weights participate in each token's forward pass.
- Llama.cpp Optimization: Advanced kernel optimizations enable fast inference of large MoE structures on consumer-grade hardware.
- Low-VRAM Deployment: Successful execution on 6GB VRAM using 8-year-old GPUs, validating MoE viability for edge and legacy hardware [[lab-notes/2026-05-10-Achieving-Fast-35B-MoE-AI-Model-Performance-on-6GB-VRAM|Achieving Fast 35B MoE AI Model Performance on 6GB VRAM with Llama.cpp]].
- Active Parameter Ratio: High total-to-active parameter ratios facilitate running foundation-scale capacity on resource-constrained endpoints without proportional inference latency.
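A back-of-the-envelope estimate shows why the active parameter ratio matters for the 35B-total, 3B-active figures quoted above. The 4 bits per weight is an assumption chosen for illustration, not a detail from the linked lab note, and the estimate ignores the KV cache, activations, and quantization metadata.

```python
def weight_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB for a given precision."""
    return num_params * bits_per_weight / 8 / 2**30

TOTAL_PARAMS = 35e9    # from the figures above
ACTIVE_PARAMS = 3e9    # approximate active parameters per token
BITS = 4               # assumed quantization level, for illustration only

total_gib = weight_gib(TOTAL_PARAMS, BITS)
active_gib = weight_gib(ACTIVE_PARAMS, BITS)
print(f"total weights : {total_gib:.1f} GiB")
print(f"active weights: {active_gib:.1f} GiB")
print(f"total/active  : {TOTAL_PARAMS / ACTIVE_PARAMS:.1f}x")
```

Under this assumption the full weights exceed 6GB, so they cannot all reside in VRAM at once, whereas the weights touched by any single token are far smaller. How the weights are actually placed in the cited deployment is not stated here.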
Related Concepts
- Sparse Mixture of Experts
- Switch Transformer
- Token Routing
- Inference Acceleration
- Expert Balancing