Qwen 3.6 35B-A3B
Overview
- 35B-parameter mixture-of-experts (MoE) language model from the Qwen series
- The A3B suffix denotes ~3B active parameters per token: per-token compute scales with the ~3B active parameters, while memory must still hold all 35B weights
- Architecture: sparse MoE with dense attention and optimized expert gating; instruction-tuned for reasoning and code
- Training: Multilingual corpus, heavy code synthesis, aligned for complex tool-use and long-context retention
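
The sparse activation described above can be illustrated with a generic top-k gating sketch. This is a textbook form of MoE routing, not a confirmed description of this model's gating; the expert count (8), the value of k (2), and the toy scalar "experts" are all hypothetical.

```python
import math

def route_top_k(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(gate_logits)
    exps = [math.exp(x - peak) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable experts; the others are never evaluated.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, gate_logits, k):
    """Weighted sum of the outputs of the selected experts only."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_logits, k))

# Hypothetical toy layer: 8 scalar "experts", 2 active per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

selected = route_top_k(gate_logits, k=2)
output = moe_forward(1.0, experts, gate_logits, k=2)
print(selected, output)
```

Only the selected experts run for a given token, which is why per-token compute tracks the active parameter count rather than the total.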
Local Deployment & Performance
- Inference validated on a 6GB VRAM budget using llama.cpp with GGUF model files
- "Achieving Fast 35B MoE AI Model Performance on 6GB VRAM with Llama.cpp" documents:
  - Successful execution on 8-year-old consumer GPU hardware
  - Q4_K_M / Q5_K_S quantization to reduce the memory footprint while preserving routing fidelity
  - Hybrid CPU/GPU offloading and KV cache paging to avoid out-of-memory (OOM) failures as the context grows
  - Interactive per-token latency, achieved through thread scheduling and compute-graph optimization
- Sequences longer than 16K tokens require strict memory management and inference optimization, including context trimming
- Compatible with Ollama, ExLlamaV2, vLLM, and TensorRT-LLM, given architecture-specific patches
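
As a rough illustration of the hybrid CPU/GPU offloading listed above, the sketch below assembles a llama.cpp `llama-server` command line. The flags `--model`, `--n-gpu-layers`, `--ctx-size`, and `--threads` are standard llama.cpp options; the model file name, layer count, and thread count are hypothetical placeholders, not values taken from the referenced write-up, and the helper `build_llama_server_args` exists only for this example.

```python
import shlex

def build_llama_server_args(model_path, n_gpu_layers, ctx_size, threads):
    """Assemble an argv list for llama.cpp's llama-server (hypothetical helper)."""
    return [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", str(n_gpu_layers),  # layers kept in VRAM; the rest run on the CPU
        "--ctx-size", str(ctx_size),          # cap the context to bound KV cache growth
        "--threads", str(threads),            # CPU threads for the layers left on the host
    ]

# Placeholder values: tune n_gpu_layers down until the model loads without OOM.
args = build_llama_server_args(
    "qwen-35b-a3b.Q4_K_M.gguf",  # hypothetical file name
    n_gpu_layers=12,
    ctx_size=16384,
    threads=8,
)
command = shlex.join(args)
print(command)
```

In practice the number of GPU layers is usually found by trial: raise it until VRAM is nearly full at the target context size, then back off slightly.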
Technical Specifications
- Total Parameters: 35B | Active Parameters: ~3B/token
- Context Window: 32K–128K tokens (depends on quantization and available RAM)
- Recommended Quantization: GGUF Q4_K_M / Q5_K_S for 6–8GB targets; Q3_K_S for <6GB
- Inference Frameworks: llama.cpp, Ollama, MLC LLM
- License: Apache 2.0 / Qwen Community License
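
The figures above can be turned into a back-of-the-envelope memory estimate. This is a sketch under stated assumptions: the bits-per-weight values are approximate round numbers for llama.cpp K-quants, and the attention shape used for the KV cache (40 layers, 4 KV heads, head dimension 128, FP16 cache) is a hypothetical placeholder, not the model's published configuration.

```python
GIB = 1024 ** 3

# Assumed approximate bits per weight for llama.cpp K-quants; real files vary by tensor mix.
BITS_PER_WEIGHT = {"Q3_K_S": 3.5, "Q4_K_M": 4.85, "Q5_K_S": 5.5}

def weight_bytes(n_params, quant):
    """Approximate size of n_params weights stored at the given quantization."""
    return n_params * BITS_PER_WEIGHT[quant] / 8

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """KV cache size for dense attention: one K and one V vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

TOTAL_PARAMS = 35e9   # total parameters, from the spec list
ACTIVE_PARAMS = 3e9   # approximate active parameters per token, from the spec list

for quant in BITS_PER_WEIGHT:
    total = weight_bytes(TOTAL_PARAMS, quant) / GIB
    active = weight_bytes(ACTIVE_PARAMS, quant) / GIB
    print(f"{quant}: all weights ~{total:.1f} GiB, active per token ~{active:.1f} GiB")

# Hypothetical attention shape (NOT the model's published config).
for n_tokens in (16_384, 32_768, 131_072):
    size = kv_cache_bytes(40, 4, 128, n_tokens) / GIB
    print(f"{n_tokens} tokens: KV cache ~{size:.2f} GiB")
```

Under these assumptions the full Q4_K_M weights come to roughly 20 GiB, well above a 6GB card, so only part of the model can reside in VRAM, and the KV cache grows linearly with context length on top of that.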
Related Concepts
- MoE architecture efficiency tradeoffs
- VRAM optimization techniques for sparse-activation models
- GGUF-format quantization standards
- KV Cache Management for long-context local inference