Quantization Techniques
Process of reducing the numerical precision of model weights and activations to decrease memory footprint and accelerate inference with minimal degradation in output quality.
Role in LLM Inference
- Critical for efficient memory management when deploying large language models, allowing models to fit within limited VRAM.
- Reduces I/O bandwidth requirements and latency during model loading and runtime execution.
- See Technical Overview of LLM Inference: Loading, Memory, and Quantization for comprehensive analysis of loading mechanics, memory overhead, and quantization effects.
- Enables inference on consumer-grade hardware by reducing the bytes stored per parameter, usually without significant quality loss.
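
The memory argument above comes down to simple arithmetic: weight size is roughly parameter count times bits per parameter. The sketch below illustrates this for a hypothetical 7B-parameter model. The function name and the model size are illustrative assumptions, and the estimate ignores activations, the KV cache, and the scale factors that real quantized formats store alongside the weights.

```python
# Rough weight-only memory estimate. Assumption: every parameter is stored
# at the same bit width, with no per-block scale or metadata overhead.
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gib(num_params: int, fmt: str) -> float:
    """Return the approximate size of the weights in GiB for a given format."""
    bits = BITS_PER_PARAM[fmt]
    return num_params * bits / 8 / 2**30


params = 7_000_000_000  # hypothetical 7B-parameter model
for fmt in BITS_PER_PARAM:
    print(f"{fmt}: {weight_memory_gib(params, fmt):.1f} GiB")
```

Under these assumptions, each halving of the bit width halves the weight footprint, which is what moves a model from datacenter-class memory into the range of a consumer GPU.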
Methods
- Post-Training Quantization (PTQ): Applies quantization after training; fast and requires no retraining, but accuracy can drop on sensitive layers.
- Quantization-Aware Training (QAT): Simulates quantization noise during training; retains accuracy better than PTQ, but requires a full retraining cycle.
- Weight-Only Quantization: Compresses static weights while maintaining activations in higher precision; standard for many inference engines.
- Mixed-Precision: Assigns variable precision to layers based on sensitivity analysis to balance speed and fidelity.
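
As a minimal sketch of weight-only post-training quantization, the example below applies symmetric absmax scaling to map floating-point weights to INT8 and back. This is one common scheme, not the only one; the function names and sample weights are illustrative assumptions, and production engines typically quantize per channel or per block rather than per tensor.

```python
def quantize_absmax_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 1.0, -0.97]  # hypothetical weight values
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("quantized:", q)
print("max round-trip error:", max_err)
```

The round-trip error per weight is bounded by half the scale, so a single outlier weight inflates the error for the whole tensor. That is one reason sensitive layers may be kept at higher precision in a mixed-precision setup.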
Formats & Standards
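- FP16/BF16: 16-bit floating point; baseline for modern inference, halves size vs FP32. BF16 keeps the 8-bit exponent of FP32 with a shorter mantissa, while FP16 has a 5-bit exponent and a longer mantissa.
- INT8/INT4: 8-bit and 4-bit integer quantization; aggressive compression, requires hardware support for low-bit integer arithmetic or software emulation.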
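- GGUF/NNCF: GGUF is a model file format from the llama.cpp ecosystem that stores quantized weights; NNCF (Neural Network Compression Framework) is a toolkit from the OpenVINO project that implements quantization workflows. Both support distributing and running quantized models, including on edge devices.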
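
To illustrate why INT4 needs dedicated storage handling, the sketch below packs two signed 4-bit values into each byte and unpacks them again. The nibble order and function names are assumptions made for illustration; this is not the layout of GGUF or any specific format, and real formats typically also store scale factors for groups of weights.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (range -8..7) two per byte, low nibble first."""
    if any(v < -8 or v > 7 for v in values):
        raise ValueError("values must fit in signed 4 bits (-8..7)")
    # Pad with a zero so an odd-length input still fills whole bytes.
    padded = list(values) + [0] * (len(values) % 2)
    out = bytearray()
    for lo, hi in zip(padded[0::2], padded[1::2]):
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)


def unpack_int4(data, count):
    """Inverse of pack_int4; count drops any padding nibble."""
    values = []
    for byte in data:
        for nibble in (byte & 0x0F, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1 (two's complement).
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]


vals = [-8, 7, 0, -1, 3]
packed = pack_int4(vals)
print(len(vals), "values stored in", len(packed), "bytes")
```

Because no mainstream CPU instruction set addresses 4-bit elements directly, kernels must unpack nibbles like this (or use specialized instructions) before doing arithmetic.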