Quantization Techniques

The process of reducing the numerical precision of model weights and activations to shrink the memory footprint and speed up inference while keeping the loss in model accuracy small.

Role in LLM Inference

Methods

  • Post-Training Quantization (PTQ): Quantizes an already trained model; fast and requires no retraining, but accuracy can drop on sensitive layers.
  • Quantization-Aware Training (QAT): Simulates quantization error during training so the model learns to compensate; retains accuracy better than PTQ but requires retraining or fine-tuning.
  • Weight-Only Quantization: Compresses static weights while maintaining activations in higher precision; standard for many inference engines.
  • Mixed-Precision: Assigns different precisions to different layers based on sensitivity analysis, balancing speed against fidelity.
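
As an illustration of the PTQ and weight-only entries above, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. The function names, the per-tensor scale, and the sample weights are assumptions made for this example; real inference engines commonly use per-channel or per-group scales and optimized kernels.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of floats to the range [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Guard against an all-zero tensor, where any positive scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float values."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only to illustrate the round trip.
weights = [0.02, -1.27, 0.64, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The stored model keeps only the integer codes and the scale. Loosely speaking, QAT applies this same quantize-then-dequantize round trip inside the training forward pass so that the weights adapt to the rounding error.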

Formats & Standards

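  • FP16/BF16: 16-bit floating-point formats; the usual baseline for modern inference, halving model size relative to FP32. BF16 keeps FP32's 8-bit exponent range but has fewer mantissa bits than FP16.
  • INT8/INT4: 8-bit and 4-bit integer formats; compress more aggressively (roughly 4x and 8x vs FP32) and require hardware support or software emulation.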
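  • GGUF/NNCF: GGUF is a model file format from the llama.cpp ecosystem that stores quantized weights; NNCF (Neural Network Compression Framework) is a toolkit from the OpenVINO ecosystem that implements quantization workflows. Both target distributed and edge inference.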
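
The size claims in the list above come down to simple arithmetic. The sketch below estimates weight storage per format; the 7-billion parameter count is a hypothetical example, and the estimate deliberately ignores quantization scales, zero points, and file metadata, which add some overhead in practice.

```python
# Bits used to store one weight in each format.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "BF16": 16, "INT8": 8, "INT4": 4}


def weight_bytes(num_params: int, fmt: str) -> int:
    """Bytes needed to store num_params weights, ignoring scales and metadata."""
    bits = num_params * BITS_PER_WEIGHT[fmt]
    return (bits + 7) // 8  # round up to whole bytes


num_params = 7_000_000_000  # hypothetical 7B-parameter model
for fmt in BITS_PER_WEIGHT:
    gib = weight_bytes(num_params, fmt) / 2**30
    print(f"{fmt}: {gib:.1f} GiB")
```

Under these assumptions, the same model needs 28 GB of weight storage in FP32, 14 GB in FP16 or BF16, 7 GB in INT8, and 3.5 GB in INT4 (decimal gigabytes).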