Google QAT

Google Quantization-Aware Training (QAT) refers to the process of simulating quantization effects during the training phase of a neural network. Unlike Post-Training Quantization (PTQ), which applies quantization to already-trained weights, QAT allows the model to adapt its parameters to the precision loss introduced by lower-bit formats (e.g., 4-bit or 8-bit). This typically results in better accuracy retention and robustness than PTQ, particularly for smaller models, where precision loss has a larger impact.
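To make the precision loss concrete, the following is a minimal sketch of a quantize-then-dequantize round trip, assuming symmetric per-tensor quantization with a scale derived from the largest absolute value. The function names and sample weights are illustrative assumptions and are not taken from any Google implementation.

```python
def fake_quantize(values, num_bits):
    """Quantize to signed num_bits integers, then dequantize back to floats."""
    qmax = 2 ** (num_bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))   # integer grid index
        out.append(q * scale)                         # back to float
    return out


def max_abs_error(values, num_bits):
    """Largest round-trip error introduced by quantization."""
    quantized = fake_quantize(values, num_bits)
    return max(abs(v - q) for v, q in zip(values, quantized))


weights = [0.82, -0.41, 0.05, -0.97, 0.33]      # hypothetical weights
err8 = max_abs_error(weights, 8)
err4 = max_abs_error(weights, 4)
print(f"8-bit max error: {err8:.4f}, 4-bit max error: {err4:.4f}")
```

The 4-bit grid has far fewer levels than the 8-bit grid, so its round-trip error is noticeably larger. This is the error that QAT exposes to the model during training.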

Key Characteristics

  • Training Integration: Quantization is simulated in the forward pass by fake quantization nodes; in the backward pass, gradients flow through these nodes, typically via a straight-through estimator.
  • Accuracy Preservation: Mitigates performance degradation associated with aggressive bit-width reduction (e.g., INT4).
  • Compute Overhead: Higher training cost than PTQ, because fake quantization operations run at every training step and the model requires additional training or fine-tuning.
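The training integration described above can be sketched in a few lines. This is a toy illustration only, assuming a one-weight linear model, a fixed quantization step of 0.25 on a signed 4-bit grid, and plain gradient descent; the names fake_quant and train_qat are hypothetical and do not correspond to any library API.

```python
def fake_quant(w, step=0.25, qmax=7):
    """Forward pass: snap w to a signed 4-bit grid with a fixed step (assumed)."""
    q = max(-qmax, min(qmax, round(w / step)))
    return q * step


def train_qat(xs, ys, w=0.0, lr=0.05, epochs=200):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    for _ in range(epochs):
        wq = fake_quant(w)                      # forward: quantized weight
        # Gradient of the mean squared error with respect to the quantized weight.
        grad_wq = sum(2 * (wq * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Straight-through estimator: treat d(wq)/d(w) as 1, because round()
        # has zero gradient almost everywhere.
        w -= lr * grad_wq                       # update the latent float weight
    return w, fake_quant(w)


xs = [1.0, 2.0, 3.0]
ys = [0.6 * x for x in xs]                      # data generated with w_true = 0.6
latent_w, deployed_w = train_qat(xs, ys)
print(f"latent weight: {latent_w:.3f}, deployed (quantized) weight: {deployed_w}")
```

Because 0.6 is not on the grid, the latent weight hovers near the rounding boundary between the two neighbouring grid points (0.5 and 0.75), and the deployed weight is one of them. The extra cost is visible in the loop: every step performs a quantization in the forward pass on top of the usual gradient computation.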

Implementations & Variants

Google has applied QAT techniques across various model families, including Gemma. Individual implementations are often tailored to particular hardware accelerators or deployment constraints.

Gemma Series

For the Gemma family of open-weight models, Google provides official quantized versions to facilitate efficient deployment on consumer-grade hardware.