01/26/2022 | News release | Distributed by Public on 01/27/2022 15:22
Usually, when you're presented with three options, such as powerful, efficient, and low cost, you're only allowed to pick one or two. Machine learning (ML) practitioners developing neural networks for mobile don't always have the luxury of picking their top one or two choices, because their models generally need to be fast, small, and low-power all at once to be effective.
In our recent blog post, Neural Network Optimization with AIMET, we discussed how Qualcomm Innovation Center's (QuIC's) open-source AI Model Efficiency Toolkit (AIMET) provides advanced quantization and compression techniques and how it can be used with the Qualcomm Neural Processing SDK. With AIMET, developers can optimize their ML models to not only reduce their size, but also reduce the amount of power required for inference while maintaining accuracy requirements.
Previously, Qualcomm AI Research published A White Paper on Neural Network Quantization, which provides an in-depth treatment of quantization. Their subsequent whitepaper, Neural Network Quantization with AI Model Efficiency Toolkit (AIMET), provides extensive details and a practical guide for two categories of quantization using AIMET:
In this blog post, we look at the post-training quantization (PTQ) techniques discussed in the whitepaper, highlighting their key attributes and strengths and when to use them.
To understand the significance of PTQ methods, it's important to remember what quantization is trying to achieve. In a nutshell, quantization involves mapping a set of values from a large domain onto a smaller domain. This allows us to use smaller bit-width representations for these values (e.g., 8-bit integers rather than 32-bit floating point values), thus reducing the number of bits that need to be stored, transferred, and processed. Furthermore, most processors, including the Qualcomm Hexagon DSP found on Snapdragon mobile platforms, generally perform fixed-point (i.e., integer) math much faster and more efficiently than floating-point math.
Since today's neural networks typically represent weight and activation tensors using 32-bit floating point values, it can be highly beneficial to quantize these values to smaller representations, down to as low as 4-bit.
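As a rough illustration of the mapping described above, the following sketch quantizes a handful of floating-point values onto an 8-bit unsigned integer grid and maps them back. It is a minimal, pure-Python example of per-tensor asymmetric quantization, not AIMET's implementation; the function names and sample values are assumptions made for illustration.

```python
def quantize(values, num_bits=8):
    """Map floats onto an unsigned integer grid (asymmetric, per-tensor)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The representable range is stretched to include zero so that 0.0 maps exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(-lo / scale)
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map integer grid points back to approximate float values."""
    return [(qi - zero_point) * scale for qi in q]


weights = [-1.2, -0.3, 0.0, 0.45, 2.1]  # illustrative values only
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored as one byte instead of four, and for values inside the range the round-trip error stays within about half a grid step (scale / 2).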
However, as the whitepaper points out, quantization can introduce noise and thus reduce accuracy. This can happen for various reasons, such as when large outlier values are clipped to quantized ranges. That's why we've put so much effort into advancing quantization methods for different types of neural networks.
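To make the outlier problem concrete, here is a small sketch (with made-up data, not taken from the whitepaper) that quantizes the same tensor with two different ranges. Covering the outlier stretches the grid and makes every other value coarser, while a tight range keeps the bulk precise but clips the outlier.

```python
def fake_quantize(values, lo, hi, num_bits=8):
    """Quantize to a uniform grid over [lo, hi], then map back to floats."""
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels
    out = []
    for v in values:
        clipped = min(hi, max(lo, v))  # values outside the range are clipped
        out.append(round((clipped - lo) / scale) * scale + lo)
    return out


def abs_errors(original, quantized):
    return [abs(a - b) for a, b in zip(original, quantized)]


bulk = [i / 100 - 1.0 for i in range(201)]  # 201 values spread over [-1, 1]
tensor = bulk + [40.0]                      # plus one large outlier (illustrative)

# Option A: the range covers the outlier, so the grid is coarse for the bulk.
wide = abs_errors(tensor, fake_quantize(tensor, -1.0, 40.0))
# Option B: the range covers only the bulk, so the outlier is clipped.
tight = abs_errors(tensor, fake_quantize(tensor, -1.0, 1.0))

print("max bulk error, wide range: ", max(wide[:-1]))
print("max bulk error, tight range:", max(tight[:-1]))
print("outlier error, tight range: ", tight[-1])
```

Neither choice is free: one spreads noise across all values, the other concentrates it in the clipped ones.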
AIMET's PTQ methods currently include:
These methods are intended to be used in different parts of a typical optimization workflow.
The whitepaper proposes the following workflow for employing AIMET's PTQ methods:
Given a pre-trained FP model, the workflow involves the following:
Let's take a closer look at CLE, Bias Correction, and AdaRound.
CLE equalizes the weight ranges in the network by using the scale-equivariance property of activation functions (i.e., it rescales weight tensors to reduce the amplitude variation across channels). This improves quantized accuracy for many common computer vision architectures. CLE is especially beneficial for models with depth-wise separable convolution layers.
AIMET has APIs for CLE, including the equalize_model() function for PyTorch, as shown in the following code example:
from torchvision import models
from aimet_torch.cross_layer_equalization import equalize_model
model = models.resnet18(pretrained=True).eval()
input_shape = (1, 3, 224, 224)
# Performs batch normalization folding, cross-layer scaling, and high-bias absorption.
# Note that equalize_model() modifies the given model in place.
equalize_model(model, input_shape)
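The rescaling idea behind CLE can be sketched on a toy two-layer network without any framework. The example below assumes the per-channel scale s_i = sqrt(r1_i / r2_i) from the cross-layer equalization literature, where r1_i and r2_i are the weight ranges of channel i in the first and second layer; the helper names and weights are hypothetical, and this is not how AIMET implements it internally. Because ReLU satisfies relu(s * x) = s * relu(x) for s > 0, the network output is unchanged.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forward(W1, b1, W2, x):
    hidden = relu([h + b for h, b in zip(matvec(W1, x), b1)])
    return matvec(W2, hidden)


def equalize(W1, b1, W2):
    """Rescale each hidden channel so both layers see the same weight range for it.

    Assumes every channel has a non-zero range in both layers.
    """
    W1_eq, b1_eq = [], []
    W2_eq = [row[:] for row in W2]
    for i, row in enumerate(W1):
        r1 = max(abs(w) for w in row)                    # range of output channel i, layer 1
        r2 = max(abs(W2[j][i]) for j in range(len(W2)))  # range of input channel i, layer 2
        s = math.sqrt(r1 / r2)
        W1_eq.append([w / s for w in row])
        b1_eq.append(b1[i] / s)
        for j in range(len(W2)):
            W2_eq[j][i] = W2[j][i] * s
    return W1_eq, b1_eq, W2_eq


# Channel 0 of the first layer has a far larger range than channel 1.
W1 = [[8.0, -4.0], [0.02, 0.01]]
b1 = [0.5, -0.001]
W2 = [[0.1, 5.0], [-0.05, 2.0]]
x = [1.0, 0.5]

W1_eq, b1_eq, W2_eq = equalize(W1, b1, W2)
before = forward(W1, b1, W2, x)
after = forward(W1_eq, b1_eq, W2_eq, x)
print(before, after)
```

After equalization, the per-channel ranges in the first layer are much closer to each other, so a single per-tensor quantization grid wastes fewer levels on the small channels.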
Bias Correction fixes shifts in layer outputs introduced by quantization. When the noise due to weight quantization is biased, it also introduces a shift (i.e., a bias) in the layer activations. The root cause is often clipped outlier values, which shift the expected distribution. Bias Correction adapts a layer's bias parameter using a correction term to compensate for the bias in the noise, and thus recovers at least some of the original model's accuracy.
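For a single linear layer, the correction term can be sketched as follows. If the quantized weights are W_q = W + e, the expected output shifts by e * E[x], so subtracting that shift from the bias restores the expected output. The example estimates E[x] from a few calibration inputs; the function names, the coarse 0.25-step quantizer, and the data are assumptions for illustration and do not reflect AIMET's API.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def layer(W, b, x):
    return [y + bi for y, bi in zip(matvec(W, x), b)]


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def correct_bias(W, W_q, b, calibration_inputs):
    """Subtract the expected output shift caused by the weight quantization error."""
    mean_x = mean_vector(calibration_inputs)
    shift = [yq - y for yq, y in zip(matvec(W_q, mean_x), matvec(W, mean_x))]
    return [bi - s for bi, s in zip(b, shift)]


W = [[0.26, 0.74], [-0.51, 0.33]]
W_q = [[round(w / 0.25) * 0.25 for w in row] for row in W]  # deliberately coarse grid
b = [0.1, -0.2]
calibration_inputs = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]

b_corrected = correct_bias(W, W_q, b, calibration_inputs)

fp_mean = mean_vector([layer(W, b, x) for x in calibration_inputs])
q_mean = mean_vector([layer(W_q, b, x) for x in calibration_inputs])
q_corrected_mean = mean_vector([layer(W_q, b_corrected, x) for x in calibration_inputs])
print(fp_mean, q_mean, q_corrected_mean)
```

The corrected layer matches the floating-point layer in expectation over the calibration data, although individual outputs still carry quantization noise.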
AIMET supports two Bias Correction approaches:
Typically, quantization projects values from a larger domain onto a smaller domain known as the grid. It then rounds values to the nearest grid point (e.g., a whole number) as shown in Figure 2:
However, rounding-to-nearest is not always optimal. AdaRound is an effective and efficient method that uses a small amount of data to determine how to make the rounding decision and adapt the weights for better quantized performance. AdaRound is particularly useful for quantizing to a low bit-width, such as 4-bit integer, with a post-training approach.
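A toy example shows why rounding-to-nearest can be suboptimal. The sketch below tries every up/down rounding choice for a three-weight layer and keeps the one with the lowest output error on a few calibration samples. This brute-force search only illustrates the objective; AdaRound itself solves the problem with a learned, continuous relaxation rather than enumeration, and all names and values here are hypothetical.

```python
import itertools
import math


def output_error(w, w_q, samples):
    """Mean squared difference between the FP and quantized layer outputs."""
    total = 0.0
    for x in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))
        y_q = sum(wi * xi for wi, xi in zip(w_q, x))
        total += (y - y_q) ** 2
    return total / len(samples)


def nearest_rounding(w, step):
    return [round(wi / step) * step for wi in w]


def best_rounding(w, step, samples):
    """Try every up/down choice per weight; keep the lowest output error."""
    floors = [math.floor(wi / step) for wi in w]
    best_candidate, best_err = None, None
    for choice in itertools.product((0, 1), repeat=len(w)):
        candidate = [(f + c) * step for f, c in zip(floors, choice)]
        err = output_error(w, candidate, samples)
        if best_err is None or err < best_err:
            best_candidate, best_err = candidate, err
    return best_candidate, best_err


w = [0.4, 0.4, 0.4]  # all three weights round down under nearest rounding
step = 1.0           # very coarse grid, to exaggerate the effect
samples = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 2.0, 1.0]]

w_nearest = nearest_rounding(w, step)
nearest_err = output_error(w, w_nearest, samples)
w_best, best_err = best_rounding(w, step, samples)
print(w_nearest, nearest_err)
print(w_best, best_err)
```

Rounding one weight up against the nearest-rounding rule lets its error partly cancel the errors of the others, which is the effect AdaRound exploits at scale.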
AIMET provides a high-level API for performing AdaRound that exports a model with updated weights and a JSON file with the corresponding encodings.
For additional information about AdaRound, check out Up or Down? Adaptive Rounding for Post-Training Quantization.
The whitepaper shows impressive results for AIMET's PTQ methods.
Table 1 below shows the accuracy of common neural network models for object classification and semantic segmentation, both as standalone FP32 models and after quantization to 8-bit integers, using AIMET's CLE and Bias Correction methods:
Table 1 - Accuracies of FP32 models versus those optimized with AIMET's CLE and Bias Correction methods
Model | Baseline (FP32 Model) | AIMET 8-bit Quantized (with CLE and Bias Correction)
MobileNetv2 (Top-1 Accuracy) | 71.72% | 71.08%
ResNet (Top-1 Accuracy) | 76.05% | 75.45%
DeepLab v3 (Mean IoU) | 72.65% | 71.91%
In all three cases, the loss in accuracy (versus the FP32 model) is less than one percentage point, while model size shrinks by a factor of four as values go from 32-bit to 8-bit. Power and performance improvements depend on the model and the hardware, but in general, going from FP32 to INT8 can provide up to a 16 times improvement in power efficiency.
Table 2 shows the model accuracy of FP32 values compared to quantization using nearest-rounding, or AdaRound rounding, on an object detection model for Advanced Driver-Assistance System (ADAS):
Table 2 - Comparison of a model's accuracy as FP32 versus quantization using standard rounding to the nearest grid point, and quantization with rounding guided by AdaRound for an object detection model
Configuration | Mean Average Precision
FP32 | 82.20%
Nearest Rounding (W8A8) | 49.85%
AdaRound (W8A8) | 81.21%
Here we can see a significant difference in accuracy between standard nearest-rounding and AdaRound, the latter being within one percentage point of the original FP32 model. Again, quantizing from 32-bit to 8-bit reduced the model size by a factor of four.
For additional information, be sure to check out the whitepaper here, as well as the following resources: