High Granularity Quantization 2
HGQ2 (High Granularity Quantization 2) is a quantization-aware training framework built on Keras v3, targeting real-time deep learning applications on edge devices such as FPGAs. It provides a comprehensive set of tools for creating and training quantized neural networks with minimal effort.
HGQ2 implements a gradient-based algorithm for quantization-aware training with automatic bitwidth optimization. By leveraging gradients, it can optimize bitwidths at arbitrary granularity, up to the per-weight and per-activation level.
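In broad terms, the bitwidths are optimized jointly with the model weights by minimizing a combined objective of the form loss = task_loss + beta * EBOPs, where the EBOP term estimates hardware cost and beta (set via beta0 in the configuration example below) trades accuracy against resources. This formula is a simplified reading of the mechanism based on the features described here, not a statement of HGQ2's exact objective.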
Key Features
- Multi-backend support: Works with TensorFlow, JAX, and PyTorch through Keras v3 (backend selection is shown in the snippet after this list)
- Flexible quantization: Supports different quantization schemes, including fixed-point and minifloat
- Hardware synthesis: Direct integration with hls4ml for FPGA deployment (see the end-to-end sketch after the model example below)
- Trainable quantization parameters: Bitwidths are optimized through gradient-based methods
- Effective Bit-Operations (EBOPs): Accurate resource estimation of the deployed firmware during training
- Advanced layer support: Quantized einsum, einsum dense, and multi-head attention layers, with hardware synthesis support
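Because HGQ2 is built on Keras v3, the compute backend is chosen with the standard Keras mechanism before keras is imported. The snippet below only illustrates that standard mechanism and is not HGQ2-specific.

import os

# Select any Keras v3 backend before importing keras:
# 'tensorflow', 'jax', or 'torch'
os.environ['KERAS_BACKEND'] = 'jax'

import keras  # noqa: E402

# Confirm which backend is active, e.g. 'jax'
print(keras.backend.backend())

With a backend selected, a quantized model can be defined with the Q-prefixed layers, as in the example below.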
import keras
from hgq.layers import QDense, QConv2D
from hgq.config import LayerConfigScope, QuantizerConfigScope

# Setup quantization configuration
# These values are the defaults, just for demonstration purposes here
with (
    # Configuration scopes for setting the default quantization type and overflow mode
    # The second configuration scope overrides the first one for the 'datalane' place
    QuantizerConfigScope(place='all', default_q_type='kbi', overflow_mode='SAT_SYM'),
    QuantizerConfigScope(place='datalane', default_q_type='kif', overflow_mode='WRAP'),
    # Configuration scope for enabling EBOPs and setting the beta0 value
    LayerConfigScope(enable_ebops=True, beta0=1e-5),
):
    model = keras.Sequential([
        QConv2D(32, (3, 3), activation='relu'),
        keras.layers.MaxPooling2D((2, 2)),
        keras.layers.Flatten(),
        QDense(10),
    ])
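Once constructed, the model trains like any other Keras model and can then be handed to hls4ml for FPGA synthesis. The sketch below is illustrative only: it uses random stand-in data and a hypothetical output directory, and it assumes hls4ml's generic Keras conversion entry points (config_from_keras_model and convert_from_keras_model); the exact conversion options for HGQ2 models may differ, so consult the hls4ml and HGQ2 documentation.

import numpy as np
import hls4ml

# Standard Keras training; with enable_ebops=True the EBOP estimate is
# expected to enter the loss as a regularization term scaled by beta0 (see above)
model.compile(
    optimizer='adam',
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

# Random stand-in data, only to make the sketch self-contained
x_train = np.random.rand(256, 28, 28, 1).astype('float32')
y_train = np.random.randint(0, 10, size=(256,))
model.fit(x_train, y_train, epochs=1, batch_size=32)

# Convert to an hls4ml project (assumed flow; options may differ for HGQ2 models)
hls_config = hls4ml.utils.config_from_keras_model(model, granularity='name')
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=hls_config,
    output_dir='hls4ml_prj',  # hypothetical output directory
    backend='Vitis',          # assumed backend
)
hls_model.compile()  # builds the C simulation for bit-accurate verification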
Index
API Reference:
- hgq package
- hgq.config package
- hgq.constraints package
- hgq.layers package
- Subpackages
- Submodules
- hgq.layers.activation module
- hgq.layers.batch_normalization module
- hgq.layers.conv module
- hgq.layers.einsum_dense_batchnorm module
- hgq.layers.linformer_attention module
- hgq.layers.multi_head_attention module
- hgq.layers.pooling module
- hgq.layers.softmax module
- Module contents
QAdd
QAveragePooling1D
QAveragePooling2D
QAveragePooling3D
QAveragePow2
QAvgPool1D
QAvgPool2D
QAvgPool3D
QBatchNormDense
QBatchNormalization
QConv1D
QConv2D
QConv3D
QDense
QDot
QEinsum
QEinsumDense
QEinsumDenseBatchnorm
QGlobalAveragePooling1D
QGlobalAveragePooling2D
QGlobalAveragePooling3D
QGlobalAvgPool1D
QGlobalAvgPool2D
QGlobalAvgPool3D
QGlobalMaxPool1D
QGlobalMaxPool2D
QGlobalMaxPool3D
QGlobalMaxPooling1D
QGlobalMaxPooling2D
QGlobalMaxPooling3D
QLinformerAttention
QMaxPool1D
QMaxPool2D
QMaxPool3D
QMaxPooling1D
QMaxPooling2D
QMaxPooling3D
QMaximum
QMeanPow2
QMinimum
QMultiHeadAttention
QMultiply
QSoftmax
QSubtract
QSum
QUnaryFunctionLUT
Quantizer
- hgq.quantizer package
- hgq.regularizers package
- hgq.utils package
- qkeras package