I'm wondering what the current available options are for simulating BatchNorm folding during quantization aware training in Tensorflow 2. Tensorflow 1 has the tf.contrib.quantize.create_training_graph
function which inserts FakeQuantization layers into the graph and takes care of simulating batch normalization folding (according to this white paper).
Tensorflow 2 has a tutorial on how to use quantization in their recently adopted tf.keras
API, but they don't mention anything about batch normalization. I tried the following simple example with a BatchNorm layer:
import tensorflow_model_optimization as tfmo
model = tf.keras.Sequential([
l.Conv2D(32, 5, padding='same', activation='relu', input_shape=input_shape),
l.MaxPooling2D((2, 2), (2, 2), padding='same'),
l.Conv2D(64, 5, padding='same', activation='relu'),
l.BatchNormalization(), # BN!
l.MaxPooling2D((2, 2), (2, 2), padding='same'),
l.Flatten(),
l.Dense(1024, activation='relu'),
l.Dropout(0.4),
l.Dense(num_classes),
l.Softmax(),
])
model = tfmo.quantization.keras.quantize_model(model)
It however gives the following exception:
RuntimeError: Layer batch_normalization:<class 'tensorflow.python.keras.layers.normalization.BatchNormalization'> is not supported. You can quantize this layer by passing a `tfmot.quantization.keras.QuantizeConfig` instance to the `quantize_annotate_layer` API.
which indicates that TF does not know what to do with it.
I also saw this related topic where they apply tf.contrib.quantize.create_training_graph
on a keras constructed model. They however don't use BatchNorm layers, so I'm not sure this will work.
So what are the options for using this BatchNorm folding feature in TF2? Can this be done from the keras API, or should I switch back to the TensorFlow 1 API and define a graph the old way?
Quantization aware training emulates inference-time quantization, creating a model that downstream tools will use to produce actually quantized models. The quantized models use lower-precision (e.g. 8-bit instead of 32-bit float), leading to benefits during deployment.
Quantization-aware training helps you train DNNs for lower precision INT8 deployment, without compromising on accuracy. This is achieved by modeling quantization errors during training which helps in maintaining accuracy as compared to FP16 or FP32.
Post-training quantization includes general techniques to reduce CPU and hardware accelerator latency, processing, power, and model size with little degradation in model accuracy. These techniques can be performed on an already-trained float TensorFlow model and applied during TensorFlow Lite conversion.
As the name suggests scale parameter is used to scale back the low-precision values back to the floating-point values. It is stored in full precision for better accuracy. On the other hand, zero-point is a low precision value that represents the quantized value that will represent the real value 0.
If you add BatchNormalization before activation, you would not have issues with Quantization. Note: Quantization is supported in BatchNormalization only if it the layer is exactly after Conv2D layer. https://www.tensorflow.org/model_optimization/guide/quantization/training
# Change
l.Conv2D(64, 5, padding='same', activation='relu'),
l.BatchNormalization(), # BN!
# with this
l.Conv2D(64, 5, padding='same'),
l.BatchNormalization(),
l.Activation('relu'),
#Other way of declaring the same
o = (Conv2D(512, (3, 3), padding='valid' , data_format=IMAGE_ORDERING))(o)
o = (BatchNormalization())(o)
o = Activation('relu')(o)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With