According to the blog post https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/, TensorFlow quantizes values before they enter a layer and dequantizes them after the layer has processed them. TensorFlow quantizes by linearly rescaling values to the range 0 to 255, so it must keep the "min" and "max" of the original float range in order to dequantize the values later.
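For concreteness, here is a minimal sketch of that affine scheme as I understand it (the function names and details are my own, not TensorFlow's actual ops):

```python
import numpy as np

def quantize(x, x_min, x_max):
    """Map float values in [x_min, x_max] linearly onto 0..255."""
    scale = (x_max - x_min) / 255.0
    q = np.round((x - x_min) / scale)
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q, x_min, x_max):
    """Recover approximate floats from uint8 codes and the stored min/max."""
    scale = (x_max - x_min) / 255.0
    return q.astype(np.float32) * scale + x_min

x = np.array([-1.0, 0.0, 0.5, 3.0], dtype=np.float32)
q = quantize(x, x.min(), x.max())
print(q, dequantize(q, x.min(), x.max()))
```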
I would like to ask:

1. How are the "min" and "max" in the outputs of a "quantization" op determined? If we simply take the minimum and maximum of the data and map them to 0 and 255, the accumulated values will overflow or underflow during convolution.
2. How are the "min" and "max" in the outputs of a "convolution" op determined? Both the weights and the activations are quantized, so there are two sets of "min" and "max". How does a convolution op combine them into a single set of "min" and "max" for its output?
TensorFlow uses, among other libraries, gemmlowp for low-precision matrix multiplication. Although the inputs are 8-bit values, intermediate results are accumulated in 32 bits. These 32-bit values are converted back to 8 bits before the results are returned.
From https://github.com/google/gemmlowp/blob/master/doc/low-precision.md:
To avoid overflow, we internally accumulate results on more than 8 bits, and at the end we keep only some significant 8 bits.
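To illustrate that idea, here is a plain NumPy sketch (not gemmlowp's actual code): products of two uint8 matrices are accumulated in int32, and only at the end is the 32-bit result rescaled back to 8 bits. The output range here is simply taken from the data, purely as an assumption; TensorFlow instead carries explicit min/max tensors alongside the result.

```python
import numpy as np

# Two uint8 matrices, e.g. quantized weights and activations.
a = np.random.randint(0, 256, size=(4, 8), dtype=np.uint8)
b = np.random.randint(0, 256, size=(8, 3), dtype=np.uint8)

# Accumulate in int32: a single dot product of length 8 can reach
# 8 * 255 * 255 = 520,200, far beyond the uint8 range.
acc = a.astype(np.int32) @ b.astype(np.int32)

# Requantize: map the 32-bit accumulator range back onto 0..255,
# keeping min/max so the caller can dequantize the output later.
acc_min, acc_max = acc.min(), acc.max()
scale = (acc_max - acc_min) / 255.0
out = np.clip(np.round((acc - acc_min) / scale), 0, 255).astype(np.uint8)
print(out, acc_min, acc_max)
```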