Say I have a float in the range [0, 1] and I want to quantize it and store it in an unsigned byte. Sounds like a no-brainer, but in fact it's quite complicated.
The obvious solution looks like this:
unsigned char QuantizeFloat(float a)
{
    return (unsigned char)(a * 255.0f);
}
This works insofar as I get all numbers from 0 to 255, but the distribution of the integers is not even. The function only returns 255 if a is exactly 1.0f. Not a good solution.
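To make the unevenness visible, here is a rough sketch that counts how many evenly spaced inputs land on each byte. The names QuantizeTrunc and CountTruncBuckets are made up for this illustration, and the helper assumes 1 <= n <= 2^24 so that k and n convert to float exactly.

```c
/* Truncating version from the question. */
unsigned char QuantizeTrunc(float a)
{
    return (unsigned char)(a * 255.0f);
}

/* Hypothetical helper (not from the question): push the n + 1 evenly
   spaced inputs k / n, k = 0..n, through QuantizeTrunc and count how
   often each byte value comes out. Assumes 1 <= n <= 2^24. */
void CountTruncBuckets(unsigned long n, unsigned long counts[256])
{
    for (int i = 0; i < 256; ++i)
        counts[i] = 0;
    for (unsigned long k = 0; k <= n; ++k)
        counts[QuantizeTrunc((float)k / (float)n)]++;
}
```

With n = 1000000, counts[255] ends up as 1 (only the input 1.0f), while each of the other buckets receives several thousand samples.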
If I do proper rounding I just shift the problem:
unsigned char QuantizeFloat(float a)
{
    return (unsigned char)(a * 255.0f + 0.5f);
}
Here the result 0 covers only half as much of the float range as the results 1 to 254 (and the same is true of 255).
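The same kind of counting experiment can be sketched for the rounding version. Again, QuantizeRound and CountRoundBuckets are illustrative names only, and the helper assumes 1 <= n <= 2^24.

```c
/* Rounding version from the question. */
unsigned char QuantizeRound(float a)
{
    return (unsigned char)(a * 255.0f + 0.5f);
}

/* Hypothetical helper (not from the question): count how many of the
   n + 1 evenly spaced inputs k / n, k = 0..n, map to each byte value.
   Assumes 1 <= n <= 2^24. */
void CountRoundBuckets(unsigned long n, unsigned long counts[256])
{
    for (int i = 0; i < 256; ++i)
        counts[i] = 0;
    for (unsigned long k = 0; k <= n; ++k)
        counts[QuantizeRound((float)k / (float)n)]++;
}
```

With n = 1000000, the buckets for 0 and 255 each collect only about half as many samples as any bucket in between, since each of them covers an interval of width 0.5/255 instead of 1/255.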
How do I do a quantization with an equal distribution over the floating point range? Ideally I would like to get an equal distribution of integers if I quantize equally distributed random floats.
Any ideas?
Btw: Although my code is in C, the problem is language-agnostic. For the non-C people: just assume that float to int conversion truncates the float.
EDIT: Since we had some confusion here: I need a mapping that maps the smallest input float (0) to the smallest unsigned char, and the highest float of my range (1.0f) to the highest unsigned byte (255).
How about a * 256.0f with a check to reduce 256 to 255? So something like:
return (unsigned char) (min(255, (int) (a * 256.0f)));
(For a suitable min function on your platform - I can't remember the C function for it.)
Basically you want to divide the range into 256 equal portions, which is what that should do. The edge case for 1.0 going to 256 and requiring rounding down is just because the domain is inclusive at both ends.
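Since standard C has no built-in min for ints, here is one way the clamped version might be spelled out. This is only a sketch; the name QuantizeFloatClamped and the extra guard for negative input are my own additions, not part of the original suggestion.

```c
/* Sketch of the "256 equal buckets" idea with the 1.0 edge case clamped. */
unsigned char QuantizeFloatClamped(float a)
{
    int i = (int)(a * 256.0f);  /* truncates toward zero; 1.0f yields 256 */
    if (i > 255)
        i = 255;                /* fold the closed upper end into the top bucket */
    if (i < 0)
        i = 0;                  /* assumption: tolerate slightly negative input */
    return (unsigned char)i;
}
```

Every bucket i then covers [i/256, (i+1)/256), except that the top bucket additionally contains the single value 1.0f.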
I think what you are looking for is this:
unsigned char QuantizeFloat (float a)
{
    return (unsigned char) (a * 256.0f);
}
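This will map uniform float values in [0, 1[ to uniform byte values in [0, 255]. All values in [i/256, (i+1)/256[ (that is, excluding (i+1)/256), for i in 0..255, are mapped to i. What might be undesirable is that 1.0f is mapped to 256.0f, which does not fit in an unsigned char. In practice this often wraps around to 0, but strictly speaking converting an out-of-range float to an integer type is undefined behavior in C.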
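The uniformity claim can be checked with a small experiment. The sketch below uses made-up names (QuantizeFloat256, BucketsAreUniform) and handles a >= 1.0f explicitly so the conversion never sees 256.0f. The test inputs k / 2^20 are chosen because they are exactly representable as floats, which makes the expected count per bucket exact.

```c
/* Multiply-by-256 mapping with the a >= 1.0f case handled explicitly,
   so the float-to-byte conversion never receives 256.0f. */
unsigned char QuantizeFloat256(float a)
{
    return (a >= 1.0f) ? 255 : (unsigned char)(a * 256.0f);
}

/* Hypothetical check: feed the inputs k / 2^20 for k = 0 .. 2^20 - 1
   (all below 1.0f and exactly representable) through the quantizer.
   Returns 1 if every byte value was produced exactly 4096 times. */
int BucketsAreUniform(void)
{
    unsigned long counts[256] = {0};
    for (unsigned long k = 0; k < (1UL << 20); ++k)
        counts[QuantizeFloat256((float)k / 1048576.0f)]++;
    for (int i = 0; i < 256; ++i)
        if (counts[i] != 4096)
            return 0;
    return 1;
}
```

The input 1.0f is the only value outside the half-open intervals, and here it simply joins the top bucket.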