How can I convert an integer to a half precision float (which is to be stored into an array unsigned char[2])? The range of the input int will be 1-65535. Precision is really not a concern.
I am doing something similar for converting a 16-bit int into an unsigned char[2], but I understand there is no half precision float C++ datatype. Example of this below:
int16_t position16int = (int16_t)data;
memcpy(&dataArray, &position16int, 2);
Once you've created a variable of a certain type, it is locked in as that type forever. So in your case, you created i as an int. You can't reassign i as a float after that. An int is always an int and will remain an int as long as it was declared as an int and will never be able to change into anything but an int.
A Half is a binary floating-point number that occupies 16 bits. With half as many bits as a float, a Half can represent finite values in the range ±65504. More formally, the Half type is defined as a base-2 16-bit interchange format meant to support the exchange of floating-point data between implementations.
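To make the 16-bit layout concrete, here is a minimal sketch of a decoder, assuming the usual IEEE 754 binary16 layout (1 sign bit, 5 exponent bits with bias 15, 10 fraction bits). The names half_bits_to_float and scale_by_pow2 are hypothetical helpers, not part of any library.

```c
#include <assert.h>
#include <math.h>   /* only for the NAN and INFINITY macros */

/* Multiply v by 2^k using exact power-of-two steps. */
static float scale_by_pow2(float v, int k)
{
    for (; k > 0; k--) v *= 2.0f;
    for (; k < 0; k++) v *= 0.5f;
    return v;
}

/* Decode a half-precision bit pattern (low 16 bits of h) into a float. */
float half_bits_to_float(unsigned h)
{
    assert(h <= 0xFFFFu);                  /* only 16 bits are meaningful */
    unsigned sign     = (h >> 15) & 1u;
    unsigned exponent = (h >> 10) & 0x1Fu; /* biased by 15 */
    unsigned fraction = h & 0x3FFu;        /* 10 stored bits */
    float value;

    if (exponent == 0)                     /* zero or subnormal: no implicit 1 */
        value = scale_by_pow2((float)fraction, -24);
    else if (exponent == 31)               /* infinity or NaN */
        value = fraction ? NAN : INFINITY;
    else                                   /* normal: restore the implicit 1 */
        value = scale_by_pow2((float)(fraction | 0x400u), (int)exponent - 25);

    return sign ? -value : value;
}
```

With this sketch, 0x3C00 decodes to 1.0 and 0x7BFF decodes to 65504, the largest finite value mentioned above.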
It's a very straightforward thing; all the info you need is in Wikipedia.
Sample implementation:
#include <stdio.h>

unsigned int2hfloat(int x)
{
    unsigned sign = x < 0;
    unsigned absx = ((unsigned)x ^ -sign) + sign; // safe abs(x)
    unsigned tmp = absx, manbits = 0;
    int exp = 0, truncated = 0;

    // calculate the number of bits needed for the mantissa
    while (tmp)
    {
        tmp >>= 1;
        manbits++;
    }

    // half-precision floats have 11 bits in the mantissa
    // (10 stored bits plus the implicit leading 1).
    // truncate the excess or insert the lacking 0s until there are 11.
    if (manbits)
    {
        exp = 10; // exp bias because 1.0 is at bit position 10
        while (manbits > 11)
        {
            truncated |= absx & 1;
            absx >>= 1;
            manbits--;
            exp++;
        }
        while (manbits < 11)
        {
            absx <<= 1;
            manbits++;
            exp--;
        }
    }

    // Overflow if the exponent is out of range, or if the value lies strictly
    // above 65504 (all mantissa bits set at the top exponent, plus nonzero
    // truncated bits). Testing exp + truncated > 15 instead would also turn
    // every inexact value in 32769..65503 into infinity.
    if (exp > 15 || (exp == 15 && absx == 0x7FF && truncated))
    {
        // absx was too big, force it to +/- infinity
        exp = 31; // special infinity value
        absx = 0;
    }
    else if (manbits)
    {
        // normal case, absx > 0
        exp += 15; // bias the exponent
    }

    return (sign << 15) | ((unsigned)exp << 10) | (absx & ((1u<<10)-1));
}

int main(void)
{
    printf(" 0: 0x%04X\n", int2hfloat(0));
    printf("-1: 0x%04X\n", int2hfloat(-1));
    printf("+1: 0x%04X\n", int2hfloat(+1));
    printf("-2: 0x%04X\n", int2hfloat(-2));
    printf("+2: 0x%04X\n", int2hfloat(+2));
    printf("-3: 0x%04X\n", int2hfloat(-3));
    printf("+3: 0x%04X\n", int2hfloat(+3));
    printf("-2047: 0x%04X\n", int2hfloat(-2047));
    printf("+2047: 0x%04X\n", int2hfloat(+2047));
    printf("-2048: 0x%04X\n", int2hfloat(-2048));
    printf("+2048: 0x%04X\n", int2hfloat(+2048));
    printf("-2049: 0x%04X\n", int2hfloat(-2049)); // first inexact integer
    printf("+2049: 0x%04X\n", int2hfloat(+2049));
    printf("-2050: 0x%04X\n", int2hfloat(-2050));
    printf("+2050: 0x%04X\n", int2hfloat(+2050));
    printf("-32752: 0x%04X\n", int2hfloat(-32752));
    printf("+32752: 0x%04X\n", int2hfloat(+32752));
    printf("-32768: 0x%04X\n", int2hfloat(-32768));
    printf("+32768: 0x%04X\n", int2hfloat(+32768));
    printf("-65504: 0x%04X\n", int2hfloat(-65504)); // legal maximum
    printf("+65504: 0x%04X\n", int2hfloat(+65504));
    printf("-65505: 0x%04X\n", int2hfloat(-65505)); // infinity from here on
    printf("+65505: 0x%04X\n", int2hfloat(+65505));
    printf("-65535: 0x%04X\n", int2hfloat(-65535));
    printf("+65535: 0x%04X\n", int2hfloat(+65535));
    return 0;
}
Output (ideone):
0: 0x0000
-1: 0xBC00
+1: 0x3C00
-2: 0xC000
+2: 0x4000
-3: 0xC200
+3: 0x4200
-2047: 0xE7FF
+2047: 0x67FF
-2048: 0xE800
+2048: 0x6800
-2049: 0xE800
+2049: 0x6800
-2050: 0xE801
+2050: 0x6801
-32752: 0xF7FF
+32752: 0x77FF
-32768: 0xF800
+32768: 0x7800
-65504: 0xFBFF
+65504: 0x7BFF
-65505: 0xFC00
+65505: 0x7C00
-65535: 0xFC00
+65535: 0x7C00
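The question asks for the result in an unsigned char[2]. A memcpy of a 16-bit value, as in the question, stores the bytes in whatever order the host uses. One possible alternative is to pick the byte order explicitly. The sketch below assumes least significant byte first; store_half_le and load_half_le are hypothetical helper names.

```c
#include <assert.h>

/* Store a 16-bit half-precision bit pattern into two bytes,
   least significant byte first (this byte order is an assumption). */
void store_half_le(unsigned half_bits, unsigned char out[2])
{
    assert(half_bits <= 0xFFFFu);   /* must fit in 16 bits */
    out[0] = (unsigned char)(half_bits & 0xFFu);
    out[1] = (unsigned char)(half_bits >> 8);
}

/* Inverse operation, for reading the bit pattern back. */
unsigned load_half_le(const unsigned char in[2])
{
    return (unsigned)in[0] | ((unsigned)in[1] << 8);
}
```

For example, storing the pattern 0x3C00 (which is +1 in the output above) gives the bytes 0x00, 0x3C regardless of the host's endianness.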
I previously asked how to convert a 32-bit float to a 16-bit float:
Float32 to Float16
So you could very easily convert the int to a float and then use the answers to that question to create a 16-bit float. I would suggest this is probably much easier than going from int directly to a 16-bit float. By converting to a 32-bit float you have effectively done most of the hard work, and then you just need to shift a few bits around.
Edit: Looking at Alexey's excellent answer, I think it's highly likely that using a hardware int-to-float conversion and then bit shifting the result is a fair bit faster than his method. It might be worth profiling both methods and comparing them.
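As a rough illustration of the int-to-float-then-shift idea, here is a minimal sketch restricted to the question's range. It assumes float is a 32-bit IEEE 754 binary32 value, and int_to_half_via_float is a hypothetical name. It truncates the extra fraction bits instead of rounding, so 65505..65535 come out as 65504 (0x7BFF), not infinity as in the sample output above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert an integer in 0..65535 to half-precision bits by going
   through a 32-bit float (assumed to be IEEE 754 binary32). */
uint16_t int_to_half_via_float(int value)
{
    assert(value >= 0 && value <= 65535);
    if (value == 0)
        return 0;                       /* zero has an all-zero pattern */

    float f = (float)value;             /* exact: at most 16 significant bits */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);     /* read the float's bit pattern */

    uint32_t exponent = (bits >> 23) & 0xFFu;  /* biased by 127 */
    uint32_t fraction = bits & 0x7FFFFFu;      /* 23 fraction bits */

    /* Rebias the exponent from 127 to 15 and keep the top 10 fraction
       bits. For inputs 1..65535 the rebiased exponent is 15..30, so it
       never collides with the zero/subnormal or infinity encodings. */
    uint32_t half_exponent = exponent - 127u + 15u;
    return (uint16_t)((half_exponent << 10) | (fraction >> 13));
}
```

Negative inputs, rounding to nearest, and values above 65535 are deliberately left out to keep the sketch short; a general float-to-half conversion has to handle all three.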