I am working with an 8-bit AVR chip. There is no data type for a 64-bit double (double just maps to the 32-bit float). However, I will be receiving 64-bit doubles over Serial and need to output 64-bit doubles over Serial.
How can I convert the 64-bit double to a 32-bit float and back again without casting? The format for both the 32-bit and 64-bit will follow IEEE 754. Of course, I assume a loss of precision when converting to the 32-bit float.
For converting from 64-bit to 32-bit float, I am trying this out:
// Script originally from http://www.arduino.cc/cgi-bin/yabb2/YaBB.pl?num=1281990303
float convert(uint8_t *in) {           /* in[0] is the least significant byte of the double */
    union {
        float real;
        uint8_t base[4];
    } u;
    uint16_t expd = ((in[7] & 127) << 4) + ((in[6] & 240) >> 4);  /* 11-bit biased exponent */
    uint16_t expf = expd ? (expd - 1024) + 128 : 0;               /* rebias 1023 -> 127 */
    u.base[3] = (in[7] & 128) + (expf >> 1);                      /* sign + top 7 exponent bits */
    u.base[2] = ((expf & 1) << 7) + ((in[6] & 15) << 3) + ((in[5] & 0xe0) >> 5);  /* low exponent bit + top fraction bits */
    u.base[1] = ((in[5] & 0x1f) << 3) + ((in[4] & 0xe0) >> 5);
    u.base[0] = ((in[4] & 0x1f) << 3) + ((in[3] & 0xe0) >> 5);    /* remaining fraction bits; the rest are simply dropped */
    return u.real;
}
For numbers like 1.0 and 2.0, the above works, but when I tested with passing in a 1.1 as a 64-bit double, the output was off by a bit (literally, not a pun!), though this could be an issue with my testing. See:
// Comparison of the bits for a float in Java and the bits for a float in C after
// being converted from a 64-bit double. The last bit is different.
// Java code can be found at https://gist.github.com/912636
JAVA FLOAT:        00111111 10001100 11001100 11001101
C CONVERTED FLOAT: 00111111 10001100 11001100 11001100
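For what it's worth, here is a minimal host-side test harness (my own sketch, not part of the original question) that reproduces the comparison above. It assumes a little-endian machine such as x86, so the double's bytes come out least significant first, matching the indexing convert() expects, and it uses the compiler's own double-to-float cast as the reference value.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

float convert(uint8_t *in);             /* the function from the question above */

static void print_bits(uint32_t v)      /* print 32 bits, grouped by byte */
{
    for (int i = 31; i >= 0; i--) {
        putchar(((v >> i) & 1) ? '1' : '0');
        if (i % 8 == 0 && i != 0)
            putchar(' ');
    }
    putchar('\n');
}

int main(void)
{
    double d = 1.1;
    uint8_t bytes[8];
    memcpy(bytes, &d, sizeof bytes);    /* little-endian host: bytes[0] is the LSB */

    float truncated = convert(bytes);   /* the byte-shuffling conversion above */
    float reference = (float)d;         /* what Java / a desktop compiler produce */

    uint32_t t, r;
    memcpy(&t, &truncated, sizeof t);
    memcpy(&r, &reference, sizeof r);
    print_bits(r);                      /* 00111111 10001100 11001100 11001101 */
    print_bits(t);                      /* 00111111 10001100 11001100 11001100 */
    return 0;
}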
Some background on the two formats involved. IEEE 754 double precision (binary64) encodes a floating-point number in 64 bits (8 bytes): 1 sign bit, 11 exponent bits (biased by 1023), and a 52-bit fraction field, giving 53 significant bits once the implicit leading 1 is counted. It is the usual format on PCs because of its wider range and precision compared to single precision, in spite of its bandwidth and performance cost, and it is commonly known simply as double. Single precision (binary32) uses 32 bits: 1 sign bit, 8 exponent bits (biased by 127), and a 23-bit fraction field, for 24 significant bits; half precision uses just 16 bits. Narrowing a binary64 to a binary32 therefore means rebiasing the exponent and squeezing 53 significant bits into 24, and what happens to the discarded bits is where rounding comes in.
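As a concrete instance (my own worked example, not from the original posts), the value 1.1 encodes as:

1.1 as binary64: 0x3FF199999999999A
    sign = 0, exponent = 0x3FF (1023, i.e. 2^0), fraction = 0x199999999999A
1.1 as binary32, correctly rounded: 0x3F8CCCCD
    sign = 0, exponent = 0x7F (127, i.e. 2^0), fraction = 0x0CCCCD

Fitting that repeating 1001/1100 bit pattern into a 24-bit significand is exactly what the rounding discussion below is about.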
IEEE 754 specifies five different rounding modes, but the default is round to nearest, ties to even (round half to even). So you have a mantissa of the form 10001100 11001100 11001100 11001100... and you have to round it to 24 bits. Numbering the bits from 0 (most significant), bit 24 is 1; but that alone is not enough to tell you whether to round bit 23 up or not. If all the remaining bits were 0, you would not round up, because bit 23 is 0 (even). But the remaining bits are not all zero, so you do round up. (A conversion sketch that applies this rule follows the examples below.)
Some examples:
10001100 11001100 11001100 10000000...(all zero) doesn't round up, because bit 23 is already even.
10001100 11001100 11001101 10000000...(all zero) does round up, because bit 23 is odd.
10001100 11001100 1100110x 10000000...0001 always rounds up, because the remaining bits are not all zero.
10001100 11001100 1100110x 0xxxxxxx... never rounds up, because bit 24 is zero.
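To make the rule concrete, here is a sketch of my own (not from the posts above) of a double-to-float conversion that rounds to nearest, ties to even, instead of truncating. It assumes the same least-significant-byte-first input order as convert() in the question, handles normal numbers only (no subnormals, infinities or NaNs, and no exponent overflow/underflow), and uses 64-bit integer arithmetic, which avr-gcc provides as uint64_t at some cost in code size and speed.

#include <stdint.h>
#include <string.h>

/* Sketch: narrow 8 little-endian bytes of an IEEE 754 binary64 to a binary32,
   rounding the significand to nearest, ties to even. Normal numbers only. */
float convert_rounded(const uint8_t *in)
{
    uint64_t d = 0;
    for (int i = 7; i >= 0; i--)                    /* in[0] is the least significant byte */
        d = (d << 8) | in[i];

    uint32_t sign = (uint32_t)(d >> 63);
    uint16_t expd = (uint16_t)((d >> 52) & 0x7FF);  /* exponent, biased by 1023 */
    uint64_t mant = d & 0x000FFFFFFFFFFFFFULL;      /* 52 fraction bits */

    uint16_t expf = expd ? expd - 1023 + 127 : 0;   /* rebias to 127 */
    uint32_t kept = (uint32_t)(mant >> 29);         /* top 23 fraction bits */
    uint64_t rest = mant & 0x1FFFFFFFULL;           /* 29 bits being dropped */
    uint64_t half = 1ULL << 28;

    /* Round to nearest, ties to even: bump the kept bits if the dropped part
       is more than half a unit, or exactly half and the kept part is odd. */
    if (rest > half || (rest == half && (kept & 1u)))
        kept++;
    if (kept >> 23) {                               /* rounding carried out of the 23 bits */
        kept = 0;
        expf++;
    }

    uint32_t bits = (sign << 31) | ((uint32_t)expf << 23) | kept;
    float out;
    memcpy(&out, &bits, sizeof out);
    return out;
}

Fed the bytes of 1.1 (0x3FF199999999999A, least significant byte first), this returns the 0x3F8CCCCD pattern shown for the Java float above. For the other direction, sending a 32-bit float back out as a 64-bit double, the widening is exact for normal numbers: rebias the 8-bit exponent from 127 to 1023 and place the 23 fraction bits at the top of the 52-bit fraction field (a left shift by 29), leaving the remaining bits zero.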