
Converting double to float without relying on the FPU rounding mode

Does anyone have a code snippet handy that converts an IEEE 754 double to the nearest float below it (resp. above it), without changing or assuming anything about the FPU's current rounding mode?

Note: this constraint probably implies not using the FPU at all. I expect the simplest way to do it in these conditions is to read the bits of the double in a 64-bit long and to work with that.

You can assume the endianness of your choice for simplicity, and that the double in question is available through the d field of the union below:

union double_bits
{
  long long i;  /* at least 64 bits; a plain long may be only 32 bits */
  double d;
};
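As a rough sketch of the "read the bits" step, the three IEEE 754 binary64 fields can be pulled out with integer operations only. The names double_fields and decode_double are hypothetical, and the code assumes that double is a 64-bit IEEE 754 value with the same byte order as uint64_t. It uses memcpy rather than the union, which is one common way to copy the bits without any floating-point arithmetic.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the fields of an IEEE 754 binary64 value. */
struct double_fields {
    int      negative;  /* sign bit: 1 for negative values (and -0.0) */
    int      exponent;  /* biased exponent, 0..2047 (0 = zero/denormal, 2047 = inf/NaN) */
    uint64_t mantissa;  /* low 52 bits, without the implicit leading 1 */
};

/* Assumption: sizeof(double) == 8 and the representation is IEEE 754 binary64. */
struct double_fields decode_double(double d)
{
    uint64_t bits;
    struct double_fields f;

    memcpy(&bits, &d, sizeof bits);            /* copy the bits, no FP arithmetic */
    f.negative = (int)(bits >> 63);
    f.exponent = (int)((bits >> 52) & 0x7FF);
    f.mantissa = bits & 0xFFFFFFFFFFFFFULL;    /* 13 hex digits = 52 bits */
    return f;
}
```

For example, decode_double(0.1) yields a biased exponent of 1019 (that is, 2^-4) and the mantissa 0x999999999999A, matching the %a form 0x1.999999999999ap-4.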

I would try to do it myself but I am certain I would introduce hard-to-notice bugs for denormalized or negative numbers.

Pascal Cuoq asked Jan 06 '10 09:01



1 Answer

I think the following works, but I will state my assumptions first:

  • floating-point numbers are stored in IEEE-754 format on your implementation,
  • the double value does not overflow the range of float,
  • nextafterf() is available (it is specified in C99).

Also, most likely, this method is not very efficient.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[])
{
    /* Change to non-zero for superior, otherwise inferior */
    int superior = 0;

    /* double value to convert */
    double d = 0.1;

    float f;
    double tmp;

    if (argc > 1)
        d = strtod(argv[1], NULL);

    /* First, get an approximation of the double value */
    f = d;

    /* Now, convert that back to double */
    tmp = f;

    /* Print the numbers. %a is C99 */
    printf("Double: %.20f (%a)\n", d, d);
    printf("Float: %.20f (%a)\n", f, f);
    printf("tmp: %.20f (%a)\n", tmp, tmp);

    if (superior) {
        /* If we wanted superior, and got a smaller value,
           get the next value */
        if (tmp < d)
            f = nextafterf(f, INFINITY);
    } else {
        /* If we wanted inferior, and got a larger value,
           get the previous value */
        if (tmp > d)
            f = nextafterf(f, -INFINITY);
    }
    printf("converted: %.20f (%a)\n", f, f);

    return 0;
}

On my machine, it prints:

Double: 0.10000000000000000555 (0x1.999999999999ap-4)
Float: 0.10000000149011611938 (0x1.99999ap-4)
tmp: 0.10000000149011611938 (0x1.99999ap-4)
converted: 0.09999999403953552246 (0x1.999998p-4)

The idea is to convert the double value to a float first. Depending on the rounding mode, the result may be less than or greater than the original double. Converting the float back to double is exact, so comparing it with the original value shows which side the float landed on. If the float is on the wrong side, nextafterf() steps to the adjacent float in the required direction.

Alok Singhal answered Sep 30 '22 09:09