Converting double to float without relying on the FPU rounding mode

Tags: c, floating-point, bit-manipulation, ieee-754

Does anyone have a snippet of code handy that converts an IEEE 754 double to the immediately inferior (resp. superior) float, i.e. the largest float not greater than it (resp. the smallest float not less than it), without changing or assuming anything about the FPU's current rounding mode?

Note: this constraint probably implies not using the FPU at all. I expect the simplest way to do it in these conditions is to read the bits of the double in a 64-bit long and to work with that.

You can assume the endianness of your choice for simplicity, and that the double in question is available through the d field of the union below:

union double_bits
{
  long i;
  double d;
};

I would try to do it myself but I am certain I would introduce hard-to-notice bugs for denormalized or negative numbers.
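
For reference, the bit-level view described above can be sketched as follows. This is only an illustration, not part of the original question: the names double_fields and split_double are hypothetical, and the code assumes that double is a 64-bit IEEE 754 binary64 value with the same byte order as integers. It uses uint64_t rather than long, because long is only 32 bits wide on some platforms.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of an IEEE 754 binary64 value. */
struct double_fields {
    unsigned sign;      /* 1 bit */
    unsigned exponent;  /* 11 bits, biased by 1023 */
    uint64_t mantissa;  /* 52 bits, without the implicit leading 1 */
};

/* Split a double into its raw fields (assumes a 64-bit binary64 double). */
struct double_fields split_double(double d)
{
    uint64_t bits;
    struct double_fields f;

    memcpy(&bits, &d, sizeof bits);   /* copy the raw bits, no arithmetic */
    f.sign = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> 52) & 0x7FF);
    f.mantissa = bits & ((UINT64_C(1) << 52) - 1);
    return f;
}
```

An exponent field of 0 marks zero or a denormalized number, and 0x7FF marks infinity or NaN, which are the cases any bit-level conversion has to treat separately.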


asked Jan 06 '10 09:01

Pascal Cuoq

1 Answer

I think the following works, but I will state my assumptions first:

Floating-point numbers are stored in IEEE 754 format on your implementation.
There is no overflow: the double value lies within the finite range of float.
nextafterf() is available (it is specified in C99).

Also, most likely, this method is not very efficient.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[])
{
    /* Change to non-zero for superior, otherwise inferior */
    int superior = 0;

    /* double value to convert */
    double d = 0.1;

    float f;
    double tmp = d;

    if (argc > 1)
        d = strtod(argv[1], NULL);

    /* First, get an approximation of the double value */
    f = d;

    /* Now, convert that back to double */
    tmp = f;

    /* Print the numbers. %a is C99 */
    printf("Double: %.20f (%a)\n", d, d);
    printf("Float: %.20f (%a)\n", f, f);
    printf("tmp: %.20f (%a)\n", tmp, tmp);

    if (superior) {
        /* If we wanted superior, and got a smaller value,
           get the next value */
        if (tmp < d)
            f = nextafterf(f, INFINITY);
    } else {
        /* If we wanted inferior, and got a larger value,
           get the previous value */
        if (tmp > d)
            f = nextafterf(f, -INFINITY);
    }
    printf("converted: %.20f (%a)\n", f, f);

    return 0;
}

On my machine, it prints:

Double: 0.10000000000000000555 (0x1.999999999999ap-4)
Float: 0.10000000149011611938 (0x1.99999ap-4)
tmp: 0.10000000149011611938 (0x1.99999ap-4)
converted: 0.09999999403953552246 (0x1.999998p-4)

The idea is to first convert the double value to a float. Depending on the rounding mode, the result may be less than or greater than the original value. Converting the float back to double is exact, so comparing it with the original value tells us which way the conversion rounded. If the float lies on the wrong side of the original value, nextafterf() steps to the adjacent float in the required direction.
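
The question asks for a version that does not use the FPU at all. The following is a rough sketch of that alternative, working only on bit patterns with integer operations. It is a different technique from the nextafterf() method above, not part of the original answer, and the function name double_bits_to_float_bits is hypothetical. It assumes IEEE 754 binary64 and binary32 layouts, and unlike the code above it also covers overflow, denormalized numbers, infinities and NaN.

```c
#include <stdint.h>

/* Hypothetical helper: map the bit pattern of a double to the bit pattern
   of a float, using integer operations only.
   round_up != 0: smallest float >= the double (superior).
   round_up == 0: largest float <= the double (inferior). */
uint32_t double_bits_to_float_bits(uint64_t b, int round_up)
{
    uint32_t sign = (uint32_t)(b >> 63);
    int exp = (int)((b >> 52) & 0x7FF);        /* biased double exponent */
    uint64_t mant = b & ((UINT64_C(1) << 52) - 1);
    uint32_t m;      /* float magnitude bits, truncated toward zero */
    int inexact;     /* nonzero if the truncation discarded anything */

    if (exp == 0x7FF) {                        /* infinity or NaN */
        m = 0x7F800000u;
        if (mant != 0)                         /* keep it a (quiet) NaN */
            m |= 0x00400000u | (uint32_t)(mant >> 29);
        return (sign << 31) | m;
    }
    if (exp == 0) {                            /* zero or double denormal */
        m = 0;
        inexact = (mant != 0);
    } else if (exp - 896 >= 255) {             /* beyond the finite float range */
        m = 0x7F7FFFFFu;                       /* largest finite float */
        inexact = 1;
    } else if (exp - 896 >= 1) {               /* normal float; 896 = 1023 - 127 */
        m = ((uint32_t)(exp - 896) << 23) | (uint32_t)(mant >> 29);
        inexact = (mant & ((UINT64_C(1) << 29) - 1)) != 0;
    } else {                                   /* float denormal or underflow */
        uint64_t sig = (UINT64_C(1) << 52) | mant;  /* add the implicit 1 */
        int shift = 926 - exp;                 /* units of 2^-149; >= 30 here */
        if (shift > 53) {                      /* below half the smallest denormal */
            m = 0;
            inexact = 1;
        } else {
            m = (uint32_t)(sig >> shift);
            inexact = (sig & ((UINT64_C(1) << shift) - 1)) != 0;
        }
    }

    /* Truncation rounds toward zero. Step the magnitude up by one when the
       requested direction points away from zero. Float bit patterns are
       ordered by magnitude, so +1 also carries into the next exponent or
       into infinity. */
    if (inexact && (round_up ? !sign : sign))
        m += 1;
    return (sign << 31) | m;
}
```

The bit patterns can be moved in and out of double and float objects with memcpy() or a union. For the 0.1 example above, the input pattern 0x3FB999999999999A gives 0x3DCCCCCC (0x1.999998p-4) for inferior and 0x3DCCCCCD (0x1.99999ap-4) for superior.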


answered Sep 30 '22 09:09

Alok Singhal
