 

Truncating a double to a float in C

This is a very simple question, but an important one, since it affects my whole project tremendously.

Suppose I have the following code snippet:

unsigned int x = 0xffffffff;
float f = (float)((double)x * (double)2.328306436538696e-010); //  x/2^32

I would expect f to be something like 0.99999, but instead it rounds up to 1, since that is the closest float approximation. That's not good, since I need float values in the interval [0,1), not [0,1]. I'm sure it's something simple, but I'd appreciate some help.

audiFanatic asked Aug 06 '13 16:08


3 Answers

In C (since C99), you can change the rounding direction with fesetround from libm (declared in <fenv.h>):

#include <stdio.h>
#include <fenv.h>
int main()
{
    #pragma STDC FENV_ACCESS ON
    fesetround(FE_DOWNWARD);
    // If the compiler does not honor FENV_ACCESS (e.g. GNU gcc), prefix the next line with volatile
    unsigned long x = 0xffffffff;
    float f = (float)((double)x * (double)2.328306436538696e-010); //  x/2^32
    printf("%.50f\n", f);
}

Tested with IBM XL, Sun Studio, clang, and GNU gcc. This gives me 0.99999994039535522460937500000000000000000000000000 in all cases.

Cubbi answered Sep 25 '22 21:09


The threshold at or above which a double rounds to 1 or more when converted to float in the default IEEE 754 rounding mode (round to nearest, ties to even) is 0x1.ffffffp-1 (in C99's hexadecimal notation, since your question is tagged “C”).

Your options are:

  1. set the FPU rounding mode to round-downward before the conversion, or
  2. multiply by (0x1.ffffffp-1 / 0xffffffffp0) (give or take one ULP) to exploit the full single-precision range [0, 1) without getting the value 1.0f.

Method 2 leads to the constant 0x1.ffffff01fffffp-33:

double factor = nextafter(0x1.ffffffp-1 / 0xffffffffp0, 0.0);
unsigned int x = 0xffffffff;
float f = (float)((double)x * factor);
printf("factor:%a\nunrounded:%a\nresult:%a\n", factor, (double)x * factor, f);

Prints:

factor:0x1.ffffff01fffffp-33
unrounded:0x1.fffffefffffffp-1
result:0x1.fffffep-1
Pascal Cuoq answered Sep 24 '22 21:09


You could just truncate the value to maximum precision (keeping the 24 high bits) and divide by 2^24 to get the closest value a float can represent without being rounded to 1:

unsigned int i = 0xffffffff;
float value = (float)(i>>8)/(1<<24);

printf("%.20f\n", value);
printf("%a\n", value);

>>> 0.99999994039535522461
>>> 0x1.fffffep-1
Joachim Isaksson answered Sep 24 '22 21:09