I have come across some behaviour with the float
type in C that I do not understand, and was hoping might be explained. Using the macros defined in float.h
I can determine the maximum/minimum values that the datatype can store on the given hardware. However when performing a calculation that should not exceed these limits, I find that a typed float
variable fails where a double
succeeds.
The following is a minimal example, which compiles on my machine.
#include <stdio.h>
#include <stdlib.h>
#include <float.h>
int main(int argc, char **argv)
{
int gridsize;
long gridsize3;
float *datagrid;
float sumval_f;
double sumval_d;
long i;
gridsize = 512;
gridsize3 = (long)gridsize*gridsize*gridsize;
datagrid = calloc(gridsize3, sizeof(float));
if(datagrid == NULL)
{
free(datagrid);
printf("Memory allocation failed\n");
exit(0);
}
for(i=0; i<gridsize3; i++)
{
datagrid[i] += 1.0;
}
sumval_f = 0.0;
sumval_d = 0.0;
for(i=0; i<gridsize3; i++)
{
sumval_f += datagrid[i];
sumval_d += (double)datagrid[i];
}
printf("\ngridsize3 = %e\n", (float)gridsize3);
printf("FLT_MIN = %e\n", FLT_MIN);
printf("FLT_MAX = %e\n", FLT_MAX);
printf("DBL_MIN = %e\n", DBL_MIN);
printf("DBL_MAX = %e\n", DBL_MAX);
printf("\nfloat sum = %f\n", sumval_f);
printf("double sum = %lf\n", sumval_d);
printf("sumval_d/sumval_f = %f\n\n", sumval_d/(double)sumval_f);
free(datagrid);
return(0);
}
Compiling with gcc
I find the output:
gridsize3 = 1.342177e+08
FLT_MIN = 1.175494e-38
FLT_MAX = 3.402823e+38
DBL_MIN = 2.225074e-308
DBL_MAX = 1.797693e+308
float sum = 16777216.000000
double sum = 134217728.000000
sumval_d/sumval_f = 8.000000
Whilst compiling with icc
the sumval_f = 67108864.0
and hence the final ratio is instead 2.0*. Note that the float
sum is incorrect, whilst the double
sum is correct.
As far as I can tell the output of FLT_MAX
suggests that the sum should fit into a float
, and yet it seems to plateau out at either an eighth or a half of the full value.
Is there a compiler specific override to the values found using float.h
?
Why is a double
required to correctly find the sum of this array?
*Interestingly the inclusion of an if statement inside the for loop that prints values of the array causes the value to match the gcc output, i.e. an eighth of the correct sum, rather than a half.
The problem here isn't the range of values but the precision.
Assuming a 32-bit IEEE754 float
, this datatype has a maximum of 24 bits of precision. This means that not all integers larger than 16777216 can be represented exactly.
So when your sum reaches 16777216, adding 1 to it is outside the precision of what the datatype can store, so the number doesn't get any bigger.
A (presumably) 64-bit double
has 53 bits of precision. This is enough bits to hold all integer values up to your sum of 134217728, so it gives you an accurate result.
A float
can precisely represent any integer between -16777215 and +16777215, inclusive. It can also represent all even integers between -2*16777215 and +2*16777215 (including +/- 2*8388608, i.e. 16777216), all multiples of 4 between -4*16777215 and +4*16777215, and likewise for all power-of-two scaling factors up to 2^104 (roughly 2.028E+31). Additionally, it can represent multiples of 1/2 from -16777215/2 to +16777215/2, multiples of 1/4 from -16777215/4 to +16777215/4, etc. down to multiples of 1/2^149 from -167777215/(2^149) to +16777215/(2^149).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With