Understanding the maximum values that can be stored in floats in C

Question

I have come across some behaviour of the float type in C that I do not understand, and was hoping someone could explain it. Using the macros defined in float.h, I can determine the maximum and minimum values that the datatype can store on the given hardware. However, when performing a calculation that should not exceed these limits, I find that a float variable fails where a double succeeds. The following is a minimal example, which compiles on my machine.

#include <stdio.h>
#include <stdlib.h>
#include <float.h>

int main(int argc, char **argv)
{
    int gridsize;
    long gridsize3;

    float *datagrid;

    float sumval_f;
    double sumval_d;

    long i;

    gridsize = 512;
    gridsize3 = (long)gridsize*gridsize*gridsize;

    datagrid = calloc(gridsize3, sizeof(float));
    if(datagrid == NULL)
    {
        fprintf(stderr, "Memory allocation failed\n");
        exit(EXIT_FAILURE);
    }

    for(i=0; i<gridsize3; i++)
    {
        datagrid[i] += 1.0;
    }

    sumval_f = 0.0;
    sumval_d = 0.0;
    for(i=0; i<gridsize3; i++)
    {
        sumval_f += datagrid[i];
        sumval_d += (double)datagrid[i];
    }

    printf("\ngridsize3 = %e\n", (float)gridsize3);
    printf("FLT_MIN = %e\n", FLT_MIN);
    printf("FLT_MAX = %e\n", FLT_MAX);
    printf("DBL_MIN = %e\n", DBL_MIN);
    printf("DBL_MAX = %e\n", DBL_MAX);

    printf("\nfloat sum = %f\n", sumval_f);
    printf("double sum = %lf\n", sumval_d);
    printf("sumval_d/sumval_f = %f\n\n", sumval_d/(double)sumval_f);

    free(datagrid);
    return(0);
}

Compiling with gcc I find the output:

gridsize3 = 1.342177e+08
FLT_MIN = 1.175494e-38
FLT_MAX = 3.402823e+38
DBL_MIN = 2.225074e-308
DBL_MAX = 1.797693e+308

float sum = 16777216.000000
double sum = 134217728.000000
sumval_d/sumval_f = 8.000000

When compiling with icc instead, sumval_f = 67108864.0 and hence the final ratio is 2.0*. Note that in both cases the float sum is incorrect, whilst the double sum is correct.

As far as I can tell the output of FLT_MAX suggests that the sum should fit into a float, and yet it seems to plateau out at either an eighth or a half of the full value.

Is there a compiler specific override to the values found using float.h? Why is a double required to correctly find the sum of this array?

* Interestingly, including an if statement inside the for loop that prints values of the array causes the icc result to match the gcc output, i.e. an eighth of the correct sum rather than a half.

dbush · Accepted Answer

The problem here isn't the range of values but the precision.

Assuming a 32-bit IEEE 754 float, this datatype has 24 bits of precision. This means that not every integer larger than 16777216 (2^24) can be represented exactly.

So when your sum reaches 16777216, adding 1 gives 16777217, which a float cannot represent. The result is rounded back to 16777216, so the number doesn't get any bigger.

A (presumably) 64-bit double has 53 bits of precision. This is enough bits to hold all integer values up to your sum of 134217728, so it gives you an accurate result.

supercat · Answer

A float can precisely represent any integer between -16777215 and +16777215, inclusive. It can also represent all even integers between -2*16777215 and +2*16777215 (including +/- 2*8388608, i.e. 16777216), all multiples of 4 between -4*16777215 and +4*16777215, and likewise for all power-of-two scaling factors up to 2^104 (roughly 2.028E+31). Additionally, it can represent multiples of 1/2 from -16777215/2 to +16777215/2, multiples of 1/4 from -16777215/4 to +16777215/4, etc. down to multiples of 1/2^149 from -16777215/(2^149) to +16777215/(2^149).

Tags: c, floating-point
