
Accuracy of floating point arithmetic

I'm having trouble understanding the output of this program

#include <stdio.h>

int main(void)
{
    double x = 1.8939201459282359e-308; /* a denormal (subnormal) double  */
    double y = 4.9406564584124654e-324; /* the smallest positive double   */
    printf("%23.16e\n", 1.6*y);
    printf("%23.16e\n", 1.7*y);
    printf("%23.16e\n", 1.8*y);
    printf("%23.16e\n", 1.9*y);
    printf("%23.16e\n", 2.0*y);
    printf("%23.16e\n", x + 1.6*y);
    printf("%23.16e\n", x + 1.7*y);
    printf("%23.16e\n", x + 1.8*y);
    printf("%23.16e\n", x + 1.9*y);
    printf("%23.16e\n", x + 2.0*y);
    return 0;
}

The output is

9.8813129168249309e-324
9.8813129168249309e-324
9.8813129168249309e-324
9.8813129168249309e-324
9.8813129168249309e-324
1.8939201459282364e-308
1.8939201459282364e-308
1.8939201459282369e-308
1.8939201459282369e-308
1.8939201459282369e-308

I'm using IEEE arithmetic. The variable y holds the smallest positive IEEE double (the smallest denormal). The first five printfs show a number that is twice y, as I would expect. What is confusing me is that the next five printfs show different numbers. If 1.6*y is the same as 2.0*y, then how can x + 1.6*y be different from x + 2.0*y?

asked Mar 15 '13 by john


1 Answer

In a nutshell

You say that your compiler is Visual C++ 2010 Express. I do not have access to this compiler, but I understand that it generates programs that initially configure the x87 FPU to use 53 bits of precision, in order to emulate IEEE 754 double-precision computations as closely as possible.
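
For what it's worth, that startup configuration can be inspected or reproduced explicitly with the _controlfp_s function from <float.h>. Below is a minimal sketch, assuming a 32-bit x86 MSVC build (the precision-control field is not available through _controlfp_s on x64):

#include <float.h>
#include <stdio.h>

int main(void)
{
    unsigned int cw;

    /* Read the current control word without modifying anything. */
    _controlfp_s(&cw, 0, 0);
    printf("53-bit significand? %s\n",
           (cw & _MCW_PC) == _PC_53 ? "yes" : "no");

    /* Or set 53-bit precision explicitly, as the VC++ startup code does. */
    _controlfp_s(&cw, _PC_53, _MCW_PC);
    return 0;
}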

Unfortunately, “as closely as possible” is not always close enough. The historical 80-bit x87 registers can have their significand limited to 53 bits for the purpose of emulating double precision, but they always retain the full 15-bit exponent range. The difference shows in particular when manipulating denormals (like your y).
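
The following sketch makes the asymmetry concrete. It assumes C11 (for DBL_TRUE_MIN) and a compiler on which long double is the x87 80-bit format, such as GCC on x86:

#include <float.h>
#include <stdio.h>

int main(void)
{
    printf("DBL_MIN      = %e\n",  DBL_MIN);      /* ~2.2e-308, smallest normal double   */
    printf("DBL_TRUE_MIN = %e\n",  DBL_TRUE_MIN); /* ~4.9e-324, smallest denormal double */
    printf("LDBL_MIN     = %Le\n", LDBL_MIN);     /* ~3.4e-4932 for the 80-bit format    */
    return 0;
}

Both your x and your y are denormal as doubles, yet both sit comfortably inside the normal range of an 80-bit full-exponent format.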

What happens

My explanation would be that in printf("%23.16e\n", 1.6*y);, 1.6*y is computed as an 80-bit reduced-significand, full-exponent number (about 7.9e-324, hence a normal number), then converted to IEEE 754 double precision (which rounds it to the denormal 9.8813…e-324, i.e. 2*y), then printed.

On the other hand, in printf("%23.16e\n", x + 1.6*y);, x + 1.6*y is computed entirely with 80-bit reduced-significand, full-exponent numbers (again, all intermediate results are normal numbers), then converted to IEEE 754 double precision, then printed. The intermediate product 1.6*y is therefore never rounded to a double-precision denormal before being added to x.

This would explain why 1.6*y prints the same as 2.0*y but has a different effect when added to x. The number that is printed is a double-precision denormal. The number that is added to x is an 80-bit reduced-significand, full-exponent normal number (not the same value).
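
This behavior can be reproduced outside of Visual C++. The sketch below is hypothetical and not tested on your exact setup: it assumes glibc's x86-only <fpu_control.h> and a compiler that uses x87 instructions for long double (e.g. GCC on x86), and it should be built without optimization so the operations really happen at run time:

#include <stdio.h>
#include <fpu_control.h>                  /* glibc-specific, x86 only */

int main(void)
{
    /* Limit the x87 significand to 53 bits, as the VC++ startup code
       does. The exponent keeps its full 15-bit range. */
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
    _FPU_SETCW(cw);

    double x = 1.8939201459282359e-308;
    double y = 4.9406564584124654e-324;

    long double p = 1.6 * (long double)y; /* normal 80-bit number, ~7.9e-324    */
    printf("%23.16e\n", (double)p);       /* converted: prints the denormal 2*y */
    printf("%23.16e\n", (double)(x + p)); /* sum rounded to double only once    */
    return 0;
}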

What happens with other compilers when generating x87 instructions

Other compilers, like GCC, do not configure the x87 FPU to manipulate 53-bit significands. This can have the same consequences (in this case x + 1.6*y would be computed with 80-bit full-significand, full-exponent numbers, and only converted to double precision when printed or stored to memory). In that configuration the issue is noticeable even more often: you do not need to involve denormals or infinities to observe differences.
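
Here is a hedged sketch of that more common symptom. With 32-bit x87 code generation (e.g. gcc -m32 -mfpmath=387 -O0) it can print 0, because the freshly computed quotient still carries excess precision; with SSE2 code generation it prints 1:

#include <stdio.h>

int main(void)
{
    volatile double n = 1.0;  /* volatile blocks compile-time constant folding */
    double a = n / 3.0;       /* rounded to a 53-bit significand when stored   */

    /* The second n / 3.0 may stay in an 80-bit register, so the
       comparison can be false under x87 code generation. */
    printf("%d\n", a == n / 3.0);
    return 0;
}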

This article by David Monniaux contains all the details you may wish for and more.

Removing the unwanted behavior

To get rid of the problem (if you consider it to be one), find the flag that tells your compiler to generate SSE2 instructions for floating-point; these implement IEEE 754 single- and double-precision semantics exactly.
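
For instance (flag spellings taken from the respective compilers' documentation; they may vary between versions):

cl /arch:SSE2 prog.c                (Visual C++, 32-bit x86 target)
gcc -msse2 -mfpmath=sse prog.c      (GCC, 32-bit x86 target)

On 64-bit x86 targets, both compilers use SSE2 for floating-point by default, so the problem does not normally arise there.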

answered Sep 24 '22 by Pascal Cuoq