 

Floating point accuracy again

Yesterday I asked a question about why I was losing accuracy in floating point arithmetic. I received an answer explaining that it was due to intermediate results being held in x87 registers. This was helpful, but some of the details are still escaping me. Here's a variation of the program I presented in the previous question; I'm using VC++ 2010 Express in debug mode.

#include <stdio.h>

int main()
{
    double x = 1.8939201459282359e-308;    /* subnormal number */
    double tiny = 4.9406564584124654e-324; /* smallest IEEE double */
    double scale = 1.6;
    double temp = scale*tiny;              /* product stored to a double first */
    printf("%23.16e\n", x + temp);
    printf("%23.16e\n", x + scale*tiny);   /* product used at register precision */
    return 0;
}

This outputs

1.8939201459282369e-308
1.8939201459282364e-308

The first value is correct according to the IEEE standard. Giving the scale variable a value of 2.0 gives the correct value for both computations. I understand that temp in the first calculation is a subnormal value and therefore loses precision. I also understand that the value of scale*tiny is held in an x87 register, which has a greater exponent range, so this value has more precision than temp. What I don't understand is why, when adding the value to x, we get the correct answer from the lower precision value. Surely if a lower precision value can give the correct answer, then a higher precision value should also give the correct answer? Is this something to do with 'double rounding'?

Thanks in advance, this is a whole new subject for me, so I'm struggling a little.

asked Mar 16 '13 by john

1 Answer

The point is that due to the larger exponent range, the two numbers are not subnormal in the x87 representation.

In the IEEE754 representation,

x    = 0.d9e66553db96f × 2^(-1022)
tiny = 0.0000000000001 × 2^(-1022)

but in the x87 representation,

x    = 1.b3cccaa7b72de × 2^(-1023)
tiny = 1.0000000000000 × 2^(-1074)
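
These representations can be inspected directly on a compiler and C library with C99 %a support, such as gcc or clang with glibc; VC++ 2010's printf does not support %a. A minimal sketch (the expected output in the comments is glibc's formatting, which prints subnormals with the fixed minimum exponent -1022):

#include <stdio.h>

int main()
{
    double x     = 1.8939201459282359e-308;
    double tiny  = 4.9406564584124654e-324;
    double scale = 1.6;

    printf("x     = %a\n", x);     /* 0x0.d9e66553db96fp-1022 */
    printf("tiny  = %a\n", tiny);  /* 0x0.0000000000001p-1022 */
    printf("scale = %a\n", scale); /* 0x1.999999999999ap+0 */
    return 0;
}

Note the significand of scale, 1.999999999999a, which reappears below.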

Now, when 1.6*tiny is computed in the IEEE754 representation, it is rounded to 0.0000000000002 × 2^(-1022) since that is the closest representable number to the mathematical result. Adding that to x results in

  0.d9e66553db96f × 2^(-1022)
+ 0.0000000000002 × 2^(-1022)
-----------------------------
  0.d9e66553db971 × 2^(-1022)
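
This two-ulp rounding can be verified with a small sketch; the volatile qualifier only forces the product to be stored as a 64-bit double, which the asker's debug build does anyway:

#include <stdio.h>

int main()
{
    volatile double tiny  = 4.9406564584124654e-324;
    volatile double scale = 1.6;

    /* Storing the product rounds it to double precision; 1.6 units
       in the last place round up to 2 units: */
    volatile double temp = scale * tiny;

    printf("temp == 2*tiny: %d\n", temp == 2 * tiny); /* prints 1 */
    return 0;
}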

But in the x87 representation, 1.6*tiny becomes

1.999999999999a × 2^(-1074)

and when that is added

  1.b3cccaa7b72de × 2^(-1023)
+ 0.0000000000003333333333334 × 2^(-1023)
-----------------------------------------
  1.b3cccaa7b72e1333333333334 × 2^(-1023)

the result rounded to 53 significant bits is

  1.b3cccaa7b72e1 × 2^(-1023)

with the last significand bit equal to 1. If that is then converted to the IEEE754 representation (where, being a subnormal number, it can have at most 52 bits in the significand), it lies exactly halfway between the two adjacent representable numbers 0.d9e66553db970 × 2^(-1022) and 0.d9e66553db971 × 2^(-1022), so under the default round-to-nearest-even rule it is rounded to the one whose last significand bit is zero.
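
This halfway case can be reproduced directly in a sketch, assuming gcc or clang on x86, where long double is the 80-bit x87 type and C99 hex-float literals are accepted:

#include <stdio.h>

int main()
{
    /* The 53-bit intermediate result from above; representable as a
       long double thanks to the wider exponent range: */
    long double v = 0x1.b3cccaa7b72e1p-1023L;

    /* Converting to double squeezes it into a 52-bit subnormal; it
       falls exactly halfway between two doubles and is rounded to
       the one with even last bit: */
    double d = (double)v;

    printf("%23.16e\n", d); /* 1.8939201459282364e-308 */
    return 0;
}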

Note that if the FPU were not configured to use only 53 bits for the significand, but the full 64 of the x87 extended precision type, the result of the addition would be closer to the IEEE754 result 0.d9e66553db971 × 2^(-1022) and hence rounded to that.
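
With 32-bit MSVC this can be tried via the documented _controlfp_s function in <float.h>; a sketch (the precision-control mask has no effect on x64, which uses SSE2 arithmetic):

#include <float.h>
#include <stdio.h>

int main()
{
    double x     = 1.8939201459282359e-308;
    double tiny  = 4.9406564584124654e-324;
    double scale = 1.6;
    unsigned int prev;

    /* The VC++ x86 CRT starts with a 53-bit significand (_PC_53);
       switch the x87 precision control to the full 64 bits: */
    _controlfp_s(&prev, _PC_64, _MCW_PC);

    /* Now the intermediate sum carries enough bits that the final
       store rounds to the IEEE754 result: */
    printf("%23.16e\n", x + scale * tiny); /* 1.8939201459282369e-308 */
    return 0;
}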

Effectively, since the x87 representation has a larger exponent range, IEEE754-subnormal numbers keep more significand bits there than in the IEEE754 representation, even with the number of significand bits restricted to 53. Thus the result of the computation has one more significant bit in x87 than in IEEE754 here.

answered Nov 09 '22 by Daniel Fischer