
Properties of 80-bit extended precision computations starting from double precision arguments

Here are two implementations of interpolation functions. Argument u1 is always between 0.0 and 1.0.

#include <stdio.h>

double interpol_64(double u1, double u2, double u3)
{ 
  return u2 * (1.0 - u1) + u1 * u3;  
}

double interpol_80(double u1, double u2, double u3)
{ 
  return u2 * (1.0 - (long double)u1) + u1 * (long double)u3;  
}

int main()
{
  double y64,y80,u1,u2,u3;
  u1 = 0.025;
  u2 = 0.195;
  u3 = 0.195;
  y64 = interpol_64(u1, u2, u3);
  y80 = interpol_80(u1, u2, u3);
  printf("u2: %a\ny64:%a\ny80:%a\n", u2, y64, y80);
}

On a strict IEEE 754 platform with 80-bit long doubles, all computations in interpol_64() are done according to IEEE 754 double precision, and in interpol_80() in 80-bit extended precision. The program prints:

u2: 0x1.8f5c28f5c28f6p-3
y64:0x1.8f5c28f5c28f5p-3
y80:0x1.8f5c28f5c28f6p-3

I am interested in the property “the result returned by the function is always in-between u2 and u3”. This property is false of interpol_64(), as shown by the values in the main() above.

Does the property have a chance to be true of interpol_80()? If it isn't, what is a counter-example? Does it help if we know that u2 != u3 or that there is a minimum distance between them? Is there a method to determine a significand width for intermediate computations at which the property would be guaranteed to be true?

EDIT: on all the random values I tried, the property held when intermediate computations were done in extended precision internally. If interpol_80() took long double arguments, it would be relatively easy to build a counter-example, but the question here is specifically about a function that takes double arguments. This makes it much harder to build a counter-example, if there is one.


Note: a compiler generating x87 instructions may generate the same code for interpol_64() and interpol_80(), but this is tangential to my question.

asked Dec 05 '12 by Pascal Cuoq




2 Answers

Yes, interpol_80() is safe; let's demonstrate it.

The problem states that the inputs are 64-bit doubles, so rounding them to double is the identity:

rnd64(ui) = ui

The exact result (taking * and + as exact mathematical operations) is

r = u2*(1-u1)+(u1*u3)

The optimal return value, rounded to a 64-bit float, is

r64 = rnd64(r)

Since u1 is in [0, 1], r is a convex combination of u2 and u3, so (assuming u2 <= u3)

u2 <= r <= u3

It is guaranteed that

rnd64(u2) <= rnd64(r) <= rnd64(u3)
u2 <= r64 <= u3

Conversions of u1, u2, u3 to 80 bits are exact too:

rnd80(ui)=ui

Now, let's assume 0 <= u2 <= u3. Performing the computation with inexact floating-point operations introduces at most 4 rounding errors:

rf = rnd(rnd(u2*rnd(1-u1)) + rnd(u1*u3))

Assuming round-to-nearest-even, this is at most 2 ulp off the exact value, whether rounding is performed in 64-bit or 80-bit floats:

r - 2 ulp64(r) <= rf64 <= r + 2 ulp64(r)
r - 2 ulp80(r) <= rf80 <= r + 2 ulp80(r)

rf64 can be off by 2 ulp, so interpol_64() is unsafe. But what about rnd64(rf80)?
We can tell that:

rnd64(r - 2 ulp80(r)) <= rnd64(rf80) <= rnd64(r + 2 ulp80(r))

Since 0 <= u2 <= u3, then

ulp80(u2) <= ulp80(r) <= ulp80(u3)
rnd64(u2 - 2 ulp80(u2)) <= rnd64(r - 2 ulp80(r)) <= rnd64(rf80)
rnd64(u3 + 2 ulp80(u3)) >= rnd64(r + 2 ulp80(r)) >= rnd64(rf80)

Fortunately, u2 - 2 ulp80(u2) lies within (u2 - ulp64(u2)/2, u2 + ulp64(u2)/2), and similarly for u3 + 2 ulp80(u3); every number in such a range rounds back to the double itself, so we get

rnd64(u2 - 2 ulp80(u2)) = u2
rnd64(u3 + 2 ulp80(u3)) = u3

since ulp80(x) = ulp64(x)/2^(64-53), so that 2 ulp80(x) = ulp64(x)/2^10 < ulp64(x)/2

We thus get the proof

u2 <= rnd64(rf80) <= u3

For u2 <= u3 <= 0, the same proof applies by symmetry.

The last case to be studied is u2 <= 0 <= u3. When we add two large values of opposite sign, cancellation means the result can be off by up to ulp(big)/2 rather than ulp(big-big)/2...
Thus the assertion we made above doesn't hold anymore:

r - 2 ulp64(r) <= rf64 <= r + 2 ulp64(r)

Fortunately, u2 <= u2*(1-u1) <= 0 <= u1*u3 <= u3 and this is preserved after rounding

u2 <= rnd(u2*rnd(1-u1)) <= 0 <= rnd(u1*u3) <= u3

Since the added quantities have opposite signs, the sum lies between the two addends, hence:

u2 <= rnd(u2*rnd(1-u1)) + rnd(u1*u3) <= u3

The same holds after the final rounding, since rounding is monotonic and u2, u3 are exactly representable, so we can once again guarantee

u2 <= rnd64( rf80 ) <= u3

QED

To be complete, we should take care of denormal inputs (gradual underflow), but I hope you won't be that vicious with your stress tests. I won't demonstrate what happens with those...

EDIT:

Here is a follow-up, as the following assertion was a bit approximate and generated some comments, for the case 0 <= u2 <= u3:

r - 2 ulp80(r) <= rf80 <= r + 2 ulp80(r)

We can write the following inequalities:

rnd(1-u1) <= 1
rnd(1-u1) <= 1-u1+ulp(1)/4
u2*rnd(1-u1) <= u2 <= r
u2*rnd(1-u1) <= u2*(1-u1)+u2*ulp(1)/4
u2*ulp(1) < 2*ulp(u2) <= 2*ulp(r)
u2*rnd(1-u1) < u2*(1-u1)+ulp(r)/2

For the next rounding operation, we use

ulp(u2*rnd(1-u1)) <= ulp(r)
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)/2 + ulp(u2*rnd(1-u1))/2
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)/2 + ulp(r)/2
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)

For the second part of the sum, we have:

u1*u3 <= r
rnd(u1*u3) <= u1*u3 + ulp(u1*u3)/2
rnd(u1*u3) <= u1*u3 + ulp(r)/2

rnd(u2*rnd(1-u1))+rnd(u1*u3) < u2*(1-u1)+u1*u3 + 3*ulp(r)/2
rnd(rnd(u2*rnd(1-u1))+rnd(u1*u3)) < r + 3*ulp(r)/2 + ulp(r+3*ulp(r)/2)/2
ulp(r+3*ulp(r)/2) <= 2*ulp(r)
rnd(rnd(u2*rnd(1-u1))+rnd(u1*u3)) < r + 5*ulp(r)/2

I didn't quite prove the original claim (2 ulp), but 5/2 ulp is not that far off...

answered Oct 17 '22 by aka.nice


The main source of loss of precision in interpol_64() is the multiplications. Multiplying two 53-bit significands yields a 105- or 106-bit significand (depending on whether the high bit carries). This is too large to fit even in an 80-bit extended-precision value, whose significand is only 64 bits wide, so in general you'll also have loss of precision in the 80-bit version. Quantifying exactly when it happens is very difficult; the most that's easy to say is that it happens when rounding errors accumulate. Note that there's also a small rounding step when adding the two terms.

Most people would probably just solve this problem with a function like:

double interpol_64(double u1, double u2, double u3)
{ 
  return u2 + u1 * (u3 - u2);
}

But it looks like you're looking for insight into the rounding issues, not a better implementation.

answered Oct 17 '22 by R.. GitHub STOP HELPING ICE