Here are two implementations of interpolation functions. Argument u1
is always between 0.
and 1.
.
#include <stdio.h>
double interpol_64(double u1, double u2, double u3)
{
return u2 * (1.0 - u1) + u1 * u3;
}
double interpol_80(double u1, double u2, double u3)
{
return u2 * (1.0 - (long double)u1) + u1 * (long double)u3;
}
int main()
{
double y64,y80,u1,u2,u3;
u1 = 0.025;
u2 = 0.195;
u3 = 0.195;
y64 = interpol_64(u1, u2, u3);
y80 = interpol_80(u1, u2, u3);
printf("u2: %a\ny64:%a\ny80:%a\n", u2, y64, y80);
}
On a strict IEEE 754 platform with 80-bit long double
s, all computations in interpol_64()
are done according to IEEE 754 double precision, and in interpol_80()
in 80-bit extended precision.
The program prints:
u2: 0x1.8f5c28f5c28f6p-3
y64:0x1.8f5c28f5c28f5p-3
y80:0x1.8f5c28f5c28f6p-3
I am interested in the property “the result returned by the function is always in-between u2
and u3
”. This property is false of interpol_64()
, as shown by the values in the main()
above.
Does the property have a chance to be true of interpol_80()
? If it isn't, what is a counter-example? Does it help if we know that u2 != u3
or that there is a minimum distance between them? Is there a method to determine a significand width for intermediate computations at which the property would be guaranteed to be true?
EDIT: on all the random values I tried, the property held when intermediate computations were done in extended precision internally. If interpol_80()
took long double
arguments, it would be relatively easy to build a counter-example, but the question here is specifically about a function that takes double
arguments. This makes it much harder to build a counter-example, if there is one.
Note: a compiler generating x87 instructions may generate the same code for interpol_64()
and interpol_80()
, but this is tangential to my question.
The XDR standard defines the encoding for the double-precision floating-point data type as a double. The length of a double is 64 bits or 8 bytes.
Extended precision refers to floating-point number formats that provide greater precision than the basic floating-point formats. Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format.
Double precision is an inexact, variable-precision numeric type. In other words, some values cannot be represented exactly and are stored as approximations. Thus, input and output operations involving double precision might show slight discrepancies.
For extended precision : It requires 80 bits.
Yes, interpol_80() is safe, let's demonstrate it.
The problem states that inputs are 64bits float
rnd64(ui) = ui
The result is exactly (assuming * and + are mathematical operations)
r = u2*(1-u1)+(u1*u3)
Optimal return value rounded to 64 bit float is
r64 = rnd64(r)
As we have these properties
u2 <= r <= u3
It is guaranteed that
rnd64(u2) <= rnd64(r) <= rnd64(u3)
u2 <= r64 <= u3
Conversion to 80bits of u1,u2,u3 are exact too.
rnd80(ui)=ui
Now, let's assume 0 <= u2 <= u3
, then performing with inexact float operations leads to at most 4 rounding errors:
rf = rnd(rnd(u2*rnd(1-u1)) + rnd(u1*u3))
Assuming round to nearest even, this will be at most 2 ULP off exact value. If rounding is performed with 64 bits float or 80 bits floats:
r - 2 ulp64(r) <= rf64 <= r + 2 ulp64(r)
r - 2 ulp80(r) <= rf80 <= r + 2 ulp80(r)
rf64
can be off by 2 ulp so interpol-64() is unsafe, but what about rnd64( rf80 )
?
We can tell that:
rnd64(r - 2 ulp80(r)) <= rnd64(rf80) <= rnd64(r + 2 ulp80(r))
Since 0 <= u2 <= u3
, then
ulp80(u2) <= ulp80(r) <= ulp80(r3)
rnd64(u2 - 2 ulp80(u2)) <= rnd64(r - 2 ulp80(r)) <= rnd64(rf80)
rnd64(u3 + 2 ulp80(u3)) >= rnd64(r + 2 ulp80(r)) >= rnd64(rf80)
Fortunately, like every number in range (u2-ulp64(u2)/2 , u2+ulp64(u2)/2)
we get
rnd64(u2 - 2 ulp80(u2)) = u2
rnd64(u3 + 2 ulp80(u3)) = u3
since ulp80(x)=ulp62(x)/2^(64-53)
We thus get the proof
u2 <= rnd64(rf80) <= u3
For u2 <= u3 <= 0, we can apply same proof easily.
The last case to be studied is u2 <= 0 <= u3. If we subtract 2 big values, then result can be up to ulp(big)/2 off rather than ulp(big-big)/2...
Thus this assertion we made doesn't hold anymore:
r - 2 ulp64(r) <= rf64 <= r + 2 ulp64(r)
Fortunately, u2 <= u2*(1-u1) <= 0 <= u1*u3 <= u3
and this is preserved after rounding
u2 <= rnd(u2*rnd(1-u1)) <= 0 <= rnd(u1*u3) <= u3
Thus since added quantities are of opposite sign:
u2 <= rnd(u2*rnd(1-u1)) + rnd(u1*u3) <= u3
same goes after rounding, so we can once again guaranty
u2 <= rnd64( rf80 ) <= u3
QED
To be complete we should care of denormal inputs (gradual underflow), but I hope you won't be that vicious with stress tests. I won't demonstrate what happens with those...
EDIT:
Here is a follow-up as the following assertion was a bit approximative and generated some comments when 0 <= u2 <= u3
r - 2 ulp80(r) <= rf80 <= r + 2 ulp80(r)
We can write the following inequalities:
rnd(1-u1) <= 1
rnd(1-u1) <= 1-u1+ulp(1)/4
u2*rnd(1-u1) <= u2 <= r
u2*rnd(1-u1) <= u2*(1-u1)+u2*ulp(1)/4
u2*ulp(1) < 2*ulp(u2) <= 2*ulp(r)
u2*rnd(1-u1) < u2*(1-u1)+ulp(r)/2
For next rounding operation, we use
ulp(u2*rnd(1-u1)) <= ulp(r)
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)/2 + ulp(u2*rnd(1-u1))/2
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)/2 + ulp(r)/2
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)
For second part of the sum, we have:
u1*u3 <= r
rnd(u1*u3) <= u1*u3 + ulp(u1*u3)/2
rnd(u1*u3) <= u1*u3 + ulp(r)/2
rnd(u2*rnd(1-u1))+rnd(u1*u3) < u2*(1-u1)+u1*u3 + 3*ulp(r)/2
rnd(rnd(u2*rnd(1-u1))+rnd(u1*u3)) < r + 3*ulp(r)/2 + ulp(r+3*ulp(r)/2)/2
ulp(r+3*ulp(r)/2) <= 2*ulp(r)
rnd(rnd(u2*rnd(1-u1))+rnd(u1*u3)) < r + 5*ulp(r)/2
I didn't prove the original claim, but not that far...
The main source of loss-of-precision in interpol_64
is the multiplications. Multiplying two 53-bit mantissas yields a 105- or 106-bit (depending on whether the high bit carries) mantissa. This is too large to fit in an 80-bit extended precision value, so in general, you'll also have loss-of-precision in the 80-bit version. Quantifying exactly when it happens is very difficult; the most that's easy to say is that it happens when rounding errors accumulate. Note that there's also a small rounding step when adding the two terms.
Most people would probably just solve this problem with a function like:
double interpol_64(double u1, double u2, double u3)
{
return u2 + u1 * (u3 - u2);
}
But it looks like you're looking for insight into the rounding issues, not a better implementation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With