Properties of 80-bit extended precision computations starting from double precision arguments

Tags:

c

floating-point

ieee-754

extended-precision

Here are two implementations of interpolation functions. Argument u1 is always between 0.0 and 1.0.

#include <stdio.h>

double interpol_64(double u1, double u2, double u3)
{ 
  return u2 * (1.0 - u1) + u1 * u3;  
}

double interpol_80(double u1, double u2, double u3)
{ 
  return u2 * (1.0 - (long double)u1) + u1 * (long double)u3;  
}

int main()
{
  double y64,y80,u1,u2,u3;
  u1 = 0.025;
  u2 = 0.195;
  u3 = 0.195;
  y64 = interpol_64(u1, u2, u3);
  y80 = interpol_80(u1, u2, u3);
  printf("u2: %a\ny64:%a\ny80:%a\n", u2, y64, y80);
}

On a strict IEEE 754 platform with 80-bit long doubles, all computations in interpol_64() are done according to IEEE 754 double precision, and in interpol_80() in 80-bit extended precision. The program prints:

u2: 0x1.8f5c28f5c28f6p-3
y64:0x1.8f5c28f5c28f5p-3
y80:0x1.8f5c28f5c28f6p-3

I am interested in the property “the result returned by the function is always in-between u2 and u3”. This property is false of interpol_64(), as shown by the values in the main() above.

Does the property have a chance to be true of interpol_80()? If not, what is a counter-example? Does it help if we know that u2 != u3 or that there is a minimum distance between them? Is there a method to determine a significand width for intermediate computations at which the property would be guaranteed to be true?

EDIT: on all the random values I tried, the property held when intermediate computations were done in extended precision internally. If interpol_80() took long double arguments, it would be relatively easy to build a counter-example, but the question here is specifically about a function that takes double arguments. This makes it much harder to build a counter-example, if there is one.

Note: a compiler generating x87 instructions may generate the same code for interpol_64() and interpol_80(), but this is tangential to my question.

513

asked Dec 05 '12 14:12

Pascal Cuoq

2 Answers

Yes, interpol_80() is safe. Let's demonstrate it.

The problem states that the inputs are 64-bit floats, so rounding them to 64 bits changes nothing:

rnd64(ui) = ui

The result is exactly (assuming * and + are mathematical operations)

r = u2*(1-u1)+(u1*u3)

The optimal return value, rounded to a 64-bit float, is

r64 = rnd64(r)

Since the exact result satisfies (taking u2 <= u3 without loss of generality)

u2 <= r <= u3

and rounding is monotonic, it is guaranteed that

rnd64(u2) <= rnd64(r) <= rnd64(u3)
u2 <= r64 <= u3

The conversions of u1, u2, u3 to 80 bits are exact too.

rnd80(ui)=ui

Now, let's assume 0 <= u2 <= u3. Performing the computation with inexact floating-point operations then involves at most 4 rounding errors:

rf = rnd(rnd(u2*rnd(1-u1)) + rnd(u1*u3))

Assuming round-to-nearest-even, this will be at most 2 ULP off the exact value. Depending on whether the rounding is performed with 64-bit floats or with 80-bit floats:

r - 2 ulp64(r) <= rf64 <= r + 2 ulp64(r)
r - 2 ulp80(r) <= rf80 <= r + 2 ulp80(r)

rf64 can be off by 2 ulp, so interpol_64() is unsafe, but what about rnd64( rf80 )?
We can tell that:

rnd64(r - 2 ulp80(r)) <= rnd64(rf80) <= rnd64(r + 2 ulp80(r))

Since 0 <= u2 <= u3, then

ulp80(u2) <= ulp80(r) <= ulp80(u3)
rnd64(u2 - 2 ulp80(u2)) <= rnd64(r - 2 ulp80(r)) <= rnd64(rf80)
rnd64(u3 + 2 ulp80(u3)) >= rnd64(r + 2 ulp80(r)) >= rnd64(rf80)

Fortunately, every number in the range (u2-ulp64(u2)/2 , u2+ulp64(u2)/2) rounds to u2 in 64-bit precision (and likewise for u3), so we get

rnd64(u2 - 2 ulp80(u2)) = u2
rnd64(u3 + 2 ulp80(u3)) = u3

since ulp80(x)=ulp64(x)/2^(64-53)

This proves

u2 <= rnd64(rf80) <= u3
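
The rounding margin used in this step can be checked numerically. The sketch below is only an illustration: binade, ulp64, ulp80 and rounds_back are hypothetical helpers, they assume finite, nonzero, normal inputs, and ulp80 only has the meaning used above when long double is the 80-bit x87 format.

```c
#include <float.h>

/* Hypothetical helper: largest power of two not exceeding |x|.
   Assumes x is finite; returns 0 for x == 0. */
long double binade(long double x)
{
  long double p = 1.0L;
  if (x < 0) x = -x;
  if (x == 0) return 0.0L;
  while (p > x) p /= 2;
  while (p * 2 <= x) p *= 2;
  return p;
}

/* Spacing of doubles around x (normal x only). */
double ulp64(double x)
{
  return (double)(binade(x) * DBL_EPSILON);
}

/* Spacing of long doubles around x; this is ulp80 on x87. */
long double ulp80(double x)
{
  return binade(x) * LDBL_EPSILON;
}

/* Do u - 2 ulp80(u) and u + 2 ulp80(u) both round back to u in double?
   The volatile stores force the conversion to double to actually happen. */
int rounds_back(double u)
{
  long double step = 2 * ulp80(u);
  volatile double below = (double)(u - step);
  volatile double above = (double)(u + step);
  return below == u && above == u;
}
```

On a platform with 80-bit long double, ulp80(0.195) * 2048 equals ulp64(0.195), which is the factor 2^(64-53) = 2^11 quoted above, and rounds_back(0.195) returns 1.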

For u2 <= u3 <= 0, the same proof applies by symmetry.

The last case to be studied is u2 <= 0 <= u3. If we subtract 2 big values, then the result can be up to ulp(big)/2 off rather than ulp(big-big)/2...
Thus the assertion we made earlier no longer holds:

r - 2 ulp64(r) <= rf64 <= r + 2 ulp64(r)

Fortunately, u2 <= u2*(1-u1) <= 0 <= u1*u3 <= u3, and this is preserved after rounding:

u2 <= rnd(u2*rnd(1-u1)) <= 0 <= rnd(u1*u3) <= u3

Thus, since the added quantities are of opposite signs:

u2 <= rnd(u2*rnd(1-u1)) + rnd(u1*u3) <= u3

The same goes after rounding, so we can once again guarantee

u2 <= rnd64( rf80 ) <= u3

QED
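
The mixed-sign argument can be turned into a small check. This is a sketch under the stated assumptions (u1 in [0, 1], u2 <= 0 <= u3, finite inputs); mixed_sign_ok is a hypothetical name, not something from the question.

```c
/* Returns 1 when the chain of inequalities used in the mixed-sign case
   holds for the rounded terms and for the final rounded result.
   Assumes 0 <= u1 <= 1 and u2 <= 0 <= u3. */
int mixed_sign_ok(double u1, double u2, double u3)
{
  long double t1 = u2 * (1.0L - u1);     /* rnd(u2*rnd(1-u1)), in [u2, 0] */
  long double t2 = (long double)u1 * u3; /* rnd(u1*u3), in [0, u3]        */
  double y = (double)(t1 + t2);          /* rnd64(rf80)                   */

  return u2 <= t1 && t1 <= 0.0L
      && 0.0L <= t2 && t2 <= u3
      && u2 <= y && y <= u3;
}
```

For example, mixed_sign_ok(0.3, -1e10, 1e10) returns 1. The check relies only on rounding being monotonic and on u2, 0 and u3 being representable, so it does not depend on the width of long double.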

To be complete, we should take care of denormal inputs (gradual underflow), but I hope you won't be that vicious with stress tests. I won't demonstrate what happens with those...

EDIT:

Here is a follow-up, as the following assertion was a bit approximate and generated some comments, for the case 0 <= u2 <= u3:

r - 2 ulp80(r) <= rf80 <= r + 2 ulp80(r)

We can write the following inequalities:

rnd(1-u1) <= 1
rnd(1-u1) <= 1-u1+ulp(1)/4
u2*rnd(1-u1) <= u2 <= r
u2*rnd(1-u1) <= u2*(1-u1)+u2*ulp(1)/4
u2*ulp(1) < 2*ulp(u2) <= 2*ulp(r)
u2*rnd(1-u1) < u2*(1-u1)+ulp(r)/2

For the next rounding operation, we use

ulp(u2*rnd(1-u1)) <= ulp(r)
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)/2 + ulp(u2*rnd(1-u1))/2
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)/2 + ulp(r)/2
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)

For the second part of the sum, we have:

u1*u3 <= r
rnd(u1*u3) <= u1*u3 + ulp(u1*u3)/2
rnd(u1*u3) <= u1*u3 + ulp(r)/2

rnd(u2*rnd(1-u1))+rnd(u1*u3) < u2*(1-u1)+u1*u3 + 3*ulp(r)/2
rnd(rnd(u2*rnd(1-u1))+rnd(u1*u3)) < r + 3*ulp(r)/2 + ulp(r+3*ulp(r)/2)/2
ulp(r+3*ulp(r)/2) <= 2*ulp(r)
rnd(rnd(u2*rnd(1-u1))+rnd(u1*u3)) < r + 5*ulp(r)/2

This does not quite prove the original 2 ulp claim, but the bound obtained (5/2 ulp) is not far from it.
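
As a hedged sanity check, the looser constant looks harmless for the 80-bit argument, because that argument only needs the error to stay below half a 64-bit ulp. Written out for the upper bound only (the lower bound would need the mirrored derivation, which is not carried out here), assuming 0 <= u2 <= u3 and normal numbers:

```latex
\begin{align*}
\mathrm{rf80} &< r + \tfrac{5}{2}\,\mathrm{ulp80}(r)
  \le u_3 + \tfrac{5}{2}\,\mathrm{ulp80}(u_3)
  && \text{since } 0 \le r \le u_3 \\
\mathrm{ulp80}(u_3) &= 2^{-(64-53)}\,\mathrm{ulp64}(u_3) = 2^{-11}\,\mathrm{ulp64}(u_3) \\
\tfrac{5}{2}\,\mathrm{ulp80}(u_3) &= \tfrac{5}{4096}\,\mathrm{ulp64}(u_3)
  < \tfrac{1}{2}\,\mathrm{ulp64}(u_3) \\
\mathrm{rnd64}(\mathrm{rf80}) &\le u_3
\end{align*}
```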

131

answered Oct 17 '22 20:10

aka.nice

The main source of loss-of-precision in interpol_64 is the multiplications. Multiplying two 53-bit mantissas yields a 105- or 106-bit (depending on whether the high bit carries) mantissa. This is too large to fit in an 80-bit extended precision value, so in general, you'll also have loss-of-precision in the 80-bit version. Quantifying exactly when it happens is very difficult; the most that's easy to say is that it happens when rounding errors accumulate. Note that there's also a small rounding step when adding the two terms.

Most people would probably just solve this problem with a function like:

double interpol_64(double u1, double u2, double u3)
{ 
  return u2 + u1 * (u3 - u2);
}

But it looks like you're looking for insight into the rounding issues, not a better implementation.
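
For what it's worth, the in-between property can be probed on this rewritten formula too. The sketch below renames it interpol_alt (a hypothetical name, to avoid clashing with the question's interpol_64) and the inputs in the comments are picked by hand for illustration.

```c
/* Same formula as above, under a different (hypothetical) name. */
double interpol_alt(double u1, double u2, double u3)
{
  return u2 + u1 * (u3 - u2);
}

/* Two hand-picked inputs:
 *
 *   interpol_alt(0.025, 0.195, 0.195) == 0.195
 *     u3 - u2 is exactly 0, so the result is exactly u2.
 *
 *   interpol_alt(1.0, 1.0, 1e-20) == 0.0
 *     u3 - u2 rounds to -1.0, so the result is 0.0,
 *     which lies below both u2 = 1.0 and u3 = 1e-20.
 */
```

So the rewritten form fixes the u2 == u3 case from the question, but on its own it does not seem to guarantee the property when u2 and u3 differ greatly in magnitude.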

answered Oct 17 '22 21:10

R.. GitHub STOP HELPING ICE
