I am attempting to form a double precision floating point number (64-bit) by taking the ratio of one product of integers divided by another product of integers. I wish to do so in a way that reduces the rounding error.
I am familiar with Kahan summation for addition and subtraction. What techniques work for division?
The numerator is a product of many long values (tens of thousands), likewise the denominator. I wish to prevent overflow and underflow, too. (One application is estimating infinite products by stopping after a sufficient number of terms.)
One thing I have tried is to factor the easily factorable numbers (using trial division by known primes up to a million) and cancel common factors, which helps, but not enough. My errors are approximately 1.0E-13.
I am working in C#, but any code that works with IEEE standard floating point numbers is welcome.
RESEARCH:
I came across a good paper that discusses EFT (Error Free Transformations) for + - x /, Horner's Rule (polynomials), and square root. The title is "4ccurate 4lgorithms in Floating Point 4rithmetic" by Philippe Langlois. See http://www.mathematik.hu-berlin.de/~gaggle/S09/AUTODIFF/projects/papers/langlois_4ccurate_4lgorithms_in_floating_point_4rithmetic.pdf
The above pointed me to Karp and Markstein (for division): https://cr.yp.to/bib/1997/karp.pdf
Integer division rounds towards zero. This means, for example, that the result of the integer division 7 / 5 would have the real number 1.4 as an intermediate result and rounded in the direction of 0 would result in 1 .
A rounding error can be especially problematic when rounded input is used in a series of calculations, causing the error to compound, and sometimes to overpower the calculation. The term "rounding error" is also used sometimes to indicate an amount that is not material to a very large company.
The rounding error is the difference between the actual value and the rounded value, in this case (2.998 - 2.99792458) x 108, which works out to 0.00007542 x 108. Expressed in the correct scientific notation format, that value is 7.542 x 103, which equals 7542 in plain decimal notation.
What techniques work for division?
For the division a/b
, you can evaluate the residue (remainder):
a = b*q + r
This remainder r
is easily accessible if you have fused-multiply-add
q = a/b ;
r = fma(b,q,-a) ;
The same fma trick can be applied on multiplication:
y = a*b ;
r = fma(a,b,-y) ; // the result is y+r
Then if you end up with two approximate operands after the products (a0+ra) / (b0+rb)
, you are interested in (a0+ra) = q*(b0+rb) + r
.
You can first evaluate:
q0 = a0/b0 ;
r0 = fma(b0,q0,-a0);
Then approximate the remainder as:
r = fma(q0,rb,r0-ra);
Then correct the quotient as:
q = q0 + r/b0;
EDIT: What if fma is not available?
We can emulate the fma using an exact product à la Dekker, which is decomposed into the exact sum of 2 floating point, then a Boldo-Melquiond roundToOdd trick to be sure to have the sum of 3 floating point exactly rounded.
But it's going to be overkill. We use fma only for evaluating the residual error, so we generally have c very close to -ab. In this case, ab+c is exact, and we only have 2 floating points to sum, not 3.
Anyway, we only roughly estimate the residual error of a bunch of operations, so the last bit of this residue would not have been that important.
So the fma can be written like this:
/* extract the high 26 bits of significand */
double upperHalf( double x ) {
double secator = 134217729.0; /* 1<<27+1 */
double p = x * secator; /* simplified... normally we should check if overflow and scale down */
return p + (x - p);
}
/* emulate a fused multiply add: roundToNearestFloat(a*b+c)
Beware: use only when -c is an approximation of a*b
otherwise there is NO guaranty of correct rounding */
double emulated_fma(a,b,c) {
double aup = upperHalf(a);
double alo = a-aup;
double bup = upperHalf(b);
double blo = b-bup;
/* compute exact product of a and b
which is the exact sum of ab and a residual error resab */
double high = aup*bup;
double mid = aup*blo + alo*bup;
double low = alo*blo;
double ab = high + mid;
double resab = (high - ab) + mid + low;
double fma = ab + c; /* expected to be exact, so don't bother with residual error */
return resab + fma;
}
Well, a bit less overkill than the general emulated fma, but it might be more clever to use a language that provides native fma for this part of the job...
The multiplication equivalent of Kahan summation you are looking for is “double-double multiplication”. Here, if your integers are representable as double
values, the function Mul122
from crlibm is mostly enough.
#define Mul122(resh,resl,a,bh,bl) \
{ \
double _t1, _t2, _t3, _t4; \
\
Mul12(&_t1,&_t2,(a),(bh)); \
_t3 = (a) * (bl); \
_t4 = _t2 + _t3; \
Add12((*(resh)),(*(resl)),_t1,_t4); \
}
bh
and bl
are the running product stored with additional precision as the sum of two double
values. a
is the next integer (we are assuming it is exactly converted to a double
). resh
and resl
receive the next running product, in which the factor a
has been taken into account.
In order to avoid underflow and overflow, you can externalize the exponent to an integer of the width you wish. This is done by periodically applying the frexp
function to the high part of the running product, and then normalizing the running product by dividing both components by the same power of two (tracking the total power of two by which the running product has been divided can be done on the side with an integer variable of the desired width).
How often to apply frexp
depends on the bound you have on the integers you are multiplying. If the integers are below 253, which would help with them being exactly representable as double
values, you can do about 19 multiplications before having to normalize the running product, because the double-precision exponent goes up to 1023.
Once you have computed the products corresponding to the numerator and to the denominator, throw away the low components, and divide the high components. This will only introduce an error of about 1ULP. You were not aiming for an error of less than a double-precision ULP, were you?
Do not forget the powers of two that you left on the side for both numerator and denominator! Subtract them and apply the difference to the quotient with the ldexp
function.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With