Minimize rounding error when dividing by integers

I am attempting to compute a double-precision floating-point number (64-bit) as the ratio of one product of integers to another. I wish to do so in a way that reduces the rounding error.

I am familiar with Kahan summation for addition and subtraction. What techniques work for division?

The numerator is a product of many long values (tens of thousands), likewise the denominator. I wish to prevent overflow and underflow, too. (One application is estimating infinite products by stopping after a sufficient number of terms.)

One thing I have tried is to factor the easily factorable numbers (using trial division by known primes up to a million) and cancel common factors, which helps, but not enough. My errors are approximately 1.0E-13.

I am working in C#, but any code that works with IEEE standard floating point numbers is welcome.

RESEARCH:

I came across a good paper that discusses EFTs (error-free transformations) for +, -, ×, and /, Horner's rule (polynomials), and square root. The title is "4ccurate 4lgorithms in Floating Point 4rithmetic" by Philippe Langlois. See http://www.mathematik.hu-berlin.de/~gaggle/S09/AUTODIFF/projects/papers/langlois_4ccurate_4lgorithms_in_floating_point_4rithmetic.pdf

The above pointed me to Karp and Markstein (for division): https://cr.yp.to/bib/1997/karp.pdf

asked May 16 '16 by Paul Chernoch



2 Answers

What techniques work for division?

For the division a/b, you can evaluate the residue (remainder):

a = b*q + r

This remainder r is easily accessible if you have a fused multiply-add:

q = a/b ;
r = fma(-b,q,a) ; /* r = a - b*q, computed exactly */

The same fma trick can be applied on multiplication:

y = a*b ;
r = fma(a,b,-y) ; // the result is y+r
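
As a minimal, self-contained illustration of both residues (standard C, fma from <math.h>; the sample values are arbitrary, and the program needs -lm on most platforms):

#include <math.h>
#include <stdio.h>

int main(void) {
    /* division residue: a = b*q + r */
    double a = 1.0, b = 3.0;
    double q = a / b;
    double r = fma(-b, q, a);      /* exactly a - b*q */
    printf("q = %.17g, r = %.17g\n", q, r);

    /* multiplication residue: the exact product is y + r2 */
    double c = 94906267.0, d = 94906267.0;  /* c*d exceeds 2^53, so y rounds */
    double y = c * d;
    double r2 = fma(c, d, -y);     /* exactly c*d - y */
    printf("y = %.17g, r2 = %.17g\n", y, r2);
    return 0;
}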

Then, if you end up with two approximate operands after the products, (a0+ra) / (b0+rb), you are interested in (a0+ra) = q*(b0+rb) + r.
You can first evaluate:

q0 = a0/b0 ;
r0 = fma(-b0,q0,a0);

Then approximate the remainder as:

r = fma(-q0,rb,r0+ra);

Then correct the quotient as:

q = q0 + r/b0;
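
Gathered into one function, the whole correction looks like this (the name div_dd and the packaging are mine, not part of the answer; fma is from <math.h>):

#include <math.h>

/* Approximate (a0 + ra) / (b0 + rb) with one correction step.
   a0 and b0 are the high parts; ra and rb the accumulated residues. */
double div_dd(double a0, double ra, double b0, double rb) {
    double q0 = a0 / b0;                /* first approximation of the quotient */
    double r0 = fma(-b0, q0, a0);       /* exactly a0 - b0*q0                  */
    double r  = fma(-q0, rb, r0 + ra);  /* remainder vs the exact operands     */
    return q0 + r / b0;                 /* corrected quotient                  */
}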

EDIT: What if fma is not available?

We can emulate the fma using an exact product à la Dekker, which decomposes a*b into the exact sum of two floating-point numbers, followed by the Boldo-Melquiond round-to-odd trick to make sure the sum of three floating-point numbers is exactly rounded.

But that would be overkill. We use the fma only for evaluating the residual error, so we generally have c very close to -a*b. In this case, ab+c is exact, and we only have two floating-point numbers to sum, not three.

Anyway, we are only roughly estimating the residual error of a bunch of operations, so the last bit of this residue is not that important.

So the fma can be written like this:

/* extract the high 26 bits of significand */
double upperHalf( double x ) {
    double secator = 134217729.0; /* 2^27 + 1 */
    double p = x * secator; /* simplified: normally we should check for overflow and scale down */
    return p + (x - p);
}

/* emulate a fused multiply-add: roundToNearest(a*b+c)
   Beware: use only when -c is an approximation of a*b,
   otherwise there is NO guarantee of correct rounding */
double emulated_fma(double a, double b, double c) {
    double aup = upperHalf(a);
    double alo = a-aup;
    double bup = upperHalf(b);
    double blo = b-bup;

    /* compute exact product of a and b
       which is the exact sum of ab and a residual error resab */
    double high = aup*bup; 
    double mid  = aup*blo + alo*bup;
    double low  = alo*blo;
    double ab = high + mid;
    double resab = (high - ab) + mid + low;

    double fma = ab + c; /* expected to be exact, so don't bother with residual error */
    return resab + fma;
}
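
For instance, the multiplication residue from the beginning of this answer could then be obtained without a native fma (note that -(-y) = y is close to a*b, so the precondition holds):

double y = a * b;
double r = emulated_fma(a, b, -y); /* the exact product is y + r */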

Well, this is a bit less overkill than the general emulated fma, but it might be more clever to use a language that provides a native fma for this part of the job...

answered Sep 30 '22 by aka.nice


The multiplication equivalent of the Kahan summation you are looking for is “double-double multiplication”. Here, if your integers are representable as double values, the macro Mul122 from crlibm is mostly enough.

#define Mul122(resh,resl,a,bh,bl)                 \
{                                                 \
    double _t1, _t2, _t3, _t4;                    \
                                                  \
    Mul12(&_t1,&_t2,(a),(bh));                    \
    _t3 = (a) * (bl);                             \
    _t4 = _t2 + _t3;                              \
    Add12((*(resh)),(*(resl)),_t1,_t4);           \
}

bh and bl are the running product stored with additional precision as the sum of two double values. a is the next integer (we are assuming it is exactly converted to a double). resh and resl receive the next running product, in which the factor a has been taken into account.
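
Mul122 uses two other crlibm primitives that are not shown here: Mul12 (Dekker's exact product) and Add12 (Fast2Sum). As a sketch, they implement the standard algorithms below (the actual crlibm source may differ in detail):

/* Add12: s + r == a + b exactly, assuming |a| >= |b| (Fast2Sum) */
#define Add12(s, r, a, b)              \
{                                      \
    double _u = (a), _v = (b);         \
    (s) = _u + _v;                     \
    (r) = _v - ((s) - _u);             \
}

/* Mul12: *rh + *rl == u*v exactly (Dekker's product with
   Veltkamp splitting by 2^27 + 1) */
#define Mul12(rh, rl, u, v)                        \
{                                                  \
    const double _c = 134217729.0; /* 2^27 + 1 */  \
    double _up = (u) * _c, _vp = (v) * _c;         \
    double _u1 = ((u) - _up) + _up;                \
    double _v1 = ((v) - _vp) + _vp;                \
    double _u2 = (u) - _u1, _v2 = (v) - _v1;       \
    *(rh) = (u) * (v);                             \
    *(rl) = (((_u1 * _v1 - *(rh)) + _u1 * _v2) + _u2 * _v1) + _u2 * _v2; \
}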

In order to avoid underflow and overflow, you can externalize the exponent to an integer of the width you wish. This is done by periodically applying the frexp function to the high part of the running product, and then normalizing the running product by dividing both components by the same power of two (tracking the total power of two by which the running product has been divided can be done on the side with an integer variable of the desired width).

How often to apply frexp depends on the bound you have on the integers you are multiplying. If the integers are below 2^53, which would help with them being exactly representable as double values, you can do about 19 multiplications before having to normalize the running product, because the double-precision exponent goes up to 1023.
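
A sketch of that periodic normalization in C (the function name and variables are mine; frexp and ldexp are from <math.h>):

#include <math.h>

/* Bring the high part of the running product back into [0.5, 1),
   scaling the low part by the same power of two (exact), and
   accumulating the stripped exponent on the side. */
void normalize(double *hi, double *lo, long *e2) {
    int e;
    *hi = frexp(*hi, &e);   /* *hi = m * 2^e, with 0.5 <= |m| < 1 */
    *lo = ldexp(*lo, -e);   /* divide by the same 2^e             */
    *e2 += e;               /* track the externalized exponent    */
}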

Once you have computed the products corresponding to the numerator and to the denominator, throw away the low components, and divide the high components. This will only introduce an error of about 1 ULP. You were not aiming for an error of less than a double-precision ULP, were you?

Do not forget the powers of two that you left on the side for both numerator and denominator! Subtract them and apply the difference to the quotient with the ldexp function.
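
The last two steps put together might look like this (all names are illustrative, not from the answer):

#include <math.h>

/* numh, denh: high parts of the numerator and denominator products;
   e_num, e_den: the powers of two externalized along the way. */
double final_quotient(double numh, long e_num, double denh, long e_den) {
    return ldexp(numh / denh, (int)(e_num - e_den));
}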

answered Sep 30 '22 by Pascal Cuoq