Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Minimize floating point error when adding multiple floating point variables

In my c++ app i have a vector of doubles in the range (0,1) and i have to calculate its total as accurately as possible. It feels like this issue should have been addressed before, but i cant find anything.

Obviously iterating through each item on the vector and doing sum+=vect[i] accumulates a significant error if the vector size is large and there are items which are significantly smaller then the others.

My current solution is this function:

double sumDoubles(vector<double> arg)// pass by copy
{
  sort(arg.rbegin(),arg.rend());  // sort in reverse order
  for(int i=1;i<=arg.size();i*=2)
    for(int j=0;j<arg.size()-i;j+=(2*i))
        arg[j]+=arg[j+i];
  return arg[0];
}

Basically it sorts the input in ascending order and calculates pairwise sums:

a+b+c+d+e+f+g+h=((a+b)+(c+d))+((e+f)+(g+h))

Like constructing a binary tree, but doing it in place. Sorting should ensure that at each step the two summands are of comparable magnitude.

The code above does perform better than a single loop with accumulative sum. However i am curious if it is possible to increase precision further while not degrading performance too much.

like image 595
QuarterlyQuotaOfQuotes Avatar asked Oct 02 '13 04:10

QuarterlyQuotaOfQuotes


People also ask

How can the floating point error be reduced?

One method to reduce high floating-point errors is to use higher precision to perform floating- point calculation of the original program. For example, one may replace a 32-bit single precision with a 64-bit double precision to improve the accuracy of results.

What is the main problem with floating point representation?

Accuracy in floating point representation is governed by number of significant bits, whereas range is limited by exponent. Not all real numbers can exactly be represented in floating point format.

Why are floating point calculations so inaccurate in Python?

It's a problem caused when the internal representation of floating-point numbers, which uses a fixed number of binary digits to represent a decimal number. It is difficult to represent some decimal number in binary, so in many cases, it leads to small roundoff errors.


1 Answers

One of the standard ways of approaching this issue is the Kahan summation algorithm. This algorithm reduces your worst-case error to being dependent upon your floating point precision rather than growing proportional to the length of your vector (and does it in O(n) time, albeit with more calculations per iteration).

Kahan summation will likely outperform your current sumDoubles due to your sorting every call, and will additionally improve pairwise summation's error growth of O(log n) to O(1). This said, if the sort is unnecessary, pairwise summation will likely outperform Kahan summation (due to the additional per-iteration math involved) with what may be (for your circumstances) minimal error growth.

like image 127
Elliot Robinson Avatar answered Sep 22 '22 12:09

Elliot Robinson