Can anyone recommend any C++ libraries/routines/packages that contain strategies for maintaining the stability of various floating point operations?
Example: suppose you would like to sum across a vector/array of one million `long double` values in the unit interval (0,1), and that each number is of about the same order of magnitude. Naively summing with `for (int i = 0; i < 1000000; ++i) sum += array[i];` is unreliable: for large enough `i`, `sum` will be of a much larger order of magnitude than `array[i]`, and so `sum += array[i]` would be equivalent to `sum += 0.00`.
(Note: the solution to this example is a binary summing strategy.)
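To make the "binary summing strategy" concrete, here is a minimal recursive pairwise-summation sketch (the function name `pairwise_sum` is illustrative, not taken from any particular library):

```cpp
#include <cstddef>

// Pairwise ("binary") summation: split the array in half, sum each half
// recursively, and add the two partial sums. The operands being added stay
// at comparable magnitudes, so the rounding error grows roughly
// logarithmically with the number of terms instead of linearly.
long double pairwise_sum(const long double* data, std::size_t n) {
    if (n == 0) return 0.0L;
    if (n == 1) return data[0];
    std::size_t half = n / 2;
    return pairwise_sum(data, half) + pairwise_sum(data + half, n - half);
}
```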
I deal with sums and products of thousands/millions of minuscule probabilities. I am using the arbitrary-precision library MPFRC++ with a 2048-bit significand, but the same concerns still apply.
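For context, selecting the working precision in MPFRC++ looks roughly like the sketch below (based on the mpreal interface; the header path and exact calls may vary between versions):

```cpp
#include "mpreal.h"  // MPFRC++ header; install location may differ

int main() {
    // Request a 2048-bit significand for all subsequently created mpreal values.
    mpfr::mpreal::set_default_prec(2048);

    mpfr::mpreal sum = 0;
    for (int i = 0; i < 1000; ++i)
        sum += mpfr::mpreal("1e-30");  // tiny probabilities, accumulated at high precision
    return 0;
}
```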
I am chiefly concerned with:

- summation stability,
- multiplication and division stability, and
- if I normalize the array by dividing each element by some constant, what should my normalization constant be?
Binary summation doesn't guarantee an accurate result. The most reliable (albeit slower) method is to use Kahan summation. Boost.Accumulators has an implementation of the above and much more.
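For illustration, here is a minimal Kahan (compensated) summation sketch; the function name and the use of raw pointers are placeholders for this example, not any library's API:

```cpp
#include <cstddef>

// Kahan (compensated) summation: carries a running correction term so that
// low-order bits lost when adding a small element to a large running sum
// are re-injected on the next iteration.
long double kahan_sum(const long double* data, std::size_t n) {
    long double sum = 0.0L;   // running total
    long double c   = 0.0L;   // running compensation for lost low-order bits
    for (std::size_t i = 0; i < n; ++i) {
        long double y = data[i] - c;   // apply the compensation first
        long double t = sum + y;       // low-order bits of y may be lost here
        c = (t - sum) - y;             // recover what was just lost
        sum = t;
    }
    return sum;
}
```

One caveat: aggressive floating-point optimization flags such as `-ffast-math` may let the compiler simplify the compensation term away, so this pattern should be compiled with strict IEEE semantics.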
Multiplication and division stability: unless you get down to denormalized (subnormal) floats, they don't suffer from the same problems as summation and subtraction. In fact, the error of a single multiplication is at most 0.5 ulp (units in the last place).
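To illustrate the subnormal caveat in the context of long products of tiny probabilities (a self-contained, assumed example rather than code from the question):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Multiplying many small probabilities (~1e-3 each): the product leaves
    // the normal double range after roughly 103 factors and underflows to
    // exactly 0 a few factors later, so the 0.5-ulp error bound for a single
    // multiplication no longer tells the whole story.
    double p = 1.0;
    bool reported_subnormal = false;
    for (int i = 1; i <= 120; ++i) {
        p *= 1e-3;
        if (!reported_subnormal && std::fpclassify(p) == FP_SUBNORMAL) {
            std::printf("subnormal after %d factors: %g\n", i, p);
            reported_subnormal = true;
        }
        if (p == 0.0) {
            std::printf("underflowed to zero after %d factors\n", i);
            break;
        }
    }
    return 0;
}
```

A common workaround for long products of probabilities is to accumulate log-probabilities instead and only exponentiate at the end, if at all.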
... what should my normalization constant be?
What do you mean by 'normalize'? It depends on the norm you use. Possible candidates: the maximum absolute value in the array, or any other generalized mean of the absolute values. (Other choices you listed do not work, since they may be zero even for a non-zero array.)
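As a concrete sketch of the first candidate (the helper name `normalize_by_max_abs` is hypothetical):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Divide every element by the largest absolute value in the array, so the
// rescaled values lie in [-1, 1]. Unlike normalizing by the sum or the mean,
// the maximum absolute value is non-zero whenever the array has any non-zero
// entry.
void normalize_by_max_abs(std::vector<long double>& v) {
    if (v.empty()) return;
    long double m = 0.0L;
    for (long double x : v) m = std::max(m, std::fabs(x));
    if (m == 0.0L) return;  // all-zero array: nothing to scale
    for (long double& x : v) x /= m;
}
```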