Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does double z=x-y guarantee that z+y==x for IEEE 754 floating point?

I have a problem that can be reduced to this problem statement:

Given a series of doubles where each is in the range [0, 1e7], modify the last element such that the sum of the numbers equals exactly a target number. The series of doubles already sums to the target number within an epsilon (1e-7), but they are not ==.


The following code is working, but is it guaranteed to work for all inputs that meet the requirements described in the first sentence?

public static double[] FixIt(double[] input, double targetDouble)
{
    var result = new double[input.Length];
    if (input.Length == 0) return result;

    double sum = 0;
    for (int i = 0; i < input.Length - 1; i++)
    {
        sum += input[i];
        result[i] = input[i];
    }

    double remainder = targetDouble - sum;
    result[result.Length - 1] = remainder;
    return result;
}

var arr1 = Enumerable.Repeat(Math.PI / 13, 13).ToArray();
var arr2 = FixIt(arr1, Math.PI);

Debug.Print(Math.PI.ToString("R")); //3.1415926535897931
Debug.Print(arr1.Sum().ToString("R")); //3.1415926535897922
Debug.Print(arr2.Sum().ToString("R")); //3.1415926535897931

A previous version of this question asked about modifying the first element, but modifying the last element simplifies the problem to a known sum and a known target, leaving us with just the question of whether last = target-sum implies that sum+last == target.

(Without NaN of course, and the restrictions on ranges imply some restrictions on last as well that might help.)

Regarding the real problem: We've had this problem surface a number of times in a variety of ways, but what we are trying to do at the moment is reduce the floating point error that crops up due to numerical instabilities in a linear programming solver (Coin-OR CBC). For example, there are 6 variables which all must be in the range [0,X] and the sum of the variables must also be X. Due to numerical instability, the solver occasionally returns slightly negative values and values that do not sum to exactly X. We've overcome the negative number issues - now just trying to resolve the sum to X issue. (Yes, there may be constraints that are being disobeyed by us changing the results, but making sure these numbers sum to X is of higher priority, where the other constraints are not as important.)

like image 400
MineR Avatar asked Dec 17 '22 17:12

MineR


1 Answers

z = x-y; does not guarantee z+y == x, and there is not always a solution for the problem of finding a z such z+y == x. A proof follows.

We assume IEEE-754 binary floating-point arithmetic with rounding to nearest, ties to even. The basic 64-bit format is used, but the result holds for other formats. Note that the 64-bit format uses 53-bit significands, meaning that only numbers with 53 or fewer significant binary digits can be represented.

Consider a target x equal to 1+2−52. Let y be 2−53. Then, after z = x-y;, z+y == x evaluates to false. The arithmetic details are shown below, but:

  • z = x-y; sets z to 1, and then z+y produces 1, which is less than x.
  • If we increase z to the next representable number, 1+2−52, then z+y produces 1+2−51, which is greater than x.
  • So there is no value of z that makes z+y == x true.

Details:

The mathematical result of xy is 1+2−53. As this has 54 significant bits (from 20 to 2−53), it is not representable, and the computed result of x-y must be rounded. The two nearest numbers are 1 and 1+2−52. The ties-to-even rule produces the former number, 1, as the low bit of its significand is 0, while the low bit for 1+2−52 is 1.

Thus z = x-y; sets z to 1.

Then the mathematical result of z+y is 1+2−53. As above, this is rounded to 1, so the computed result of z+y is 1. So z+y == x compares 1 to 1+2−52 and produces false.

Furthermore, no value of z could make the comparison true. If we increment z by the smallest available step, from 1 to 1+2−52, the mathematical sum of z+y is then 1+2−52+2−53. This is midway between the two representable numbers 1+2−52 and 1+2−51. The former has a low bit of 1, and the latter has a low bit of 0, so the computed result of this z+y is 1+2−51, which is of course not equal to 1+2−52.

Floating-point addition is weakly monotonic, so there are no values of z that would produce 1+2−52 for z+y.

like image 160
Eric Postpischil Avatar answered Dec 24 '22 02:12

Eric Postpischil