I have two integer vectors, each of size nearly 1000, and I want to check whether the sums of the squares of their elements are equal. So I wrote the following code:
std::vector<int> array1;
std::vector<int> array2;
... // initialize array1 and array2; in the experiment all elements
    // in the two vectors are the same, but the order of elements may differ.
    // For example: array1 = {1001, 2002, 3003, ...}
    //              array2 = {2002, 3003, 1001, ...}
assert(array1.size() == array2.size());
float sum_array1 = 0;
float sum_array2 = 0;
for (std::size_t i = 0; i < array1.size(); i++)
    sum_array1 += array1[i] * array1[i];
for (std::size_t i = 0; i < array2.size(); i++)
    sum_array2 += array2[i] * array2[i];
I expected that sum_array1 would be equal to sum_array2, but in my application I found they were different: sum_array1 = 1.2868639e+009 while sum_array2 = 1.2868655e+009. What I did next was to change the type of sum_array1 and sum_array2 to double, as the following code shows:
double sum_array1 = 0;
double sum_array2 = 0;
for (std::size_t i = 0; i < array1.size(); i++)
    sum_array1 += array1[i] * array1[i];
for (std::size_t i = 0; i < array2.size(); i++)
    sum_array2 += array2[i] * array2[i];
This time sum_array1 is equal to sum_array2: sum_array1 = sum_array2 = 1286862225.0000000. My question is: why does this happen? Thanks.
Floating point values have a finite size, and can therefore only represent real values with a finite precision. This leads to rounding errors when you need more precision than they can store.
In particular, when adding a small number (such as those you're summing) to a much larger number (such as your accumulator), the loss of precision can be quite large compared with the small number, giving a significant error; and the errors will be different depending on the order.
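A minimal sketch of this effect (the values here are illustrative, chosen to sit exactly at a float's integer precision limit):

#include <iostream>

int main() {
    float big = 16777216.0f;  // 2^24: the largest magnitude at which float
                              // can still represent every integer exactly
    float small = 1.0f;
    // The 1 falls below the spacing between adjacent floats at this
    // magnitude, so the addition rounds straight back to big:
    std::cout << (big + small == big) << '\n';  // prints 1 (true)
}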
Typically, float has 24 bits of precision, corresponding to about 7 decimal places. Your accumulator requires 10 decimal places (around 30 bits), so you will experience this loss of precision. Typically, double has 53 bits (about 16 decimal places), so your result can be represented exactly.
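You can query these widths portably rather than memorising them; a small sketch:

#include <iostream>
#include <limits>

int main() {
    // Mantissa width in bits, including the implicit leading bit:
    std::cout << std::numeric_limits<float>::digits  << '\n';  // 24
    std::cout << std::numeric_limits<double>::digits << '\n';  // 53
    // Decimal digits guaranteed to round-trip through each type:
    std::cout << std::numeric_limits<float>::digits10  << '\n'; // 6
    std::cout << std::numeric_limits<double>::digits10 << '\n'; // 15
}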
A 64-bit integer may be the best option here, since all the inputs are integers. Using an integer avoids loss of precision, but introduces a danger of overflow if the inputs are too many or too large.
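For example, a minimal sketch of an exact 64-bit accumulation (sum_of_squares is a hypothetical helper name, not anything from the question):

#include <cstdint>
#include <vector>

std::int64_t sum_of_squares(const std::vector<int>& v) {
    std::int64_t sum = 0;
    for (int x : v)
        // Widen before multiplying, so the square itself cannot overflow int:
        sum += static_cast<std::int64_t>(x) * x;
    return sum;
}

With this version the two orderings are guaranteed to produce identical sums, since integer addition is exact and (absent overflow) order-independent.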
To minimise the error if you can't use a wide enough accumulator, you could sort the input so that the smallest values are accumulated first; or you could use more complicated methods such as Kahan summation.
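For reference, a sketch of Kahan (compensated) summation, assuming float inputs:

#include <vector>

float kahan_sum(const std::vector<float>& values) {
    float sum = 0.0f;
    float c = 0.0f;            // running compensation for lost low-order bits
    for (float v : values) {
        float y = v - c;       // apply the correction carried from last step
        float t = sum + y;     // low-order bits of y may be lost here...
        c = (t - sum) - y;     // ...recover them into c for the next step
        sum = t;
    }
    return sum;
}

Note that aggressive optimisations such as -ffast-math can algebraically eliminate the compensation term, so this must be compiled with strict floating-point semantics.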
In the two loops you're adding the same numbers, but in different orders. As soon as the sums exceed the largest integer that a float can represent exactly (2^24 = 16777216), you start losing precision, and the sums may end up slightly different.
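A compact demonstration of the order dependence, with values chosen (for illustration) near a float's precision limit:

#include <iomanip>
#include <iostream>

int main() {
    float big = 1.0e8f, small = 4.0f;
    float a = (big + small) + small;  // each small add rounds away to nothing
    float b = big + (small + small);  // combined first, the 8 survives
    std::cout << std::setprecision(10) << a << ' ' << b << '\n';
    // prints: 100000000 100000008
}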
An experiment for you to try:
float n = 0;
while (n != n + 1)
    n = n + 1;
// Will this terminate? If so, what is n now?
If you run this, you'll find that the loop actually terminates, which seems completely counter-intuitive but is correct behavior under the rules of IEEE single-precision floating-point arithmetic: once n reaches 2^24 = 16777216, the value n + 1 rounds back to n, so the condition n != n + 1 becomes false.
You can try the same experiment, replacing float with double. You'll witness the same strange behavior, but this time the loop will terminate when n is much larger, because IEEE double-precision floating-point numbers offer much finer precision.
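Incrementing by 1 all the way there would take impractically long with double, but you can jump straight to the threshold; a sketch (the 2^53 value is a known property of the format, not something from the answer above):

#include <iostream>

int main() {
    double n = 9007199254740992.0;  // 2^53: where double's integer spacing
                                    // first exceeds 1
    std::cout << (n == n + 1.0) << '\n';  // prints 1 (true): the +1 is lost
}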