the Floating-point error

Question

#include <stdio.h>
int main()
{
    int n;
    while ( scanf( "%d", &n ) != EOF ) {
        double sum = 0,k;
        if( n > 5000000 || n<=0 )   //the judgment of the arrange
            break;
        for ( int i = 1; i <= n; i++ ) {
            k = (double) 1 / i;
            sum += k;
        }
        /*
        for ( int i = n; i > 0; i-- ) {
            k = 1 / (double)i;
            sum += k;
        }
        */
        printf("%.12lf
", sum);
    }
    return 0;
}

Why in the different loop I get the different answer. Is there a float-error? When I input 5000000 the sum is 16.002164235299 but as I use the other loop of for (notation part) I get the sum 16.002164235300.

ArjunShankar · Accepted Answer

Because floating point math is not associative:

i.e. (a + b) + c is not necessarily equal to a + (b + c)

Stepan Dyatkovskiy · Answer

I also bumped into a + b + c issue. Totally agreed with ArjunShankar.

// Here A != B in general case
float A = ( (a + b) + c) );
float B = ( (a + c) + b) );

Most of floating point operations are performed with data loss in mantis, even when components are fit well in it (numbers like 0.5 or 0.25). In fact I was quite happy to find out the cause of bug in my application. I have written short reminder article with detailed explanation:

http://stepan.dyatkovskiy.com/2018/04/machine-fp-partial-invariance-issue.html

Below is the C example. Good luck!

example.c

#include <stdio.h>

// Helpers declaration, for implementation scroll down
float getAllOnes(unsigned bits);
unsigned getMantissaBits();

int main() {

  // Determine mantissa size in bits
  unsigned mantissaBits = getMantissaBits();

  // Considering mantissa has only 3 bits, we would then get:
  // a = 0b10   m=1,  e=1
  // b = 0b110  m=11, e=1
  // c = 0b1000 m=1,  e=3
  // a + b = 0b1000, m=100, e=1
  // a + c = 0b1010, truncated to 0b1000, m=100, e=1
  // a + b + c result: 0b1000 + 0b1000 = 0b10000, m=100, e=2
  // a + c + b result: 0b1000 + 0b110 = 0b1110, m=111, e=1

  float a = 2,
        b = getAllOnes(mantissaBits) - 1,
        c = b + 1;

  float ab = a + b;
  float ac = a + c;

  float abc = a + b + c;
  float acb = a + c + b;

  printf("
"
         "FP partial invariance issue demo:
"
         "
"
         "Mantissa size = %i bits
"
         "
"
         "a = %.1f
"
         "b = %.1f
"
         "c = %.1f
"
         "(a+b) result: %.1f
"
         "(a+c) result: %.1f
"
         "(a + b + c) result: %.1f
"
         "(a + c + b) result: %.1f
"
         "---------------------------------
"
         "diff(a + b + c, a + c + b) = %.1f

",
         mantissaBits,
         a, b, c,
         ab, ac,
         abc, acb,
         abc - acb);

  return 1;
}

// Helpers

float getAllOnes(unsigned bits) {
    return (unsigned)((1 << bits) - 1);
}

unsigned getMantissaBits() {

    unsigned sz = 1;
    unsigned unbeleivableHugeSize = 1024;
    float allOnes = 1;

    for (;sz != unbeleivableHugeSize &&
          allOnes + 1 != allOnes;
          allOnes = getAllOnes(++sz)
          ) {}

    return sz-1;
}

the Floating-point error

Tags:

c

floating-point

floating-point-conversion

Hong Wei

2 Answers

ArjunShankar

Stepan Dyatkovskiy

Recent Activity

Donate For Us

the Floating-point error

Tags:

c

floating-point

floating-point-conversion

Hong Wei

2 Answers

ArjunShankar

Stepan Dyatkovskiy

Related questions

Recent Activity

Donate For Us