The precision of a large floating point sum

Tags: c, floating-point, precision, numerical-methods

I am trying to sum a sorted array of positive, decreasing floating-point numbers. I have read that the best way to sum them is to add them from smallest to largest. I wrote this code as an example of that; however, the sum that starts with the largest number comes out more precise. Why? (The sum of 1/k^2 should, of course, be f = 1.644934066848226.)

#include <stdio.h>
#include <math.h>

int main() {

    double sum = 0;
    int n;
    int e = 0;
    double r = 0;
    double f = 1.644934066848226;
    double x, y, c, b;
    double sum2 = 0;

    printf("introduce n\n");
    scanf("%d", &n);

    double terms[n];

    y = 1;

    while (e < n) {
        x = 1 / ((y) * (y));
        terms[e] = x;
        sum = sum + x;
        y++;
        e++;
    }

    y = y - 1;
    e = e - 1;

    while (e != -1) {
        b = 1 / ((y) * (y));
        sum2 = sum2 + b;
        e--;
        y--;
    }
    printf("sum from biggest to smallest is %.16f\n", sum);
    printf("and its error %.16f\n", f - sum);
    printf("sum from smallest to biggest is %.16f\n", sum2);
    printf("and its error %.16f\n", f - sum2);
    return 0;
}

asked Mar 03 '18 23:03

codingnight

2 Answers

Your code declares a variable-length array, double terms[n];, on the stack, and this puts a hard limit on the number of iterations that can be performed before your program crashes.

But you never read anything back from this array, so there's no reason to have it at all. I altered your code to get rid of terms[]:

#include <stdio.h>

int main(void) {

    double pi2over6 = 1.644934066848226;   /* limit of the series, pi^2/6 */
    double sum = 0.0, sum2 = 0.0;
    double y;
    int i, n;

    printf("Enter number of iterations:\n");
    if (scanf("%d", &n) != 1) {             /* reject non-numeric input */
        fprintf(stderr, "expected an integer\n");
        return 1;
    }

    y = 1.0;

    /* Largest terms first: k = 1, 2, ..., n */
    for (i = 0; i < n; i++) {
        sum += 1.0 / (y * y);
        y += 1.0;
    }

    /* Smallest terms first: y is now n + 1, so k = n, n-1, ..., 1 */
    for (i = 0; i < n; i++) {
        y -= 1.0;
        sum2 += 1.0 / (y * y);
    }
    printf("sum from biggest to smallest is %.16f\n", sum);
    printf("and its error %.16f\n", pi2over6 - sum);
    printf("sum from smallest to biggest is %.16f\n", sum2);
    printf("and its error %.16f\n", pi2over6 - sum2);
    return 0;
}

When this is run with, say, a billion iterations, the smallest-first approach is considerably more accurate:

Enter number of iterations:
1000000000
sum from biggest to smallest is 1.6449340578345750
and its error 0.0000000090136509
sum from smallest to biggest is 1.6449340658482263
and its error 0.0000000009999996
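
As a rough sanity check on these numbers (my own addition, not part of the run above): both errors are measured against the infinite series, so even a perfectly rounded sum of n terms would still be off by the tail of the series. A standard integral comparison bounds that tail:

```latex
\frac{1}{n+1} = \int_{n+1}^{\infty} \frac{dx}{x^{2}}
\;\le\; \sum_{k=n+1}^{\infty} \frac{1}{k^{2}}
\;\le\; \int_{n}^{\infty} \frac{dx}{x^{2}} = \frac{1}{n}
```

With n = 10^9 the tail is about 1e-9, which agrees with the smallest-first error of 0.0000000009999996 to within a few units in the last printed digit. So the smallest-first result appears to be limited almost entirely by truncation, while roughly 8e-9 of the largest-first error would have to come from rounding.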

answered Nov 11 '22 10:11

r3mainer

When you add two floating-point numbers with different orders of magnitude, the low-order bits of the smaller number are lost.

Let N be the total number of terms and n the index of the term currently being added.

When you sum from smallest to largest, the partial sums grow like Σ1/k² for k from N down to n, i.e. approximately 1/n-1/N (in blue), to be compared to the term 1/n².

When you sum from largest to smallest, the partial sums grow like Σ1/k² for k from 1 to n, which is about π²/6-1/n (in green), to be compared to 1/n².

In the second case the partial sum is much larger relative to the term being added, so many more bits are lost.
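
To put rough numbers on that comparison, here is a heuristic sketch (my own notation: N is the total number of terms, n the index of the term being added, and S the partial sum accumulated just before that term). The base-2 logarithm of S divided by the term estimates how many low-order bits of the term fall below the last bit of the sum.

```latex
\begin{align*}
% Smallest first: S holds the terms k = n+1, ..., N
\frac{S_{\text{small}}}{1/n^{2}}
  &\approx n^{2}\left(\frac{1}{n} - \frac{1}{N}\right)
   = n\left(1 - \frac{n}{N}\right) \le n
  &&\Rightarrow\ \text{at most about } \log_{2} n \text{ bits lost} \\
% Largest first: S holds the terms k = 1, ..., n-1
\frac{S_{\text{large}}}{1/n^{2}}
  &\approx \frac{\pi^{2}}{6}\, n^{2}
  &&\Rightarrow\ \text{about } 2\log_{2} n + 0.7 \text{ bits lost}
\end{align*}
```

Under this estimate, largest-first loses about twice as many bits per addition. With the 53-bit significand of an IEEE 754 double, the largest-first terms would be rounded away entirely once n is on the order of 10^8, whereas smallest-first still keeps roughly half the bits of each term at that point.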

[Figure: the partial sums 1/n-1/N (blue) and π²/6-1/n (green). Image: https://i.stack.imgur.com/aHkpp.png]

answered Nov 11 '22 09:11

Yves Daoust
