Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

The precision of a large floating point sum

I am trying to sum a sorted array of positive decreasing floating points. I have seen that the best way to sum them is to start adding up numbers from lowest to highest. I wrote this code to have an example of that, however, the sum that starts on the highest number is more precise. Why? (of course, the sum 1/k^2 should be f=1.644934066848226).

#include <stdio.h>
#include <math.h>

int main() {

    double sum = 0;
    int n;
    int e = 0;
    double r = 0;
    double f = 1.644934066848226;
    double x, y, c, b;
    double sum2 = 0;

    printf("introduce n\n");
    scanf("%d", &n);

    double terms[n];

    y = 1;

    while (e < n) {
        x = 1 / ((y) * (y));
        terms[e] = x;
        sum = sum + x;
        y++;
        e++;
    }

    y = y - 1;
    e = e - 1;

    while (e != -1) {
        b = 1 / ((y) * (y));
        sum2 = sum2 + b;
        e--;
        y--;
    }
    printf("sum from biggest to smallest is %.16f\n", sum);
    printf("and its error %.16f\n", f - sum);
    printf("sum from smallest to biggest is %.16f\n", sum2);
    printf("and its error %.16f\n", f - sum2);
    return 0;
}
like image 798
codingnight Avatar asked Mar 03 '18 23:03

codingnight


People also ask

What is the precision value of floating-point?

The data type float has 24 bits of precision. This is equivalent to only about 7 decimal places. (The rest of the 32 bits are used for the sign and size of the number.) The number of places of precision for float is the same no matter what the size of the number.

What is maximum precision value of floating value?

A decimal floating-point value is an IEEE 754r number with a decimal point. The position of the decimal point is stored in each decimal floating-point value. The maximum precision is 34 digits.

Why do floating-point numbers have limited precision?

Floating-point decimal values generally do not have an exact binary representation due to how the CPU represents floating point data. For this reason, you may experience a loss of precision, and some floating-point operations may produce unexpected results.

What is the largest floating-point value?

A signed 32-bit integer variable has a maximum value of 231 − 1 = 2,147,483,647, whereas an IEEE 754 32-bit base-2 floating-point variable has a maximum value of (2 − 2−23) × 2127 ≈ 3.4028235 × 1038.

What is the precision of a floating-point number?

The precision of a floating-point number defines how many significant digits it can represent without information loss. When outputting floating-point numbers, cout has a default precision of 6 and it truncates anything after that. Below are a few libraries and methods which are used to provide precision to floating-point numbers in C++:

How do you find the summation of large floating-point numbers?

Using this concept, we can also find the summation of large floating-point numbers. Steps to add the two given floating-point numbers: Split both the given floating-point number in form of a string with respect to the decimal point to separate the fractional and integer part of the numbers.

What is the precision of a double A float?

A float has 23 bits of mantissa, so the precision we have at 3.5 is: 3.5 itself is actually exactly representable by a float, double or half, but the amount of precision numbers have at that scale is that value. The smallest number you can add or subtract to a value between 2 and 4 is that value.

What is floating point arithmetic?

Floating Point Arithmetic: Issues and Limitations ¶ Floating-point numbers are represented in computer hardware as base 2 (binary) fractions. For example, the decimal fraction has value 1/10 + 2/100 + 5/1000, and in the same way the binary fraction


2 Answers

Your code creates an array double terms[n]; on the stack, and this puts a hard limit on the number of iterations that can be performed before your program crashes.

But you don't even fetch anything from this array, so there's no reason to have it there at all. I altered your code to get rid of terms[]:

#include <stdio.h>

int main() {

    double pi2over6 = 1.644934066848226;
    double sum = 0.0, sum2 = 0.0;
    double y;
    int i, n;

    printf("Enter number of iterations:\n");
    scanf("%d", &n);

    y = 1.0;

    for (i = 0; i < n; i++) {
        sum += 1.0 / (y * y);
        y += 1.0;
    }

    for (i = 0; i < n; i++) {
        y -= 1.0;
        sum2 += 1.0 / (y * y);
    }
    printf("sum from biggest to smallest is %.16f\n", sum);
    printf("and its error %.16f\n", pi2over6 - sum);
    printf("sum from smallest to biggest is %.16f\n", sum2);
    printf("and its error %.16f\n", pi2over6 - sum2);
    return 0;

}

When this is run with, say, a billion iterations, the smallest-first approach is considerably more accurate:

Enter number of iterations:
1000000000
sum from biggest to smallest is 1.6449340578345750
and its error 0.0000000090136509
sum from smallest to biggest is 1.6449340658482263
and its error 0.0000000009999996
like image 156
r3mainer Avatar answered Nov 11 '22 10:11

r3mainer


When you add two floating-point numbers with different orders of magnitude, the lower order bits of the smallest number are lost.

When you sum from smallest to largest, the partial sums grow like Σ1/k² for k from N to n, i.e. approximately 1/n-1/N (in blue), to be compared to 1/n².

When you sum from largest to smallest, the partial sums grow like Σ1/k² for k from n to N, which is about π²/6-1/n (in green) to be compared to 1/n².

It is clear that the second case results in many more bit losses.

enter image description here

like image 23
Yves Daoust Avatar answered Nov 11 '22 09:11

Yves Daoust