I am trying to sum a sorted array of positive, decreasing floating-point numbers. I have read that the most accurate way to sum them is to add them from smallest to largest. I wrote this code as an example of that; however, the sum that starts from the largest number comes out more precise. Why? (Of course, the sum of 1/k² should be f = 1.644934066848226.)
#include <stdio.h>
#include <math.h>
int main() {
    double sum = 0;
    int n;
    int e = 0;
    double r = 0;
    double f = 1.644934066848226;
    double x, y, c, b;
    double sum2 = 0;
    printf("introduce n\n");
    scanf("%d", &n);
    double terms[n];
    y = 1;
    while (e < n) {
        x = 1 / ((y) * (y));
        terms[e] = x;
        sum = sum + x;
        y++;
        e++;
    }
    y = y - 1;
    e = e - 1;
    while (e != -1) {
        b = 1 / ((y) * (y));
        sum2 = sum2 + b;
        e--;
        y--;
    }
    printf("sum from biggest to smallest is %.16f\n", sum);
    printf("and its error %.16f\n", f - sum);
    printf("sum from smallest to biggest is %.16f\n", sum2);
    printf("and its error %.16f\n", f - sum2);
    return 0;
}
Your code creates an array, double terms[n], on the stack, and this puts a hard limit on the number of iterations that can be performed before your program crashes.
But you never read anything back from this array, so there's no reason to have it at all. I altered your code to get rid of terms[]:
#include <stdio.h>
int main() {
    double pi2over6 = 1.644934066848226;
    double sum = 0.0, sum2 = 0.0;
    double y;
    int i, n;
    printf("Enter number of iterations:\n");
    scanf("%d", &n);
    /* Largest terms first: y runs from 1 up to n. */
    y = 1.0;
    for (i = 0; i < n; i++) {
        sum += 1.0 / (y * y);
        y += 1.0;
    }
    /* Smallest terms first: y runs from n back down to 1. */
    for (i = 0; i < n; i++) {
        y -= 1.0;
        sum2 += 1.0 / (y * y);
    }
    printf("sum from biggest to smallest is %.16f\n", sum);
    printf("and its error %.16f\n", pi2over6 - sum);
    printf("sum from smallest to biggest is %.16f\n", sum2);
    printf("and its error %.16f\n", pi2over6 - sum2);
    return 0;
}
When this is run with, say, a billion iterations, the smallest-first approach is considerably more accurate:
Enter number of iterations:
1000000000
sum from biggest to smallest is 1.6449340578345750
and its error 0.0000000090136509
sum from smallest to biggest is 1.6449340658482263
and its error 0.0000000009999996
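To read these two error figures, it may help to note that even an exactly computed partial sum differs from π²/6 by the tail of the series. A short sketch of that estimate, using the standard integral bounds (the notation is mine, not part of the program above):

```latex
\[
\frac{\pi^2}{6} - \sum_{k=1}^{n} \frac{1}{k^2}
  = \sum_{k=n+1}^{\infty} \frac{1}{k^2},
\qquad
\frac{1}{n+1} = \int_{n+1}^{\infty} \frac{dx}{x^2}
  < \sum_{k=n+1}^{\infty} \frac{1}{k^2}
  < \int_{n}^{\infty} \frac{dx}{x^2} = \frac{1}{n}.
\]
```

With n = 10^9 the tail is about 1e-9, so the smallest-first error of 0.0000000009999996 is almost entirely truncation of the series, while the largest-first error of 0.0000000090136509 carries roughly 8e-9 of accumulated rounding error on top of it.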
When you add two floating-point numbers of different orders of magnitude, the low-order bits of the smaller number are lost.
When you sum from smallest to largest, the partial sums grow like Σ1/k² for k from N to n, i.e. approximately 1/n − 1/N (in blue), to be compared to 1/n².
When you sum from largest to smallest, the partial sums grow like Σ1/k² for k from n to N, which is about π²/6 − 1/n (in green), to be compared to 1/n².
It is clear that the second case results in many more bit losses.