Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Float is more precise than double?

Tags:

c

There is a sequence:

x(n+2)=9/4*x(n+1)-1/2*x(n)

x(1)=1/3,x(2)=1/12

The exact result is x(n)=4^(1-n)/3

I wish to show the rounding error of x(60) in calculation.

My code is

#include <stdio.h>
#include <math.h>
int main(void)
{
    float x[60];
    x[0] = 1./3;
    x[1] = 1./12;
    for (int i = 2; i < 60; i++) {
        x[i] = 9./4*x[i-1]-1./2*x[i-2];
    }
    double y[60];
    y[0] = 1./3;
    y[1] = 1./12;
    for (int i = 2; i < 60; i++) {
        y[i] = 9./4*y[i-1]-1./2*y[i-2];
    }
    printf("single:%g, double:%g, exact:%g\n", x[59], y[59], pow(4,-59)/3);
    return 0;
}

I compile it with gcc:

gcc seq.c

The output is:

single:1.00309e-36, double:1.71429, exact:1.00309e-36

If I change the code above like this:

#include <stdio.h>
#include <math.h>
int main(void)
{
    float x[60];
    x[0] = 1.f/3;
    x[1] = 1.f/12;
    for (int i = 2; i < 60; i++) {
        x[i] = 9.f/4*x[i-1]-1.f/2*x[i-2];
    }
    double y[60];
    y[0] = 1./3;
    y[1] = 1./12;
    for (int i = 2; i < 60; i++) {
        y[i] = 9./4*y[i-1]-1./2*y[i-2];
    }
    printf("single:%g, double:%g, exact:%g\n", x[59], y[59], pow(4,-59)/3);
    return 0;
}

where 'f' is added after constant floating-point number for calculation of x-array.

The output seems normal:

single:-9.2035e+08, double:1.71429, exact:1.00309e-36

My question is:

Why does the result of float data type equal to the exact result in the first situation?

What does the compiler do?

like image 615
zdfzr Avatar asked Mar 16 '19 04:03

zdfzr


1 Answers

float is not more precise than double, and your float computation has not given you the exact result of pow(4,-59)/3.

What's going on is that your recurrence is designed to take a tiny rounding error and amplify it every iteration. In exact math, each value should be exactly one quarter of the previous value, but if it's not exactly a quarter due to rounding error, the difference gets magnified on every step.

Since a quarter of a representable value is always representable (until you hit subnormal numbers and underflow issues), the recurrence has an additional property: if the computation is performed in a precision sufficiently in excess of the precision with which the results are stored, then rounding the results to lower precision for storage will round to exactly a quarter of the previous value. (The choice of 9./4 and 1./2 factors give the recurrence an even stronger version of this property, where the result is exactly a quarter of the old value even before rounding for storage.)


With doubles, with the compiler and compiler settings you're using, the rounding error occurs and gets amplified. With floats, the computations are performed in double precision, eliminating rounding error in the recurrence steps due to the properties described above, so there is nothing to amplify. If the computation for doubles had been performed at long double precision, the same thing would have happened.


Let's take a closer look at the exact values produced, by using the %a format specifier to print floating-point numbers in hexadecimal notation. That looks like 0x1.5555555555558p-6, where the part between 0x and p is a hexadecimal number, and the part after the p is a decimal number representing a power of two to multiply the hexadecimal number by. Here, 0x1.5555555555558p-6 represents 0x1.5555555555558 times 2^-6. %a format always prints the exact value of a float or double, unlike %g, which rounds.

We'll also show a third computation, storing results as doubles, but doing the math in long double precision.

Our altered program looks like this:

#include <stdio.h>
#include <math.h>
int main(void)
{
    float x[60];
    x[0] = 1./3;
    x[1] = 1./12;
    for (int i = 2; i < 60; i++) {
        x[i] = 9./4*x[i-1]-1./2*x[i-2];
    }
    double y[60];
    y[0] = 1./3;
    y[1] = 1./12;
    for (int i = 2; i < 60; i++) {
        y[i] = 9./4*y[i-1]-1./2*y[i-2];
    }
    double z[60];
    z[0] = 1./3;
    z[1] = 1./12;
    for (int i = 2; i < 60; i++) {
        z[i] = (long double) 9./4*z[i-1] - (long double) 1./2*z[i-2];
    }
    printf("float:%a, double:%a, double2:%a, formula:%a\n", x[59], y[59], z[59], pow(4,-59)/3);
    for (int i = 0; i < 60; i++) {
        printf("%d %a %a %a\n", i, x[i], y[i], z[i]);
    }
    return 0;
}

And here's the output. I was going to abridge this, but it turns out it's hard to do that without obscuring interesting parts of the pattern:

float:0x1.555556p-120, double:0x1.b6db6db6db6dap+0, double2:0x1.5555555555555p-120, formula:0x1.5555555555555p-120
0 0x1.555556p-2 0x1.5555555555555p-2 0x1.5555555555555p-2
1 0x1.555556p-4 0x1.5555555555555p-4 0x1.5555555555555p-4
2 0x1.555556p-6 0x1.5555555555558p-6 0x1.5555555555555p-6
3 0x1.555556p-8 0x1.555555555557p-8 0x1.5555555555555p-8
4 0x1.555556p-10 0x1.555555555563p-10 0x1.5555555555555p-10
5 0x1.555556p-12 0x1.5555555555c3p-12 0x1.5555555555555p-12
6 0x1.555556p-14 0x1.5555555558c3p-14 0x1.5555555555555p-14
7 0x1.555556p-16 0x1.5555555570c3p-16 0x1.5555555555555p-16
8 0x1.555556p-18 0x1.5555555630c3p-18 0x1.5555555555555p-18
9 0x1.555556p-20 0x1.5555555c30c3p-20 0x1.5555555555555p-20
10 0x1.555556p-22 0x1.5555558c30c3p-22 0x1.5555555555555p-22
11 0x1.555556p-24 0x1.5555570c30c3p-24 0x1.5555555555555p-24
12 0x1.555556p-26 0x1.5555630c30c3p-26 0x1.5555555555555p-26
13 0x1.555556p-28 0x1.5555c30c30c3p-28 0x1.5555555555555p-28
14 0x1.555556p-30 0x1.5558c30c30c3p-30 0x1.5555555555555p-30
15 0x1.555556p-32 0x1.5570c30c30c3p-32 0x1.5555555555555p-32
16 0x1.555556p-34 0x1.5630c30c30c3p-34 0x1.5555555555555p-34
17 0x1.555556p-36 0x1.5c30c30c30c3p-36 0x1.5555555555555p-36
18 0x1.555556p-38 0x1.8c30c30c30c3p-38 0x1.5555555555555p-38
19 0x1.555556p-40 0x1.8618618618618p-39 0x1.5555555555555p-40
20 0x1.555556p-42 0x1.e186186186186p-39 0x1.5555555555555p-42
21 0x1.555556p-44 0x1.bc30c30c30c3p-38 0x1.5555555555555p-44
22 0x1.555556p-46 0x1.b786186186185p-37 0x1.5555555555555p-46
23 0x1.555556p-48 0x1.b6f0c30c30c3p-36 0x1.5555555555555p-48
24 0x1.555556p-50 0x1.b6de186186185p-35 0x1.5555555555555p-50
25 0x1.555556p-52 0x1.b6dbc30c30c3p-34 0x1.5555555555555p-52
26 0x1.555556p-54 0x1.b6db786186185p-33 0x1.5555555555555p-54
27 0x1.555556p-56 0x1.b6db6f0c30c3p-32 0x1.5555555555555p-56
28 0x1.555556p-58 0x1.b6db6de186185p-31 0x1.5555555555555p-58
29 0x1.555556p-60 0x1.b6db6dbc30c3p-30 0x1.5555555555555p-60
30 0x1.555556p-62 0x1.b6db6db786185p-29 0x1.5555555555555p-62
31 0x1.555556p-64 0x1.b6db6db6f0c3p-28 0x1.5555555555555p-64
32 0x1.555556p-66 0x1.b6db6db6de185p-27 0x1.5555555555555p-66
33 0x1.555556p-68 0x1.b6db6db6dbc3p-26 0x1.5555555555555p-68
34 0x1.555556p-70 0x1.b6db6db6db785p-25 0x1.5555555555555p-70
35 0x1.555556p-72 0x1.b6db6db6db6fp-24 0x1.5555555555555p-72
36 0x1.555556p-74 0x1.b6db6db6db6ddp-23 0x1.5555555555555p-74
37 0x1.555556p-76 0x1.b6db6db6db6dbp-22 0x1.5555555555555p-76
38 0x1.555556p-78 0x1.b6db6db6db6dap-21 0x1.5555555555555p-78
39 0x1.555556p-80 0x1.b6db6db6db6dap-20 0x1.5555555555555p-80
40 0x1.555556p-82 0x1.b6db6db6db6dap-19 0x1.5555555555555p-82
41 0x1.555556p-84 0x1.b6db6db6db6dap-18 0x1.5555555555555p-84
42 0x1.555556p-86 0x1.b6db6db6db6dap-17 0x1.5555555555555p-86
43 0x1.555556p-88 0x1.b6db6db6db6dap-16 0x1.5555555555555p-88
44 0x1.555556p-90 0x1.b6db6db6db6dap-15 0x1.5555555555555p-90
45 0x1.555556p-92 0x1.b6db6db6db6dap-14 0x1.5555555555555p-92
46 0x1.555556p-94 0x1.b6db6db6db6dap-13 0x1.5555555555555p-94
47 0x1.555556p-96 0x1.b6db6db6db6dap-12 0x1.5555555555555p-96
48 0x1.555556p-98 0x1.b6db6db6db6dap-11 0x1.5555555555555p-98
49 0x1.555556p-100 0x1.b6db6db6db6dap-10 0x1.5555555555555p-100
50 0x1.555556p-102 0x1.b6db6db6db6dap-9 0x1.5555555555555p-102
51 0x1.555556p-104 0x1.b6db6db6db6dap-8 0x1.5555555555555p-104
52 0x1.555556p-106 0x1.b6db6db6db6dap-7 0x1.5555555555555p-106
53 0x1.555556p-108 0x1.b6db6db6db6dap-6 0x1.5555555555555p-108
54 0x1.555556p-110 0x1.b6db6db6db6dap-5 0x1.5555555555555p-110
55 0x1.555556p-112 0x1.b6db6db6db6dap-4 0x1.5555555555555p-112
56 0x1.555556p-114 0x1.b6db6db6db6dap-3 0x1.5555555555555p-114
57 0x1.555556p-116 0x1.b6db6db6db6dap-2 0x1.5555555555555p-116
58 0x1.555556p-118 0x1.b6db6db6db6dap-1 0x1.5555555555555p-118
59 0x1.555556p-120 0x1.b6db6db6db6dap+0 0x1.5555555555555p-120

Here, we see first that the float computation didn't produce the exact value the pow formula gave (it doesn't have enough precision for that), but it was close enough that the difference was hidden by %g's rounding. We also see that the float values are decreasing by exactly a factor of 4 each time, as are the values from the altered double computation. The double values from the original double version start out almost doing that and then diverge once the amplified error overwhelms the computation. The values eventually start increasing by a factor of 2 instead of decreasing by a factor of 4.

like image 200
user2357112 supports Monica Avatar answered Sep 29 '22 01:09

user2357112 supports Monica