Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

use of %d is giving strange rounding values in Awk program

I am getting strange answer when I am performing sum on certain set of records. in one case i am not using the %d and in the next case i am using the %d

the first expression of sum of using %d

 awk -F"|" '(NR > 0 && NR < 36) {sum +=$150} END {printf ("%d\n",sum)}' muar.txt
-|33

without %d

 awk -F"|" '(NR > 0 && NR < 36) {sum +=$150} END {printf ("\n"sum)}' muar.txt
-|34

Why it is rounding to 33 from 34

Just to add more Info, till 34 row I am getting sum as 33.03 and the 35th row has value 0.97 so actually it should be 34 rather than 33

Additional Detail as per Comments for Testing -you can create a file let's a.txt having Only One Field. the first value is blank second one is 1.95 then 18 times 097 in a row, then 0.98 then 6 times 0.97 then 0.98 then 3 times 0.97 then 0.98 2 times then 2 times 0.97

Or You can have 1.95 - 1 time , 0.97 - 29 times, and 0.98 4 times all one below other in a row

like image 466
kunal .Gaurav Avatar asked Jan 29 '23 16:01

kunal .Gaurav


1 Answers

The answer to your question is two fold:

  • There is a numeric problem
  • awk does some internal conversion

One of your examples was : 1.95 + 29*0.97 + 4*0.98. We can all agree that the sum of this value is 34 exactly. The little `awk program below, does the computation in two different ways leading to remarkable results :

awk 'BEGIN{sum1=1.95 + 29*0.97 + 4*0.98
           sum2=1.95;
           for(i=1;i<=29;i++){sum2+=0.97};
           for(i=1;i<=4;i++) {sum2+=0.98};

           printf "full precision     : %25.16f%25.16f\n",sum1,sum2
           printf "integer conversion : %25d%25d\n"      ,sum1,sum2
           printf "string conversion  : "sum1" "sum2"\n"
}'

which leads to the following output (first column sum1 second column sum2

full precision     :       34.0000000000000000      33.9999999999999787
integer conversion :                        34                       33
string conversion  : 34 34

Why do the two sums have different results:

In essence, the 3 numbers 1.95, 0.97 and 0.98 cannot be represented in a binary format. An approximation occurs which represents them as

1.95 ~ 1.94999999999999995559107901499...
0.97 ~ 0.96999999999999997335464740899...
0.98 ~ 0.97999999999999998223643160599...

when summing them as is done according to sum2, the errors of the 33 additions grows and leads to the final result :

sum2 = 33.99999999999997868371792719699...

The error on sum1 is much smaller than sum2 as we only do 2 multiplications and 2 additions. In fact, the error evaporates to the correct result (i.e. the error is smaller the 10^-17):

   1.95 ~  1.94999999999999995559107901499...
29*0.97 ~ 28.12999999999999900524016993586...
 4*0.98 ~  3.91999999999999992894572642399...
   sum1 ~ 34.00000000000000000000000000000...

For a detailed understanding of the above, I refer to the obligatory article What Every Computer Scientist Should Know About Floating-Point Arithmetic

What is happening with the print statements?

awk is essentially doing internal conversions:

  • printf "%d" requests an integer, but it is served a float. awk is receiving sum2 and converts it to an integer by removing the fractional part of the number, or you could imagine it feeds it trough int() Thus 33.99999... is converted to 33.

  • printf ""sum2, this is a conversion from a float to a string. Essentially by concatenating a string to a number, the number has to be converted in a string. If the number is a pure integer, it will just convert it as a pure integer. However, sum2 is a float.

    The conversion of sum2 to a string is internally done with sprintf(CONVFMT,sum2) where CONVFMT is an awk built-in variable which is set to %.6g. Thus sum2 is by default rounded to be represented with a maximum of 6 decimal digits. Hence ""sum2 -> "34".

Can we improve sum2:

Yes! sum2 is nothing more than a representation of a sequence of numbers we want to add. It is not really practical to search for all the common terms first and the use multiplications as is done in sum1. An improvement can be achieved using Kahan Summation. The idea behind it is to keep track of a compensation term representing the digits you lost.

The following program demonstrates it:

awk 'BEGIN{sum2=1.95;
           for(i=1;i<=29;i++){sum2+=0.97};
           for(i=1;i<=4;i++) {sum2+=0.98};
           sum3=1.95; c=0
           for(i=1;i<=29;i++) { y = 0.97 - c; t = sum3 + y; c = (t - sum3) - y; sum3 = t }
           for(i=1;i<=4;i++)  { y = 0.98 - c; t = sum3 + y; c = (t - sum3) - y; sum3 = t }

           printf "full precision     : %25.16f%25.16f\n",sum2,sum3
           printf "integer conversion : %25d%25d\n"      ,sum2,sum3
           printf "string conversion  : "sum2" "sum3"\n"
}'

which leads to the following output (first column sum2 second column sum3)

full precision     :       33.9999999999999787      34.0000000000000000
integer conversion :                        33                       34
string conversion  : 34 34

If you want to see the intermediate steps and difference between sum2 and sum3 you can check out the following code.

 awk 'BEGIN{ sum2=sum3=1.95;c=0;
             for(i=1;i<=29;i++) {
                sum2+=0.97;
                y = 0.97 - c; t = sum3 + y; c = (t - sum3) - y; sum3 = t;
                printf "%25.16f%25.16f%25.16e\n", sum2,sum3,c
             }
             for(i=1;i<=4;i++) {
                sum2+=0.98;
                y = 0.98 - c; t = sum3 + y; c = (t - sum3) - y; sum3 = t;
                printf "%25.16f%25.16f%25.16e\n", sum2,sum3,c
             }
      }'
like image 162
kvantour Avatar answered Apr 09 '23 10:04

kvantour