I am getting strange answer when I am performing sum on certain set of records. in one case i am not using the %d and in the next case i am using the %d
the first expression of sum of using %d
awk -F"|" '(NR > 0 && NR < 36) {sum +=$150} END {printf ("%d\n",sum)}' muar.txt
-|33
without %d
awk -F"|" '(NR > 0 && NR < 36) {sum +=$150} END {printf ("\n"sum)}' muar.txt
-|34
Why it is rounding to 33 from 34
Just to add more Info, till 34 row I am getting sum as 33.03 and the 35th row has value 0.97 so actually it should be 34 rather than 33
Additional Detail as per Comments for Testing -you can create a file let's a.txt having Only One Field. the first value is blank second one is 1.95 then 18 times 097 in a row, then 0.98 then 6 times 0.97 then 0.98 then 3 times 0.97 then 0.98 2 times then 2 times 0.97
Or You can have 1.95 - 1 time , 0.97 - 29 times, and 0.98 4 times all one below other in a row
The answer to your question is two fold:
awk
does some internal conversionOne of your examples was : 1.95 + 29*0.97 + 4*0.98. We can all agree that the sum of this value is 34 exactly. The little `awk program below, does the computation in two different ways leading to remarkable results :
awk 'BEGIN{sum1=1.95 + 29*0.97 + 4*0.98
sum2=1.95;
for(i=1;i<=29;i++){sum2+=0.97};
for(i=1;i<=4;i++) {sum2+=0.98};
printf "full precision : %25.16f%25.16f\n",sum1,sum2
printf "integer conversion : %25d%25d\n" ,sum1,sum2
printf "string conversion : "sum1" "sum2"\n"
}'
which leads to the following output (first column sum1
second column sum2
full precision : 34.0000000000000000 33.9999999999999787
integer conversion : 34 33
string conversion : 34 34
Why do the two sums have different results:
In essence, the 3 numbers 1.95
, 0.97
and 0.98
cannot be represented in a binary format. An approximation occurs which represents them as
1.95 ~ 1.94999999999999995559107901499...
0.97 ~ 0.96999999999999997335464740899...
0.98 ~ 0.97999999999999998223643160599...
when summing them as is done according to sum2
, the errors of the 33 additions grows and leads to the final result :
sum2 = 33.99999999999997868371792719699...
The error on sum1
is much smaller than sum2
as we only do 2 multiplications and 2 additions. In fact, the error evaporates to the correct result (i.e. the error is smaller the 10^-17
):
1.95 ~ 1.94999999999999995559107901499...
29*0.97 ~ 28.12999999999999900524016993586...
4*0.98 ~ 3.91999999999999992894572642399...
sum1 ~ 34.00000000000000000000000000000...
For a detailed understanding of the above, I refer to the obligatory article What Every Computer Scientist Should Know About Floating-Point Arithmetic
What is happening with the print statements?
awk
is essentially doing internal conversions:
printf "%d"
requests an integer, but it is served a float. awk
is receiving sum2
and converts it to an integer by removing the fractional part of the number, or you could imagine it feeds it trough int()
Thus 33.99999...
is converted to 33
.
printf ""sum2
, this is a conversion from a float to a string. Essentially by concatenating a string to a number, the number has to be converted in a string. If the number is a pure integer, it will just convert it as a pure integer. However, sum2
is a float.
The conversion of sum2
to a string is internally done with sprintf(CONVFMT,sum2)
where CONVFMT
is an awk built-in variable which is set to %.6g
. Thus sum2
is by default rounded to be represented with a maximum of 6 decimal digits. Hence ""sum2 -> "34"
.
Can we improve sum2
:
Yes! sum2
is nothing more than a representation of a sequence of numbers we want to add. It is not really practical to search for all the common terms first and the use multiplications as is done in sum1
. An improvement can be achieved using Kahan Summation. The idea behind it is to keep track of a compensation term representing the digits you lost.
The following program demonstrates it:
awk 'BEGIN{sum2=1.95;
for(i=1;i<=29;i++){sum2+=0.97};
for(i=1;i<=4;i++) {sum2+=0.98};
sum3=1.95; c=0
for(i=1;i<=29;i++) { y = 0.97 - c; t = sum3 + y; c = (t - sum3) - y; sum3 = t }
for(i=1;i<=4;i++) { y = 0.98 - c; t = sum3 + y; c = (t - sum3) - y; sum3 = t }
printf "full precision : %25.16f%25.16f\n",sum2,sum3
printf "integer conversion : %25d%25d\n" ,sum2,sum3
printf "string conversion : "sum2" "sum3"\n"
}'
which leads to the following output (first column sum2 second column sum3)
full precision : 33.9999999999999787 34.0000000000000000
integer conversion : 33 34
string conversion : 34 34
If you want to see the intermediate steps and difference between sum2
and sum3
you can check out the following code.
awk 'BEGIN{ sum2=sum3=1.95;c=0;
for(i=1;i<=29;i++) {
sum2+=0.97;
y = 0.97 - c; t = sum3 + y; c = (t - sum3) - y; sum3 = t;
printf "%25.16f%25.16f%25.16e\n", sum2,sum3,c
}
for(i=1;i<=4;i++) {
sum2+=0.98;
y = 0.98 - c; t = sum3 + y; c = (t - sum3) - y; sum3 = t;
printf "%25.16f%25.16f%25.16e\n", sum2,sum3,c
}
}'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With