I have a file 'file.dat' containing 24 rows x 16 columns of data.
I have already tested the following awk script, which computes the average of each column.
touch aver-std.dat
awk '{ for (i=1; i<=NF; i++) { sum[i]+= $i } }
END { for (i=1; i<=NF; i++ )
{ printf "%f \n", sum[i]/NR} }' file.dat >> aver-std.dat
The output 'aver-std.dat' has one column with these averages.
In the same way, I would like to compute the standard deviation of each column of 'file.dat' and write it in a second column of the output file, so that the output has the average in the first column and the standard deviation in the second.
I have tried several variants, such as this one:
touch aver-std.dat
awk '{ for (i=1; i<=NF; i++) { sum[i]+= $i }}
END { for (i=1; i<=NF; i++ )
{std[i] += ($i - sum[i])^2 ; printf "%f %f \n", sum[i]/NR, sqrt(std[i]/(NR-1))}}' file.dat >> aver-std.dat
It writes values in the second column, but they are not the correct standard deviations, so something in the computation is wrong. I would very much appreciate any help. Regards
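Your attempt fails for two reasons: in the END block, $i refers only to the fields of the last record read, and the code subtracts the sum rather than the mean. The population standard deviation is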
stdev = sqrt((1/N)*(sum of (value - mean)^2))
But there is another form of the formula which does not require you to know the mean beforehand. It is:
stdev = sqrt((1/N)*((sum of squares) - (((sum)^2)/N)))
(A quick web search for "sum of squares" formula for standard deviation will give you the derivation if you are interested)
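For readers who would rather not search, the derivation is short. This is a sketch using symbols that are not in the original text: x_k for the values, N for their count, and \mu for their mean.

```latex
\begin{aligned}
\sigma^2 &= \frac{1}{N}\sum_{k=1}^{N}(x_k-\mu)^2 \\
         &= \frac{1}{N}\left(\sum_{k=1}^{N}x_k^2 \;-\; 2\mu\sum_{k=1}^{N}x_k \;+\; N\mu^2\right) \\
         &= \frac{1}{N}\left(\sum_{k=1}^{N}x_k^2 \;-\; \frac{\left(\sum_{k=1}^{N}x_k\right)^2}{N}\right),
\qquad \text{since } \mu=\frac{1}{N}\sum_{k=1}^{N}x_k .
\end{aligned}
```

Taking the square root of the last line gives the "sum of squares" form of stdev.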
To use this formula, you need to keep track of both the sum and the sum of squares of the values. So your awk script will change to:
awk '{for(i=1;i<=NF;i++) {sum[i] += $i; sumsq[i] += ($i)^2}}
END {for (i=1;i<=NF;i++) {
printf "%f %f \n", sum[i]/NR, sqrt((sumsq[i]-sum[i]^2/NR)/NR)}
}' file.dat >> aver-std.dat
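The script above divides by NR, which gives the population standard deviation, while the attempt in the question divided by NR-1. If the sample standard deviation is what is wanted, a minimal sketch could look like the following. The 3x2 data file, the variable names (tmp, result, n, mean, var) and the guard against a slightly negative variance are illustrative assumptions, not part of the original script.

```shell
# Hypothetical 3-row x 2-column data file, used only for illustration
tmp=$(mktemp)
printf '1 10\n2 20\n3 30\n' > "$tmp"

# One pass: accumulate the sum and the sum of squares per column,
# then divide by NR-1 to get the sample standard deviation.
result=$(awk '{
    if (NF > n) n = NF                      # remember the column count
    for (i = 1; i <= NF; i++) { sum[i] += $i; sumsq[i] += $i * $i }
}
END {
    for (i = 1; i <= n; i++) {
        mean = sum[i] / NR
        var = (sumsq[i] - NR * mean * mean) / (NR - 1)
        if (var < 0) var = 0                # rounding can push this just below zero
        printf "%f %f\n", mean, sqrt(var)
    }
}' "$tmp")
rm -f "$tmp"

echo "$result"
```

For this toy input the two output lines are the mean and sample standard deviation of each column (2 and 1 for the first column, 20 and 10 for the second). With real data, replace the temporary file with file.dat and redirect the output to aver-std.dat as before.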
To simply calculate the population standard deviation of a list of numbers, you can use a command like this:
awk '{x+=$0;y+=$0^2}END{print sqrt(y/NR-(x/NR)^2)}'
Or this calculates the sample standard deviation:
awk '{sum+=$0;a[NR]=$0}END{for(i in a)y+=(a[i]-(sum/NR))^2;print sqrt(y/(NR-1))}'
^ is in POSIX. ** is supported by gawk and nawk but not by mawk.
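As a quick sanity check, the two one-liners above can be run on a small list whose answer is easy to verify by hand. The eight values below are made-up test data with mean 5, and the variable names (data, pop, samp) are only for illustration.

```shell
# Hypothetical test data: eight values with mean 5
data=$(printf '%s\n' 2 4 4 4 5 5 7 9)

# Population standard deviation (divides by N)
pop=$(printf '%s\n' "$data" | awk '{x+=$0;y+=$0^2}END{print sqrt(y/NR-(x/NR)^2)}')

# Sample standard deviation (divides by N-1)
samp=$(printf '%s\n' "$data" | awk '{sum+=$0;a[NR]=$0}END{for(i in a)y+=(a[i]-(sum/NR))^2;print sqrt(y/(NR-1))}')

echo "population: $pop  sample: $samp"
```

The squared deviations sum to 32, so the population value is sqrt(32/8) = 2 and the sample value is sqrt(32/7), roughly 2.138.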
Here are some calculations I made on a Grinder data output file for a long soak test that had to be interrupted:
Standard deviation (biased) + average:
cat <grinder_data_file> | grep -v "1$" | awk -F ', ' '{ sum=sum+$5 ; sumX2+=(($5)^2)} END { printf "Average: %f. Standard Deviation: %f \n", sum/NR, sqrt(sumX2/(NR) - ((sum/NR)^2) )}'
Standard deviation (non-biased) + average:
cat <grinder_data_file> | grep -v "1$" | awk -F ', ' '{ sum=sum+$5 ; sumX2+=(($5)^2)} END { avg=sum/NR; printf "Average: %f. Standard Deviation: %f \n", avg, sqrt(sumX2/(NR-1) - 2*avg*(sum/(NR-1)) + ((NR*(avg^2))/(NR-1)))}'
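The non-biased expression above simplifies algebraically to (sumX2 - N*avg^2)/(N-1), and the filtering can be done inside awk instead of with cat and grep. The following is a sketch of that idea under stated assumptions: the input rows are invented (the real Grinder columns are not shown here), and n, tmp and stats are hypothetical names.

```shell
# Hypothetical input in the same ', '-separated shape; these rows are made up.
tmp=$(mktemp)
printf '%s\n' \
    '0, 0, 1, 1000, 10, 0' \
    '0, 1, 1, 2000, 20, 0' \
    '0, 2, 1, 3000, 999, 1' \
    '0, 3, 1, 4000, 30, 0' > "$tmp"

# Skip lines ending in "1" inside awk and count the kept lines in n,
# so NR (which would include the skipped lines) is not used.
stats=$(awk -F ', ' '!/1$/ { n++; sum += $5; sumX2 += $5 * $5 }
END {
    if (n < 2) { print "need at least two values"; exit 1 }
    avg = sum / n
    printf "Average: %f. Standard Deviation: %f\n", avg, sqrt((sumX2 - n * avg * avg) / (n - 1))
}' "$tmp")
rm -f "$tmp"

echo "$stats"
```

With the three kept rows (10, 20, 30) the average is 20 and the sample standard deviation is 10. Using a separate counter instead of NR matters here because awk now sees the lines that grep used to remove.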