
Average of column by hours (rows) using awk

Tags: unix, awk

I have the following rows in a file, and I want to get the average of the 3rd column by hour.

2010-10-28 12:02:36: 5.1721851 secs
2010-10-28 12:03:43: 4.4692638 secs
2010-10-28 12:04:51: 3.3770310 secs
2010-10-28 12:05:58: 4.6227063 secs
2010-10-28 12:07:08: 5.1650404 secs
2010-10-28 12:08:16: 3.2819025 secs

2010-10-28 13:01:36: 2.1721851 secs
2010-10-28 13:02:43: 3.4692638 secs
2010-10-28 13:03:51: 4.3770310 secs
2010-10-28 13:04:58: 3.6227063 secs
2010-10-28 13:05:08: 3.1650404 secs
2010-10-28 13:06:16: 4.2819025 secs

2010-10-28 14:12:36: 7.1721851 secs
2010-10-28 14:23:43: 7.4692638 secs
2010-10-28 14:24:51: 7.3770310 secs
2010-10-28 14:25:58: 9.6227063 secs
2010-10-28 14:37:08: 7.1650404 secs
2010-10-28 14:48:16: 7.2819025 secs

I have done

cat filename | awk '{sum+=$3} END {print "Average = ",sum/NR}'

with the output

Average =  4.49154

to get the average for the entire file, but I want to break the average down by hour. I could sneak in a grep for the hour before piping the output to awk, but I'd like to do it with a one-liner if possible.

Ideally, the output would be something like

Average 12:00 = _computed_avg_
Average 13:00 = _computed_avg_
Average 14:00 = _computed_avg_

and so on.

Not necessarily looking for an answer, but hoping to be pointed in the right direction.

KM. asked Oct 28 '10 19:10


1 Answer

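I would set the field delimiter to a colon, so that $1 is the date plus hour (e.g. "2010-10-28 12") and $4 is the value, then accumulate a sum and a count per key in associative arrays, and finally compute the averages (the NF == 4 test skips blank lines):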

gawk -F: 'NF == 4 { sum[$1] += $4; N[$1]++ } 
          END     { for (key in sum) {
                        avg = sum[key] / N[key];
                        printf "%s %f\n", key, avg;
                    } }' filename | sort

On your test data, this gives:

2010-10-28 12 4.348022
2010-10-28 13 3.514688
2010-10-28 14 7.681355

This should produce the correct answer even if the data is not in time order (say you concatenate two log files out of sequence). Note that awk converts a string such as ' 3.123 secs' to a number by taking its leading numeric prefix, so the values are summed numerically despite the trailing 'secs'. The final sort is needed because the for (key in sum) loop visits the keys in an unspecified order; piping through sort presents the averages in time sequence.
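If you want the output in the "Average 12:00 = ..." form from the question, a variation along these lines might work. It is only a sketch: it keys on the hour alone, so it assumes the file covers a single day (hours from different dates would be merged). The temporary file, the variable names (tmp, result, part, h, n) and the shortened sample data are placeholders for illustration. It calls plain awk because nothing here appears to need gawk-specific features.

```shell
# Hypothetical sample file holding a few rows from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
2010-10-28 12:02:36: 5.1721851 secs
2010-10-28 12:03:43: 4.4692638 secs
2010-10-28 12:04:51: 3.3770310 secs

2010-10-28 13:01:36: 2.1721851 secs
2010-10-28 13:02:43: 3.4692638 secs
2010-10-28 13:03:51: 4.3770310 secs
EOF

# $1 is "date hour"; split it on the space and keep only the hour as the key.
# Assumption: one day per file, so the hour alone identifies a group.
result=$(awk -F: 'NF == 4 { split($1, part, " "); h = part[2]
                            sum[h] += $4; n[h]++ }
                  END     { for (h in sum)
                                printf "Average %s:00 = %f\n", h, sum[h] / n[h]
                          }' "$tmp" | sort)

printf '%s\n' "$result"
rm -f "$tmp"
```

If the log spans several days, keeping $1 as the key (as in the command above) and printing the date alongside the hour avoids mixing days together.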

Jonathan Leffler answered Sep 29 '22 10:09