I'd like to check if two rows start with the same number in the 1st column, if that happens, then the average of the 2nd column should be displayed. File example:
01  21    6    10%       93.3333%
01  22    50   83.3333%  93.3333%
02  20.5  23   18.1102%  96.8504%
02  21.5  100  78.7402%  96.8504%
03  22.2  0    0%        100%
03  21.2  29   100%      100%
04  22.5  1    5.55556%  100%
04  23.5  17   94.4444%  100%
05  22.7  9    7.82609%  100%
05  21.7  106  92.1739%  100%
06  23    11   17.4603%  96.8254%
06  22    50   79.3651%  96.8254%
07  20.5  14   18.6667%  96%
07  21.5  58   77.3333%  96%
08  21.8  4    100%      100%
09  22.6  0    0%        100%
09  21.6  22   100%      100%
For instance, the two first lines start with 01, but there is only one line starting with 08 (15th line). Therefore, the output based on these two cases should be:
01 21.5
...
...
...
08 21.8
...
...
...
I ended up with the following awk line, which works great when the file always has two similar lines, but it fails using the file shown above (because the 15th line):
awk '{sum+=$2} (NR%2)==0{print sum/2; sum=0;}'
Any hint is welcomed,
This awk should work:
awk 'function dump(){if (n>0) printf "%s%s%.2f\n", p, OFS, sum/n}
     NR>1 && $1 != p{dump(); sum=n=0} {p=$1; sum+=$2; n++} END{dump()}' file
01 21.5
02 21.0
03 21.7
04 23.0
05 22.2
06 22.5
07 21.0
08 21.8
09 22.1
Explanation: We are using 3 variables:
p -> to hold previous row's $1 value
n -> count of similar $1 values
sum -> is sum of $2 values for similar $1 rows
How it works:
NR>1 && $1 != p     # when row #1 > 1 and prev $1 is not current $1
dump()              # function is to print formatted value of $1 and average
p=$1; sum+=$2; n++  # sets p to $1, adds current $2 to sum and increments n
Using GNU awk
gawk '
    {sum[$1]+=$2; n[$1]++} 
    END {
        PROCINFO["sorted_in"] = "@ind_num_asc"
        for (key in sum) print key, sum[key]/n[key]
    }
' file
01 21.5
02 21
03 21.7
04 23
05 22.2
06 22.5
07 21
08 21.8
09 22.1
The "PROCINFO" line makes the array traversal sorted my index numerically. Otherwise the output would appear random.
awk with piped sort
awk '{s[$1]+=$2;c[$1]++} END{for(i in s) print i, s[i]/c[i]}' file | sort
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With