Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

bash command for group by count

Tags:

bash

shell

awk

I have a file in the following format

abc|1
def|2
abc|8
def|3
abc|5
xyz|3

I need to group by these words in the first column and sum the value of the second column. For instance, the output of this file should be

abc|14
def|5
xyz|3

Explanation: the corresponding values for word "abc" are 1, 8, and 5. By adding these numbers, the sum comes out to be 14 and the output becomes "abc|14". Similarly, for word "def", the corresponding values are 2 and 3. Summing up these, the final output comes out to be "def|5".

Thank you very much for the help :)

I tried the following command

awk -F "|" '{arr[$1]+=$2} END {for (i in arr) {print i"|"arr[i]}}' filename

another command which I found was

awk -F "," 'BEGIN { FS=OFS=SUBSEP=","}{arr[$1]+=$2 }END {for (i in arr) print i,arr[i]}' filename

Both didn't show me the intended results. Although I'm also in doubt of the working of these commands as well.

like image 729
Tushar Gandhi Avatar asked Mar 09 '23 07:03

Tushar Gandhi


2 Answers

Short GNU datamash solution:

datamash -s -t\| -g1 sum 2 < filename

The output:

abc|14
def|5
xyz|3

  • -t\| - field separator

  • -g1 - group by the 1st column

  • sum 2 - sum up values of the 2nd column

like image 121
RomanPerekhrest Avatar answered Mar 24 '23 16:03

RomanPerekhrest


I will just add an answer to fix the sorting issue you had, in your Awk logic, you don't need to use sort/uniq piped to the output of Awk, but process in Awk itself.

Referring to GNU Awk Using Predefined Array Scanning Orders with gawk, you can use the PROCINFO["sorted_in"] variable(gawk specific) to control how you want Awk to sort your final output.

Referring to the section below,

@ind_str_asc Order by indices in ascending order compared as strings; this is the most basic sort. (Internally, array indices are always strings, so with a[2*5] = 1 the index is 10 rather than numeric 10.)

So using this in your requirement in the END clause just do,

END{PROCINFO["sorted_in"]="@ind_str_asc"; for (i in unique) print i,unique[i]}

with your full command being,

awk '
    BEGIN{FS=OFS="|"}{
        unique[$1]+=$2; 
        next
    }
    END{
        PROCINFO["sorted_in"]="@ind_str_asc"; 
        for (i in unique) 
            print i,unique[i]
    }' file
like image 20
Inian Avatar answered Mar 24 '23 16:03

Inian