how to aggregate counts in a bash one-liner

Tags: bash, unix, uniq
I often use sort | uniq -c to produce count statistics. Now, if I have two files containing such counts, I would like to combine them and add up the counts per key. (I know I could concatenate the original files and count those, but let's assume only the count files are accessible.)

For example given:

a.cnt:

   1 a
   2 c

b.cnt:

   2 b
   1 c

I would like to merge them and get the following output:

   1 a
   2 b
   3 c

What's the shortest way to do this in the shell?

Edit:

Thanks for the answers so far!

Some possible side-aspects one might want to consider additionally:

  • what if a, b, c are arbitrary strings containing arbitrary whitespace?
  • what if the files are too big to fit in memory? Is there some sort | uniq -c-style command line option for this case that only looks at two lines at a time?
asked Mar 13 '14 15:03 by benroth



2 Answers

This works for any number of files:

$ cat a.cnt b.cnt | awk '{a[$2]+=$1} END{for (i in a) print a[i],i}'
1 a
2 b
3 c

So if you have, say, 10 files, just run cat f1 f2 ... and pipe the result into this awk command.

If the file names happen to share a pattern, you can also do (thanks Adrian Frühwirth!):

awk '{a[$2]+=$1} END{for (i in a) print a[i],i}' *cnt

For example, this processes every file whose name ends in cnt.
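
One caveat: the iteration order of for (i in a) is unspecified in awk, so the lines can come out in any order. A minimal sketch that sorts on the key column to make the output reproducible, assuming single-word keys as in the question (the scratch directory and the merged variable are only for illustration):

```shell
# Recreate the two count files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '   1 a\n   2 c\n' > "$tmp/a.cnt"
printf '   2 b\n   1 c\n' > "$tmp/b.cnt"

# Sum the counts per key, then sort by the key (field 2) for a stable order.
merged=$(awk '{a[$2]+=$1} END{for (i in a) print a[i], i}' "$tmp"/*.cnt |
  LC_ALL=C sort -k2)

printf '%s\n' "$merged"
rm -rf "$tmp"
```

LC_ALL=C is used here only so that the sort order does not depend on the local collation rules.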


Regarding the side aspects raised in the question's edit:

  • what if a, b, c are arbitrary strings containing arbitrary whitespace?
  • what if the files are too big to fit in memory? Is there some sort | uniq -c-style command line option for this case that only looks at two lines at a time?

In that case, you can use everything after the first column as the index for the counter:

awk '{count=$1; $1=""; a[$0]+=count} END{for (i in a) print a[i],i}' *cnt

Note that you don't actually need to run sort | uniq -c, redirect to a cnt file, and then re-count. You can count the raw lines directly with something like this:

awk '{a[$0]++} END{for (i in a) print a[i], i}' file

Example

$ cat a.cnt
   1 and some
   2 text here

$ cat b.cnt
   4 and some
   4 and other things
   2 text here
   9 blabla

$ cat *cnt | awk '{count=$1; $1=""; a[$0]+=count} END{for (i in a) print a[i],i}'
4  text here
9  blabla
5  and some
4  and other things
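
The double space in that output comes from $1="": assigning to a field makes awk rebuild the line, which leaves the separator in front of the key and also collapses runs of whitespace inside it. If the whitespace in the keys matters, a possible variant strips the count with sub() instead. This is only a sketch, assuming the usual uniq -c layout (optional leading blanks, the count, then exactly one space); the sample data and the merged variable are made up for illustration:

```shell
tmp=$(mktemp -d)
# "text  here" (two spaces) and "text here" are meant to be different keys.
printf '   1 and some\n   2 text  here\n' > "$tmp/a.cnt"
printf '   4 and some\n   2 text here\n'  > "$tmp/b.cnt"

# Save the count, remove "blanks + count + one space", and use the untouched
# remainder of the line as the key.
merged=$(awk '{count=$1; sub(/^[ \t]*[0-9]+ /, ""); a[$0]+=count}
              END{for (i in a) print a[i], i}' "$tmp"/*.cnt |
  LC_ALL=C sort -k2)

printf '%s\n' "$merged"
rm -rf "$tmp"
```

With this input the two "text" keys stay separate, whereas the $1="" version would merge them into one.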

Regarding the second point (counting without sort | uniq -c):

$ cat b
and some
text here
and some
and other things
text here
blabla

$ awk '{a[$0]++} END{for (i in a) print a[i], i}' b
2 and some
2 text here
1 and other things
1 blabla
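
On the memory question: the awk arrays above hold one entry per distinct key, which may still be too much for very large inputs. A sketch of a streaming alternative, assuming single-word keys: sort the count lines by key first (sort can spill to temporary files), then add up adjacent lines so that awk only remembers the current key and its running total. The names tmp and merged are illustrative:

```shell
tmp=$(mktemp -d)
printf '   1 a\n   2 c\n' > "$tmp/a.cnt"
printf '   2 b\n   1 c\n' > "$tmp/b.cnt"

# After sorting by key, equal keys are adjacent, so one pass is enough.
merged=$(LC_ALL=C sort -k2 "$tmp/a.cnt" "$tmp/b.cnt" |
  awk '
    NR > 1 && $2 != key { print sum, key; sum = 0 }  # key changed: flush total
    { key = $2; sum += $1 }                          # accumulate current key
    END { if (NR > 0) print sum, key }               # flush the last key
  ')

printf '%s\n' "$merged"
rm -rf "$tmp"
```

If each count file is already sorted by key under the same locale, sort -m can merge them without a full re-sort.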
answered Sep 29 '22 04:09 by fedorqui 'SO stop harming'


Using awk:

awk 'FNR==NR{a[$2]=$1;next} {a[$2]+=$1} END{for (k in a) print a[k], k}' a.cnt b.cnt
1 a
2 b
3 c
answered Sep 29 '22 04:09 by anubhava