
Getting the count of unique values in a column in bash

I have tab-delimited files with several columns. I want to count how often each distinct value occurs in a given column across all the files in a folder, and sort the results in decreasing order of count (highest count first). How would I accomplish this in a Linux command-line environment?

The solution can use any common command-line language, such as awk, Perl, or Python.

sfactor asked Feb 07 '11 13:02



3 Answers

To see a frequency count for column two (for example):

awk -F '\t' '{print $2}' * | sort | uniq -c | sort -nr

fileA.txt

z    z    a
a    b    c
w    d    e

fileB.txt

t    r    e
z    d    a
a    g    c

fileC.txt

z    r    a
v    d    c
a    m    c

Result:

  3 d
  2 r
  1 z
  1 m
  1 g
  1 b
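
To try this without touching real data, the sample files can be recreated with real tab characters in a scratch directory. This is only a sketch: the workdir and result names are illustrative, and LC_ALL=C is an added assumption that keeps the ordering of tied counts independent of the locale. It should reproduce the counts listed under Result.

```shell
# Recreate the three sample files with real tab separators in a scratch
# directory (created by mktemp; the variable names are illustrative).
workdir=$(mktemp -d)
printf 'z\tz\ta\na\tb\tc\nw\td\te\n' > "$workdir/fileA.txt"
printf 't\tr\te\nz\td\ta\na\tg\tc\n' > "$workdir/fileB.txt"
printf 'z\tr\ta\nv\td\tc\na\tm\tc\n' > "$workdir/fileC.txt"

# Same pipeline as above, run against every file in that directory.
# LC_ALL=C is an addition that makes the sort order locale-independent.
result=$(awk -F '\t' '{print $2}' "$workdir"/* \
    | LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -nr)
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -rf "$workdir"
```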
Dennis Williamson answered Oct 16 '22 14:10


Here is a way to do it in the shell:

FIELD=2
cut -f "$FIELD" * | sort | uniq -c | sort -nr

This is the sort of thing bash is great at.
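
If the pipeline is needed more than once, it could be wrapped in a small function that takes the column number and the file list as arguments. This is a minimal sketch; count_field is a hypothetical name and not part of the answer above.

```shell
# count_field FIELD FILE...
# Prints "count value" lines for one tab-separated column, highest count first.
# (count_field is an illustrative name.)
count_field() {
    local field=$1
    shift
    # cut splits on tabs by default; "--" stops option parsing so that file
    # names beginning with "-" are treated as files.
    cut -f "$field" -- "$@" | sort | uniq -c | sort -nr
}

# Example: column 2 of everything matched by ./* in the current directory.
# count_field 2 ./*
```

Passing the files as arguments rather than hard-coding * makes it easy to restrict the count to a pattern such as ./*.txt.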

Thedward answered Oct 16 '22 13:10


The GNU site suggests this nice awk script, which prints both the words and their frequency.

Possible changes:

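  • To see the result in descending order, swap word and freq[word] in the printf so that the count comes first, then pipe the output through sort -nr.
  • If you want a specific column, you can omit the for loop and simply write freq[$3]++ (replace 3 with the column number). Note the $: freq[3]++ would only count lines, because it indexes by the literal number 3 rather than by the field's value.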

Here goes:

 # wordfreq.awk --- print list of word frequencies

 {
     $0 = tolower($0)    # remove case distinctions
     # remove punctuation
     gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
     for (i = 1; i <= NF; i++)
         freq[$i]++
 }

 END {
     for (word in freq)
         printf "%s\t%d\n", word, freq[word]
 }
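
With both changes from the list above applied, a column-specific version might look like the following. This is a sketch under some assumptions: colfreq is a hypothetical name, fields are split on tabs with -F '\t' to match the question, and the lowercasing and punctuation stripping of the original script are dropped so that values are counted exactly as they appear.

```shell
# colfreq COLUMN FILE...
# Counts the values in one tab-separated column across the given files and
# prints "count<TAB>value", highest count first. (colfreq is an illustrative name.)
colfreq() {
    local col=$1
    shift
    awk -F '\t' -v col="$col" '
        { freq[$col]++ }    # index by the field value, not by the column number
        END {
            for (value in freq)
                printf "%d\t%s\n", freq[value], value
        }
    ' "$@" | sort -nr
}

# Example: frequency of the values in column 2 of all .txt files.
# colfreq 2 ./*.txt
```

Because awk keeps the counts in an array, only the distinct values need to be sorted at the end rather than every input line.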
Adam Matan answered Oct 16 '22 13:10