I have tab delimited files with several columns. I want to count the frequency of occurrence of the different values in a column for all the files in a folder and sort them in decreasing order of count (highest count first). How would I accomplish this in a Linux command line environment?
It can use any common command line language like awk, perl, python etc.
Wc Command in Linux (Count Number of Lines, Words, and Characters) On Linux and Unix-like operating systems, the wc command allows you to count the number of lines, words, characters, and bytes of each given file or standard input and print the result.
To see a frequency count for column two (for example):
awk -F '\t' '{print $2}' * | sort | uniq -c | sort -nr
fileA.txt
z z a
a b c
w d e
fileB.txt
t r e
z d a
a g c
fileC.txt
z r a
v d c
a m c
Result:
3 d
2 r
1 z
1 m
1 g
1 b
Here is a way to do it in the shell:
FIELD=2
cut -f $FIELD * | sort| uniq -c |sort -nr
This is the sort of thing bash is great at.
The GNU site suggests this nice awk script, which prints both the words and their frequency.
Possible changes:
sort -nr
(and reverse word
and freq[word]
) to see the result in descending order.freq[3]++
- replace 3 with the column number.Here goes:
# wordfreq.awk --- print list of word frequencies
{
$0 = tolower($0) # remove case distinctions
# remove punctuation
gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
for (i = 1; i <= NF; i++)
freq[$i]++
}
END {
for (word in freq)
printf "%s\t%d\n", word, freq[word]
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With