
Which word occurs most frequently in a text file?

Tags: bash, shell

I have a text file with one word on every line:

"word1"
"word1"
"word2"
"word2"
"word1"

I'd like to find out which word occurs most often, but I have no idea how to do that. Any ideas?

imcsi97 asked Feb 06 '23 16:02


2 Answers

Note: See bottom for case-insensitive solutions.

A combination of sort, uniq, head, and cut calls is conceptually simplest and also extensible, but here's a single-pass awk solution that is probably more efficient. It is more complex, though, finds only the "winner", and resolves ties based on input order rather than alphabetically:

awk '{ if (++words[$0] > max) { max = words[$0]; maxW=$0 } } END { print maxW }' file

With the sample input, this returns "word1" (including the double quotes).
Use print max, maxW to also output the count.

In the event of a tie, the word that reaches the max. count first "wins" (is output); that is, among the words that share the max. count, it is the one whose last occurrence comes earliest in the input file.
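To see the tie-breaking rule in action, here is a minimal sketch that wraps the awk command in a shell function and feeds it a small tied input. The function name most_frequent and the sample words are made up for illustration; they are not part of the original answer.

```shell
# Hypothetical helper: reads lines from stdin and prints "<count> <line>"
# for the line that is first to reach the highest count.
most_frequent() {
  awk '{ if (++count[$0] > max) { max = count[$0]; winner = $0 } }
       END { print max, winner }'
}

# "a" and "b" both occur twice, but "b" gets to 2 first, so it wins.
printf '%s\n' a b b a | most_frequent
```

Swapping the last two input lines (a b a b) makes "a" reach the count of 2 first, so "a" would be reported instead.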


Here's the multi-utility equivalent, which allows extending the solution to the top N words and also offers predictable ordering among the winners in the event of a tie:

$ sort file | uniq -c | sort -k1,1nr -k2b | head -n 1 | cut -d\" -f2
word1

In the event of a tie, the alphabetically first word among the ones that share the max. count is printed.

Note: For convenience, the above uses cut to extract the word without the enclosing double quotes.

To preserve the double quotes, use awk instead of cut:

$ sort file | uniq -c | sort -k1,1nr -k2b | head -n 1 | awk '{print $NF}'
"word1"

Omitting the last pipeline segment and modifying head's -n 1 option allows you to see how many occurrences of each word were found and to find the top N words (including double quotes); e.g., to see the top 10 (with the sample input, you only get 2):

$ sort file | uniq -c | sort -k1,1nr -k2b | head -n 10
   3 "word1"
   2 "word2"
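If the top-N lookup is needed repeatedly, the pipeline could be wrapped in a small function. This is only a sketch: the name top_words, reading from stdin, and the default of 10 are assumptions made for illustration.

```shell
# Hypothetical helper: reads lines from stdin and prints the N most frequent
# ones, each prefixed with its count (N defaults to 10).
top_words() {
  sort | uniq -c | sort -k1,1nr -k2b | head -n "${1:-10}"
}

# The sample input from the question; only the top entry is requested.
printf '%s\n' '"word1"' '"word1"' '"word2"' '"word2"' '"word1"' | top_words 1
```

To process a file, redirect it to stdin: top_words 3 < file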

A note on the sort call, sort -k1,1nr -k2b:

Explicitly stating the sort fields is good practice - both for efficiency and to avoid unexpected results:

  • -k1,1nr sorts primarily by the 1st whitespace-separated field (-k1,1), numerically (n), in reverse order (r).

    • Note the explicit end index in -k1,1, as just -k1 would sort everything starting from field 1 through the end of the line.
  • -k2b then sorts secondarily starting with the 2nd whitespace-separated field through the end of the line (-k2), ignoring leading whitespace (b; the whitespace that separates the fields) and performing lexical (alphabetic) sorting.

Newer versions of GNU sort (not the one on macOS, unfortunately) have a helpful --debug option that visualizes how each line is broken into keys during sorting.


Using just sort -r or sort -nr to sort the whole line is tempting, but doesn't necessarily yield the expected results:

  • Just sort -r sorts the whole line lexically (alphabetically), in descending order; due to the padded fixed-width nature of the word counts in the 1st field, the results are still effectively numerically sorted, but in the event of a tie it is the alphabetically last word that is output.

  • Just sort -rn applies numerical sorting to the whole line, in descending order. Numerical comparison only considers the longest prefix of the line that can be interpreted as a number; for lines with equal counts, an implicit feature called last-resort comparison (which can be turned off with -s) then compares the whole lines alphabetically (in reverse order, in this case). It is therefore also the alphabetically last word that is output in the event of a tie.
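The difference in tie-breaking can be checked with a small experiment. The following is a sketch with made-up words; LC_ALL=C is an assumption added here so that the alphabetic comparison does not depend on the current locale.

```shell
# "apple" and "pear" tie at 2 occurrences; "fig" occurs once.
counts=$(printf '%s\n' pear apple fig apple pear | LC_ALL=C sort | uniq -c)

# Explicit keys: count descending, then word ascending.
# The alphabetically first tied word ("apple") comes out on top.
printf '%s\n' "$counts" | LC_ALL=C sort -k1,1nr -k2b | head -n 1

# Whole-line variants: the reversal also applies to the tie-break,
# so the alphabetically last tied word ("pear") comes out on top.
printf '%s\n' "$counts" | LC_ALL=C sort -rn | head -n 1
printf '%s\n' "$counts" | LC_ALL=C sort -r  | head -n 1
```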


Case-insensitive variants:

Note that the input is transformed to all-lowercase for simplicity.

  • awk
awk '{ $0=tolower($0); if (++wds[$0] > max) { max = wds[$0]; maxW=$0 } } END { print maxW }' file
  • sort + uniq + head + cut
tr '[:upper:]' '[:lower:]' < file |
  sort | uniq -c | sort -k1,1nr -k2b | head -n 1 | cut -d\" -f2
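As a quick check of the case-insensitive pipeline, here is a sketch that runs it on made-up input in which occurrences differ only in letter case. The function name most_frequent_ci is hypothetical, and cut is left out because this input has no double quotes.

```shell
# Hypothetical helper: lowercases stdin, then prints the most frequent line
# together with its count.
most_frequent_ci() {
  tr '[:upper:]' '[:lower:]' |
    sort | uniq -c | sort -k1,1nr -k2b | head -n 1
}

# "Foo", "foo" and "FOO" are counted as the same word (3 occurrences).
printf '%s\n' Foo foo BAR bar FOO | most_frequent_ci
```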
mklement0 answered Feb 08 '23 06:02


Try something like this: cat test | sort | uniq -c

  • cat reads the file
  • sort sorts the lines, so that identical lines end up next to each other for the uniq command
  • uniq with -c "prefix lines by the number of occurrences"
Bela Vizer answered Feb 08 '23 05:02