Which word occurs most frequently in a text file?

Question

There's a text file with one word on every line:

"word1"
"word1"
"word2"
"word2"
"word1"

I'd like to find out which word occurs most often, but I have no idea how to do that. Any ideas?

mklement0 · Accepted Answer

Note: See the bottom for case-insensitive solutions.

A combination of sort, uniq, head, and cut calls is conceptually simplest and also extensible, but here's a single-pass awk solution that is probably more efficient. It is more complex, however, limited to finding only the "winner", and gives you no control over which word wins in the event of a tie:

awk '{ if (++words[$0] > max) { max = words[$0]; maxW=$0 } } END { print maxW }' file

With the sample input, this returns "word1" (including the double quotes).
Use print max, maxW to also output the count.

In the event of a tie, the word that reaches the max. count first "wins" (is output); that is, among the words that share the max. count, it is the one whose last occurrence comes first in the input file.
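To see this tie-breaking behavior in action, here is a minimal sketch that feeds the same awk program a small made-up input in which two words tie. The function name most_frequent and the sample words a and b are illustrative assumptions, not part of the original command.

```shell
# Hypothetical helper wrapping the awk command from above; reads stdin.
most_frequent() {
  awk '{ if (++words[$0] > max) { max = words[$0]; maxW = $0 } } END { print maxW }'
}

# "a" and "b" both occur twice. The 2nd (last) occurrence of "b" is on
# line 3, that of "a" on line 4, so "b" reaches the max. count first.
printf '%s\n' a b b a | most_frequent    # prints: b
```

Swapping the input order to b a a b should make "a" the winner instead, which is why the result depends on where the words happen to sit in the file.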

Here's the multi-utility equivalent, which allows extending the solution to the top N words and also offers predictable ordering among the winners in the event of a tie:

$ sort file | uniq -c | sort -k1,1nr -k2b | head -n 1 | cut -d\" -f2
word1

In the event of a tie, the alphabetically first word among the ones that share the max. count is printed.
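As a quick check of that tie-breaking rule, the following sketch runs the same pipeline on a made-up input in which two quoted words tie. The function name most_frequent_sorted and the sample words are assumptions for illustration only.

```shell
# Hypothetical helper: the pipeline from above, reading stdin instead of a file.
most_frequent_sorted() {
  sort | uniq -c | sort -k1,1nr -k2b | head -n 1 | cut -d\" -f2
}

# "a" and "b" (each enclosed in double quotes, as in the sample file)
# both occur twice; the alphabetically first one is reported.
printf '"%s"\n' b a a b | most_frequent_sorted    # prints: a
```

Unlike the awk solution, the result here does not depend on the order of the input lines.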

Note: For convenience, the above uses cut to extract the word without the enclosing double quotes.

To preserve the double quotes, use awk instead of cut:

$ sort file | uniq -c | sort -k1,1nr -k2b | head -n 1 | awk '{print $NF}'
"word1"

Omitting the last pipeline segment and modifying head's -n 1 option allows you to see how many occurrences of each word were found and to find the top N words (including double quotes); e.g., to see the top 10 (with the sample input, you only get 2):

$ sort file | uniq -c | sort -k1,1nr -k2b | head -n 10
   3 "word1"
   2 "word2"
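If you need the top-N listing repeatedly, the pipeline could be wrapped in a small function. The sketch below is one possible way to do that; the name top_words, its argument order, and the default of 10 are assumptions, not an established convention.

```shell
# Hypothetical helper: top_words FILE [N] prints the N most frequent
# lines of FILE together with their counts (N defaults to 10).
top_words() {
  sort "$1" | uniq -c | sort -k1,1nr -k2b | head -n "${2:-10}"
}

# Recreate the sample input from the question in a temporary file.
sample=$(mktemp)
printf '"%s"\n' word1 word1 word2 word2 word1 > "$sample"

top_words "$sample" 1    # only the winner, with its count
top_words "$sample"      # top 10 (with the sample input: both words)
rm -f "$sample"
```

The exact width of the count column depends on the uniq implementation, so parse the output with awk rather than by fixed column positions.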

A note on the sort call, sort -k1,1nr -k2b:

Explicitly stating the sort fields is good practice - both for efficiency and to avoid unexpected results:

-k1,1nr sorts primarily by the 1st whitespace-separated field (k1,1), numerically (n), in reverse order (r).
- Note the explicit end index in -k1,1, as just -k1 would sort everything starting from field 1 through the end of the line.
-k2b then sorts secondarily starting with the 2nd whitespace-separated field through the end of the line (-k2), ignoring leading whitespace (b; the whitespace that separates the fields) and performing lexical (alphabetic) sorting.

Newer versions of GNU sort (not the one on macOS, unfortunately) have a helpful --debug option that visualizes how each line is broken into keys during sorting.

Using just sort -r or sort -nr to sort the whole line is tempting, but doesn't necessarily yield the expected results:

Just sort -r sorts the whole line lexically (alphabetically), in descending order; due to the padded, fixed-width nature of the word counts in the 1st field, the results are still effectively sorted numerically, but in the event of a tie it is the alphabetically last word that is output.
Just sort -nr applies numerical sorting to the whole line, in descending order. Numerical comparison stops at the longest prefix that can be interpreted as a number, but an implicit feature called last-resort comparison (which can be turned off with -s) then sorts the rest of the line alphabetically (in reverse order, in this case). It is therefore also the alphabetically last word that is output in the event of a tie.
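The difference between key-based and whole-line sorting in a tie can be checked with a small sketch like the one below. It assumes GNU sort; the helper name tied and the two count lines are made up, and -s is sort's stable-sort option, which skips the last-resort comparison.

```shell
# Two made-up count lines that tie on the number.
tied() { printf '%s\n' '2 apple' '2 banana'; }

# LC_ALL=C keeps the comparisons independent of the current locale.

# Key-based sort: the alphabetically first word comes out on top.
tied | LC_ALL=C sort -k1,1nr -k2b | head -n 1    # 2 apple

# Whole-line numeric sort: the last-resort comparison is reversed too,
# so the alphabetically last word comes out on top.
tied | LC_ALL=C sort -rn | head -n 1             # 2 banana

# With -s, the last-resort comparison is skipped and tied lines keep
# their input order.
tied | LC_ALL=C sort -rns | head -n 1            # 2 apple
```

Other sort implementations may treat ties differently, so it is worth running a check like this on the target system.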

Case-insensitive variants:

Note that the input is transformed to all-lowercase for simplicity.

awk

awk '{ $0=tolower($0); if (++wds[$0] > max) { max = wds[$0]; maxW=$0 } } END { print maxW }' file

sort + uniq + head + cut

tr '[:upper:]' '[:lower:]' < file |
  sort | uniq -c | sort -k1,1nr -k2b | head -n 1 | cut -d\" -f2
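To illustrate the effect of the lowercase conversion, here is a minimal sketch that runs the case-insensitive pipeline on a made-up, unquoted input. The function name most_frequent_ci and the sample words are assumptions; awk is used instead of cut because the sample words carry no double quotes.

```shell
# Hypothetical helper: case-insensitive variant of the pipeline, reading stdin.
most_frequent_ci() {
  tr '[:upper:]' '[:lower:]' |
    sort | uniq -c | sort -k1,1nr -k2b | head -n 1 | awk '{ print $NF }'
}

# "Word", "WORD", and "word" are folded into one entry with 3 occurrences,
# which beats "other" (2 occurrences).
printf '%s\n' Word other WORD other word | most_frequent_ci    # prints: word
```

Because the input is lowercased before counting, the reported word is always lowercase, regardless of how it was spelled in the file.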

Bela Vizer · Answer

Try something like this: cat test | sort | uniq -c

cat reads the file
sort sorts it for the uniq command, which only collapses adjacent duplicate lines
uniq with -c "prefix lines by the number of occurrences"


Tags:

bash

shell

imcsi97
