I have a text file with one word on every line:
"word1"
"word1"
"word2"
"word2"
"word1"
I'd like to find out which word occurs the most, but I have no idea how to do that. Any ideas?
Note: See bottom for case-insensitive solutions.
A combination of sort, uniq, head, and cut calls is conceptually simplest, and also extensible, but here's a single-pass awk solution that is probably more efficient, although more complex. It is limited to finding only the "winner", and in the event of a tie the result depends on the order of the input:
awk '{ if (++words[$0] > max) { max = words[$0]; maxW=$0 } } END { print maxW }' file
With the sample input, this returns "word1" (including the double quotes), because "word1" occurs three times and "word2" only twice.
Use print max, maxW to also output the count.
In the event of a tie, among the words that share the max. count, it is the one whose last occurrence happens to come first in the input file that "wins" (is output).
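To make the tie-breaking rule concrete, here is a small sketch that wraps the awk command above in a shell function and feeds it two tied inputs. The function name most_frequent and the one-letter inputs are made up for illustration:

```shell
# Hypothetical helper: the single-pass awk command from above, reading stdin.
most_frequent() {
  awk '{ if (++words[$0] > max) { max = words[$0]; maxW = $0 } } END { print maxW }'
}

# Both words occur twice; "b" reaches its final count on line 3, "a" on line 4.
printf '%s\n' a b b a | most_frequent   # prints b

# Here "a" reaches its final count first (on line 3), so it wins instead.
printf '%s\n' a b a b | most_frequent   # prints a
```

The comparison is a strict greater-than, so a word that merely equals the current maximum never displaces the current winner.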
Here's the multi-utility equivalent, which allows extending the solution to the top N words and also offers predictable ordering among the winners in the event of a tie:
$ sort file | uniq -c | sort -k1,1nr -k2b | head -n 1 | cut -d\" -f2
word1
In the event of a tie, the alphabetically first word among the ones that share the max. count is printed.
Note: For convenience, the above uses cut to extract the word without the enclosing double quotes.
To preserve the double quotes, use awk instead of cut:
$ sort file | uniq -c | sort -k1,1nr -k2b | head -n 1 | awk '{print $NF}'
"word1"
Omitting the last pipeline segment and modifying head's -n 1 option allows you to see how many occurrences of each word were found and to find the top N words (including double quotes); e.g., to see the top 10 (with the sample input, you only get 2):
$ sort file | uniq -c | sort -k1,1nr -k2b | head -n 10
3 "word1"
2 "word2"
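If you need the top-N listing more than once, the pipeline can be wrapped in a function. The following is only a sketch: top_words is a hypothetical name, the tab-separated output format is my own choice, and the sample file is recreated in a temporary location so the snippet is self-contained:

```shell
# Hypothetical helper: top_words FILE [N] prints "count<TAB>line" for the N
# most frequent lines (default 10), most frequent first.
top_words() {
  sort -- "$1" | uniq -c | sort -k1,1nr -k2b | head -n "${2:-10}" |
    awk '{ count = $1; sub(/^ *[0-9]+ /, ""); print count "\t" $0 }'
}

# Recreate the sample input from the question in a temporary file.
sample=$(mktemp)
printf '%s\n' '"word1"' '"word1"' '"word2"' '"word2"' '"word1"' > "$sample"

top_words "$sample" 10   # both distinct words, with their counts
top_words "$sample" 1    # only the most frequent word
rm -f "$sample"
```

The awk step strips the left-padding that uniq -c adds to the count, which makes the output easier to feed into other tools.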
A note on the sort call, sort -k1,1nr -k2b:
Explicitly stating the sort fields is good practice - both for efficiency and to avoid unexpected results:
-k1,1nr sorts primarily by the 1st whitespace-separated field (k1,1), numerically (n), in reverse order (r).
Note the need to bound the key as -k1,1, as just -k1 would sort everything starting from field 1 through the end of the line.
-k2b then sorts secondarily, starting with the 2nd whitespace-separated field through the end of the line (-k2), ignoring leading whitespace (b; the whitespace that separates the fields) and performing lexical (alphabetic) sorting.
Newer versions of GNU sort (not the one on macOS, unfortunately) have a helpful --debug option that visualizes how each line is broken into keys during sorting.
Using just sort or sort -nr to sort the whole line is tempting, but doesn't necessarily yield the expected results:
Just sort sorts the whole line lexically (alphabetically), in ascending order; due to the padded fixed-width nature of the word counts in the 1st field, the results are still effectively numerically sorted. Because the order is ascending, the highest count ends up on the last line (so you would need tail -n 1 rather than head -n 1), and in the event of a tie it is the alphabetically last word that is output.
Just sort -nr applies numerical sorting to the whole line, in descending order. With numerical sorting, parsing stops at the longest prefix that can be interpreted as a number; lines whose numbers are equal are then ordered by an implicit feature called last-resort comparison (which GNU sort lets you turn off with -s), which compares the lines as a whole (in reverse order, in this case). It is therefore also the alphabetically last word that is output in the event of a tie.
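The difference between explicit keys and a whole-line numeric sort only shows up when counts are tied, so here is a minimal sketch with made-up "count word" lines (tied_input is a hypothetical helper; LC_ALL=C is set only to make the comparison byte-based and reproducible):

```shell
# Made-up "count word" lines; the three lines with count 2 are tied.
tied_input() { printf '%s\n' '2 b' '2 c' '2 a' '10 x'; }

# Explicit keys: count descending, then word ascending.
# Order of the tied lines: 2 a, 2 b, 2 c
tied_input | LC_ALL=C sort -k1,1nr -k2b

# Whole-line numeric sort, reversed: the last-resort comparison is
# reversed too. Order of the tied lines: 2 c, 2 b, 2 a
tied_input | LC_ALL=C sort -nr
```

In both cases "10 x" comes first; only the order within the tie changes, which is exactly what decides the output of head -n 1 when several words share the top count.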
Case-insensitive variants:
Note that the input is transformed to all-lowercase for simplicity, so the winning word is also output in lowercase.
awk
awk '{ $0=tolower($0); if (++wds[$0] > max) { max = wds[$0]; maxW=$0 } } END { print maxW }' file
sort + uniq + head + cut
tr '[:upper:]' '[:lower:]' < file |
sort | uniq -c | sort -k1,1nr -k2b | head -n 1 | cut -d\" -f2
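As a quick check of the case-folding approach, the sketch below runs the pipeline on made-up mixed-case input. The function name most_frequent_ci is hypothetical, and awk '{ print $NF }' is used in place of cut because this test input has no double quotes:

```shell
# Hypothetical helper: case-insensitive variant of the pipeline, reading stdin.
# awk '{ print $NF }' assumes one whitespace-free word per line.
most_frequent_ci() {
  tr '[:upper:]' '[:lower:]' |
    sort | uniq -c | sort -k1,1nr -k2b | head -n 1 | awk '{ print $NF }'
}

# "Foo", "foo" and "FOO" fold to the same word, which outnumbers "bar".
printf '%s\n' Foo bar foo BAR FOO | most_frequent_ci   # prints foo
```

Folding the case before sorting is what lets uniq -c see the variants as adjacent identical lines.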
Try something like this, which prints each distinct line prefixed with its number of occurrences: cat test | sort | uniq -c