
Find the most common line in a file in bash

I have a file of strings:

string-string-123
string-string-123
string-string-123
string-string-12345
string-string-12345
string-string-12345-123

How do I retrieve the most common line in bash (string-string-123)?

Alex asked Mar 28 '15 18:03



2 Answers

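You can use sort with uniq -c to count each distinct line, then sort those counts in descending numeric order. The most common line appears first in the output, prefixed by its count: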

sort file | uniq -c | sort -n -r
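To print only the top line rather than the whole ranking, the pipeline can be cut off after its first row and the count column stripped. This is a minimal sketch, assuming the input file is passed as the first argument; the function name most_common is made up for illustration and is not a standard command:

```shell
# most_common FILE: print the single most frequent line of FILE.
# (most_common is a hypothetical helper name, not a standard command.)
most_common() {
    sort -- "$1" |             # group identical lines together
        uniq -c |              # prefix each distinct line with its count
        sort -k1,1nr |         # highest count first
        head -n 1 |            # keep only the top row
        sed 's/^ *[0-9]* //'   # drop the leading count that uniq -c added
}
```

Calling most_common file on the sample data from the question should print string-string-123. If several lines tie for first place, only one of them is printed.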
Grzegorz Żur answered Oct 23 '22 00:10


You could use awk to do this:

awk '{++a[$0]}END{for(i in a)if(a[i]>max){max=a[i];k=i}print k}' file

The array a keeps a count of each line. Once the file has been read, we loop through it and find the line with the maximum count.
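For readability, the same counting logic can be spelled out over several lines. This is only a sketch of the one-liner above with longer variable names; the wrapper name most_common_awk and the guard for an empty file are additions for illustration, not part of the original answer:

```shell
# most_common_awk FILE: the same counting approach, spelled out.
# (most_common_awk is a hypothetical helper name used only here.)
most_common_awk() {
    awk '
        { count[$0]++ }                 # tally every input line, keyed by its full text
        END {
            for (line in count) {       # visit each distinct line once
                if (count[line] > max) {
                    max  = count[line]  # highest tally seen so far
                    best = line         # and the line it belongs to
                }
            }
            if (max > 0) print best     # print nothing for an empty file
        }
    ' "$1"
}
```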

Alternatively, you can skip the loop in the END block by assigning the line during the processing of the file:

awk 'max < ++c[$0] {max = c[$0]; line = $0} END {print line}' file

Thanks to glenn jackman for this useful suggestion.


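It has rightly been pointed out that, in the case of a tie, the two approaches above print only one of the most frequently occurring lines. The first prints whichever tied line the for loop happens to visit first, and the second prints the tied line that reaches the maximum count first. The following version prints all of the most frequently occurring lines: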

awk 'max<++c[$0] {max=c[$0]} END {for(i in c)if(c[i]==max)print i}' file
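One caveat with the tie-aware version: awk does not specify the order in which for (i in c) visits the array, so tied lines may come out in any order. If a stable order matters, one option is to record each line the first time it is seen. This is a sketch under that assumption; the function name most_common_all and the helper array order are hypothetical:

```shell
# most_common_all FILE: print every line tied for the highest count,
# in the order each first appears in FILE.
# (most_common_all is a hypothetical helper name used only here.)
most_common_all() {
    awk '
        !($0 in count) { order[++n] = $0 }           # remember first-seen order
        { if (++count[$0] > max) max = count[$0] }   # tally and track the maximum
        END {
            for (i = 1; i <= n; i++)                 # walk lines in first-seen order
                if (count[order[i]] == max) print order[i]
        }
    ' "$1"
}
```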
Tom Fenech answered Oct 23 '22 02:10