Find duplicate lines in a file and count how many times each line was duplicated?

Suppose I have a file similar to the following:

123
123
234
234
123
345

I would like to find how many times '123' was duplicated, how many times '234' was duplicated, etc. So ideally, the output would be like:

123  3
234  2
345  1
asked Jul 15 '11 by user839145

People also ask

How do I count duplicate lines in Linux?

The uniq command has a convenient -c option to count the number of occurrences in the input file. This is precisely what we're looking for. However, one thing we must keep in mind is that the uniq command with the -c option works only when duplicated lines are adjacent.
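To illustrate that caveat, suppose a hypothetical file numbers.txt contains 123, 234, 123 on three separate lines. Without sorting, uniq -c counts the two non-adjacent 123 lines separately; sorting first makes them adjacent (the exact padding of the count column may vary between implementations):

$ uniq -c numbers.txt
   1 123
   1 234
   1 123
$ sort numbers.txt | uniq -c
   2 123
   1 234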

How do you count duplicate lines in Python?

You can count duplicate rows by counting the True values in the boolean pandas.Series returned by duplicated(); the number of True values is obtained with the sum() method. To count the non-duplicate rows instead, invert the series with the negation operator ~ and then call sum().


2 Answers

Assuming there is one number per line:

sort <file> | uniq -c 

With the GNU version (e.g., on Linux), you can use the more verbose --count flag instead:

sort <file> | uniq --count 
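Assuming the sample numbers from the question are stored one per line in a file (numbers.txt is just a placeholder name here), this would produce something like:

$ sort numbers.txt | uniq -c
   3 123
   2 234
   1 345

Note that uniq prints the count before the line, which is the reverse of the layout sketched in the question.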
answered Oct 26 '22 by wonk0


This will print duplicate lines only, with counts:

sort FILE | uniq -cd 

or, with GNU long options (on Linux):

sort FILE | uniq --count --repeated 

On BSD and OS X you have to use grep to filter out the unique lines (the pattern removes lines whose leading count is exactly 1):

sort FILE | uniq -c | grep -v '^ *1 ' 

For the given example, the result would be:

  3 123
  2 234
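As a sketch of an alternative not in the original answer: if you would rather not rely on the exact spacing of uniq -c's output, a small awk filter on the count field achieves the same thing:

sort FILE | uniq -c | awk '$1 > 1'   # keep only lines whose count field exceeds 1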

If you want to print counts for all lines including those that appear only once:

sort FILE | uniq -c 

or, with GNU long options (on Linux):

sort FILE | uniq --count 

For the given input, the output is:

  3 123
  2 234
  1 345
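If you want the line-then-count layout shown in the question, you can swap the two columns with awk (a sketch, assuming the lines themselves contain no whitespace):

sort FILE | uniq -c | awk '{print $2, $1}'

For the given input this prints:

123 3
234 2
345 1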

To sort the output with the most frequent lines on top, you can do the following (to get all lines):

sort FILE | uniq -c | sort -nr 

or, to get only duplicate lines, most frequent first:

sort FILE | uniq -cd | sort -nr 

On OS X and BSD the final one becomes:

sort FILE | uniq -c | grep -v '^ *1 ' | sort -nr 
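A common alternative, not from either answer: for large files you can count in a single awk pass instead of sorting the whole file first, and then sort only the (usually much smaller) table of distinct lines:

# count each distinct line in one pass, then sort by frequency, highest first
awk '{count[$0]++} END {for (line in count) print count[line], line}' FILE | sort -nr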
answered Oct 27 '22 by Andrea