Suppose I have a file similar to the following:
123
123
234
234
123
345
I would like to find how many times '123' was duplicated, how many times '234' was duplicated, etc. Ideally, the output would look like:
123 3
234 2
345 1
The uniq command has a convenient -c option that prefixes each line with the number of times it occurred. This is precisely what we're looking for. However, keep in mind that uniq -c counts duplicates only when the duplicated lines are adjacent, which is why the input must be sorted first.
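For a quick illustration on the sample file (here saved as FILE, one number per line), running uniq -c without sorting counts only adjacent runs, while sorting first groups all duplicates together:

uniq -c FILE
   2 123
   2 234
   1 123
   1 345

sort FILE | uniq -c
   3 123
   2 234
   1 345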
Assuming there is one number per line:
sort <file> | uniq -c
You can use the more verbose --count flag too with the GNU version, e.g., on Linux:
sort <file> | uniq --count
This will print duplicate lines only, with counts:
sort FILE | uniq -cd
or, with GNU long options (on Linux):
sort FILE | uniq --count --repeated
On BSD and OSX you have to use grep to filter out unique lines (the pattern '^ *1 ' matches uniq's count column when the count is exactly 1):
sort FILE | uniq -c | grep -v '^ *1 '
For the given example, the result would be:
   3 123
   2 234
If you want to print counts for all lines including those that appear only once:
sort FILE | uniq -c
or, with GNU long options (on Linux):
sort FILE | uniq --count
For the given input, the output is:
   3 123
   2 234
   1 345
In order to sort the output with the most frequent lines on top, you can do the following (to get all results):
sort FILE | uniq -c | sort -nr
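For the sample file this prints (sort -nr keys on uniq's leading count column; here the order happens to be unchanged, since 123 is both the smallest value and the most frequent):

   3 123
   2 234
   1 345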
or, to get only duplicate lines, most frequent first:
sort FILE | uniq -cd | sort -nr
On OSX and BSD the final one becomes:
sort FILE | uniq -c | grep -v '^ *1 ' | sort -nr
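If the whitespace-sensitive grep pattern feels fragile, a sketch of an equivalent awk filter (not part of the original answer) keys on the count field directly and works with both GNU and BSD tools:

sort FILE | uniq -c | awk '$1 > 1' | sort -nr   # keep lines whose count field exceeds 1

awk's default action prints any line for which the condition holds, so this passes through only lines whose first field, the count produced by uniq -c, is greater than 1.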