
What is the difference between 'sort -u' and 'uniq'?

Tags: bash, sorting, uniq

I need a script that sorts a text file and removes the duplicates. Most, if not all, of the examples out there use the sort file1 | uniq > file2 approach. In man sort, though, there is a -u option that does this at the time of sorting.

Is there a reason to use one over the other? Maybe availability of the -u option? Or a memory/speed concern?

Stoinov asked Mar 09 '14 20:03



3 Answers

They should be equivalent in the simple case, but will behave differently if you're using the -k option to define only certain fields of the input line to use as sort keys. In that case, sort -u will suppress lines which have the same key even if other parts of the line differ, whereas uniq will only suppress lines that are exactly identical.

$ cat example 
foo baz
quux ping
foo bar
$ sort -k 1,1 --stable example # use just the first word as sort key
foo baz
foo bar
quux ping
$ sort -k 1,1 --stable -u example # suppress lines with the same first word
foo baz
quux ping

but

$ sort -k 1,1 --stable example | uniq
foo baz
foo bar
quux ping
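
For the simple case with no -k option, the equivalence can be checked directly. This is a minimal sketch with made-up sample data; the variable names are only for illustration, and LC_ALL=C is set as an assumption so that both tools compare lines byte by byte.

```shell
# Sample input with duplicates (hypothetical data, not from the question).
input=$(printf 'pear\napple\npear\nbanana\napple\n')

# Approach 1: let sort drop the duplicates itself.
with_u=$(printf '%s\n' "$input" | LC_ALL=C sort -u)

# Approach 2: sort first, then let uniq drop adjacent duplicates.
with_uniq=$(printf '%s\n' "$input" | LC_ALL=C sort | LC_ALL=C uniq)

# With whole-line comparison both approaches should give the same result.
if [ "$with_u" = "$with_uniq" ]; then
    same=yes
else
    same=no
fi
echo "identical output: $same"
```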
Ian Roberts answered Oct 17 '22 10:10


I'm not sure that it's about availability. Most systems I've ever seen have both sort and uniq, as they are usually provided by the same package. I just checked a Solaris system from 2001 and its sort has the -u option.

Technically, piping sort into uniq (|) starts a second process and copies the sorted data through the pipe, so it is going to be somewhat more resource intensive than a single sort -u.

If you go to the source code for sort, which comes in the coreutils package, you can see that it actually just skips printing duplicates as it's printing its own sorted list and doesn't make use of the independent uniq code.

To see how it works follow the link to sort's source and see the functions below this comment:

 /* If uniquified output is turned on, output only the first of
   an identical series of lines. */
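
The idea in that comment, printing a line only when it differs from the one before it, can be imitated in a few lines of shell. This is an illustration only, not the actual implementation (sort is written in C); the variable names and sample data are made up.

```shell
# Start from input that is already sorted, as it would be inside sort.
sorted=$(printf 'b\na\nb\nc\na\n' | LC_ALL=C sort)

result=""
have_prev=no
prev=""
while IFS= read -r line; do
    # Keep the line only if it starts a new run of identical lines.
    if [ "$have_prev" = no ] || [ "$line" != "$prev" ]; then
        result="${result}${line}
"
    fi
    prev=$line
    have_prev=yes
done <<EOF
$sorted
EOF

printf '%s' "$result"
```

Note that when -k is in use, "identical" means the keys compare equal rather than the whole lines, which is why sort -u and uniq can disagree.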

Although I believe sort -u should be faster, the performance gains are really going to be minimal unless you're running sort | uniq on huge files, as uniq has to read through the entire sorted output again.

cmrust answered Oct 17 '22 10:10


One difference is that 'uniq -c' can count (and print) the number of occurrences of each line. You lose this ability when you use 'sort -u' to remove the duplicates.
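
A small sketch of the counting use case, with made-up sample data and illustrative variable names. The input has to be sorted first because uniq only compares adjacent lines.

```shell
# Hypothetical sample input with repeated lines.
input=$(printf 'pear\napple\npear\nbanana\napple\npear\n')

# uniq -c prefixes each distinct line with how many times it occurred in a row.
counts=$(printf '%s\n' "$input" | LC_ALL=C sort | LC_ALL=C uniq -c)
printf '%s\n' "$counts"

# Most frequent line first: sort the counted output numerically, in reverse.
printf '%s\n' "$counts" | LC_ALL=C sort -rn
```

There is no option to sort -u that gives these counts, so this is a case where the sort | uniq pipeline is still needed.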

Oktay answered Oct 17 '22 11:10