I need a script that sorts a text file and removes the duplicates.
Most, if not all, of the examples out there use the sort file1 | uniq > file2
approach.
In the sort man page, though, there is a -u option that does this at the time of sorting.
Is there a reason to use one over the other? Maybe availability of the -u option? Or a memory/speed concern?
sort's -u option gets rid of all duplicates, whereas the uniq command only gets rid of adjacent duplicates.
Checking the man page for uniq: "Repeated lines in the input will not be detected if they are not adjacent, so it may be necessary to sort the files first." Following the man page's suggestion and sorting the list before calling uniq will remove all of the duplicates.
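A quick demonstration of that man-page caveat, using a throwaway input (file path here is just a placeholder):

```shell
# uniq only collapses *adjacent* repeated lines, so non-adjacent
# duplicates survive unless the input is sorted first.
printf 'apple\nbanana\napple\n' > /tmp/fruits.txt

uniq /tmp/fruits.txt          # still three lines: apple, banana, apple
sort /tmp/fruits.txt | uniq   # two lines: apple, banana
sort -u /tmp/fruits.txt       # same two lines, in a single command
```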
The uniq command in UNIX is a command line utility for reporting or filtering repeated lines in a file. It can remove duplicates, show a count of occurrences, show only repeated lines, ignore certain characters and compare on specific fields.
The sort command reorders the lines of a file (or of any command's output) according to the order you specify, which makes the data easier to read and process.
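For context, a few common sort invocations (the input here is a made-up scratch file):

```shell
printf '10\n2\n1\n' > /tmp/nums.txt

sort /tmp/nums.txt      # lexicographic order: 1, 10, 2
sort -n /tmp/nums.txt   # numeric order: 1, 2, 10
sort -rn /tmp/nums.txt  # numeric order, reversed: 10, 2, 1
```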
They should be equivalent in the simple case, but will behave differently if you're using the -k option to define only certain fields of the input line to use as sort keys. In that case, sort -u will suppress lines which have the same key even if other parts of the line differ, whereas uniq will only suppress lines that are exactly identical.
$ cat example
foo baz
quux ping
foo bar
$ sort -k 1,1 --stable example # use just the first word as sort key
foo baz
foo bar
quux ping
$ sort -k 1,1 --stable -u example # suppress lines with the same first word
foo baz
quux ping
but
$ sort -k 1,1 --stable example | uniq
foo baz
foo bar
quux ping
I'm not sure that it's about availability. Most systems I've ever seen have sort and uniq, as they are usually provided by the same package. I just checked a Solaris system from 2001, and its sort has the -u option.
Technically, using a pipe (|) runs each command in its own process, and is going to be slightly more resource-intensive as it requests multiple PIDs from the OS.
If you go to the source code for sort, which comes in the coreutils package, you can see that it simply skips printing duplicates as it prints its own sorted list, and doesn't make use of the independent uniq code.
To see how it works follow the link to sort's source and see the functions below this comment:
/* If uniquified output is turned on, output only the first of an identical series of lines. */
Although I believe sort -u should be faster, the performance gains are going to be minimal unless you're running sort | uniq on huge files, since the pipeline has to read through the entire sorted output again.
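If you want to measure this yourself, here is a rough benchmark sketch (the file size and path are arbitrary choices; timings will vary by system and locale, but the two commands should always produce identical output):

```shell
# Generate a large file with many duplicated lines.
seq 1 1000000 | awk '{print $1 % 1000}' > /tmp/big.txt

time sort /tmp/big.txt | uniq > /dev/null  # two processes, extra pass over the data
time sort -u /tmp/big.txt > /dev/null      # one process, duplicates dropped while printing
```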
One difference is that 'uniq -c' can count (and print) the number of occurrences of each line. You lose this ability when you use 'sort -u' to sort and deduplicate in one step.
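For example, counting occurrences (using a throwaway input; uniq -c prefixes each line with its count):

```shell
# sort first so duplicates are adjacent, then count each distinct line
printf 'a\nb\na\na\n' | sort | uniq -c
# output shows a count before each line, roughly:
#   3 a
#   1 b
```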