I need a script that sorts a text file and removes the duplicates.
Most, if not all, of the examples out there use the sort file1 | uniq > file2
approach.
In the sort man page, though, there is a -u option that does this at the time of sorting.
Is there a reason to use one over the other? Maybe availability of the -u option? Or a memory/speed concern?
sort's -u option gets rid of all duplicates, whereas the uniq command only gets rid of adjacent duplicates.
Checking the man page for uniq: "Repeated lines in the input will not be detected if they are not adjacent, so it may be necessary to sort the files first." Following the man page's suggestion and sorting the list before calling uniq will remove all of the duplicates.
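A quick demonstration of that man-page caveat, using a throwaway input (file path here is just a placeholder):

```shell
# uniq only collapses *adjacent* repeated lines, so non-adjacent
# duplicates survive unless the input is sorted first.
printf 'apple\nbanana\napple\n' > /tmp/fruits.txt

uniq /tmp/fruits.txt          # still three lines: apple, banana, apple
sort /tmp/fruits.txt | uniq   # two lines: apple, banana
sort -u /tmp/fruits.txt       # same two lines, in a single command
```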
The uniq command in UNIX is a command line utility for reporting or filtering repeated lines in a file. It can remove duplicates, show a count of occurrences, show only repeated lines, ignore certain characters and compare on specific fields.
The sort command reorders the lines of a file (or of any command's output) according to the order you specify, which makes the data easier to read and process.
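For context, a few common sort invocations (the input here is a made-up scratch file):

```shell
printf '10\n2\n1\n' > /tmp/nums.txt

sort /tmp/nums.txt      # lexicographic order: 1, 10, 2
sort -n /tmp/nums.txt   # numeric order: 1, 2, 10
sort -rn /tmp/nums.txt  # numeric order, reversed: 10, 2, 1
```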
They should be equivalent in the simple case, but will behave differently if you're using the -k option to define only certain fields of the input line to use as sort keys. In that case, sort -u will suppress lines which have the same key even if other parts of the line differ, whereas uniq will only suppress lines that are exactly identical.
$ cat example
foo baz
quux ping
foo bar
$ sort -k 1,1 --stable example # use just the first word as sort key
foo baz
foo bar
quux ping
$ sort -k 1,1 --stable -u example # suppress lines with the same first word
foo baz
quux ping
but
$ sort -k 1,1 --stable example | uniq
foo baz
foo bar
quux ping
I'm not sure that it's about availability. Most systems I've ever seen have sort and uniq, as they are usually provided by the same package. I just checked a Solaris system from 2001, and its sort has the -u option.
Technically, using a pipe (|) runs each command in its own process, and is going to be slightly more resource-intensive as it requests multiple PIDs from the OS.
If you go to the source code for sort, which comes in the coreutils package, you can see that it simply skips printing duplicates as it prints its own sorted list, and doesn't make use of the independent uniq code.
To see how it works follow the link to sort's source and see the functions below this comment:
/* If uniquified output is turned on, output only the first of an identical series of lines. */
Although I believe sort -u should be faster, the performance gains are going to be minimal unless you're running sort | uniq on huge files, since the pipeline has to read through the entire sorted output again.
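If you want to measure this yourself, here is a rough benchmark sketch (the file size and path are arbitrary choices; timings will vary by system and locale, but the two commands should always produce identical output):

```shell
# Generate a large file with many duplicated lines.
seq 1 1000000 | awk '{print $1 % 1000}' > /tmp/big.txt

time sort /tmp/big.txt | uniq > /dev/null  # two processes, extra pass over the data
time sort -u /tmp/big.txt > /dev/null      # one process, duplicates dropped while printing
```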
One difference is that 'uniq -c' can count (and print) the number of occurrences of each line. You lose this ability when you use 'sort -u' to sort and deduplicate in one step.
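For example, counting occurrences (using a throwaway input; uniq -c prefixes each line with its count):

```shell
# sort first so duplicates are adjacent, then count each distinct line
printf 'a\nb\na\na\n' | sort | uniq -c
# output shows a count before each line, roughly:
#   3 a
#   1 b
```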