Remove duplicate lines without sorting [duplicate]

I have a utility script in Python:

#!/usr/bin/env python
import sys

unique_lines = []
duplicate_lines = []

for line in sys.stdin:
  if line in unique_lines:
    duplicate_lines.append(line)
  else:
    unique_lines.append(line)
    sys.stdout.write(line)

# optionally do something with duplicate_lines

This simple functionality (uniq without needing to sort first, stable ordering) must be available as a simple UNIX utility, mustn't it? Maybe a combination of filters in a pipe?

Reason for asking: I need this functionality on a system where I cannot run Python.

Robottinosino asked Jul 17 '12 23:07



2 Answers

The UNIX Bash Scripting blog suggests:

awk '!x[$0]++' 

This tells awk which lines to print. The variable $0 holds the entire contents of the current line, and the square brackets index an associative array. For each input line, awk looks up the element of x keyed by that line: the line is printed only if the element was not (!) previously set, and the post-increment (++) then marks the line as seen.
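For readers who find the one-liner terse, here is a longhand sketch of the same idea. The function name dedupe and the array name seen are illustrative choices, not part of the original answer; the behaviour is meant to match awk '!x[$0]++'.

```shell
#!/bin/sh
# Longhand sketch of the awk idiom; "dedupe" and "seen" are illustrative names.
dedupe() {
  awk '{
    if (!($0 in seen)) {   # first time this exact line text appears
      print                # emit it, keeping the original input order
    }
    seen[$0]++             # record every occurrence of the line
  }' "$@"
}

# Example: reads stdin when no file name is given.
printf 'b\na\nb\nc\na\n' | dedupe
# prints b, a, c (one per line)
```

An else branch on the if could collect or report the repeated lines, much like duplicate_lines in the question's script. Note that awk keeps every distinct line in memory, so very large inputs with many unique lines will use a correspondingly large amount of RAM.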

Michael Hoffman answered Oct 14 '22 14:10


A late answer - I just ran into a duplicate of this - but perhaps worth adding...

The principle behind @1_CR's answer can be written more concisely, using cat -n instead of awk to add line numbers:

cat -n file_name | sort -uk2 | sort -n | cut -f2- 
  • Use cat -n to prepend line numbers
  • Use sort -u to remove duplicate data (-k2 says 'start at field 2 for sort key')
  • Use sort -n to sort by prepended number
  • Use cut to remove the line numbering (-f2- says 'select field 2 till end')
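The four steps above can be tried on a small sample as follows. This is a sketch under assumptions: the wrapper name dedupe_numbered is made up for illustration, and the result relies on sort -u keeping the first line of each group of equal keys. GNU sort documents that behaviour, but POSIX only says all but one line of each group is suppressed, so other implementations may keep a later occurrence.

```shell
#!/bin/sh
# Sketch of the cat/sort/cut pipeline; "dedupe_numbered" is an illustrative name.
dedupe_numbered() {
  cat -n "$@" |   # prepend a line number and a tab to every line
    sort -uk2 |   # keep one line per distinct key (field 2 to end of line)
    sort -n |     # restore the original order using the prepended number
    cut -f2-      # drop the number, keeping everything after the first tab
}

# Example: reads stdin when no file name is given.
printf 'b\na\nb\nc\na\n' | dedupe_numbered
```

Compared with the awk approach, this pipeline sorts the input twice, but sort can spill to temporary files, which may matter when the input is too large to hold in memory.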
Digital Trauma answered Oct 14 '22 14:10