 

Removing duplicated lines from a txt file

Tags:

python

linux

awk

I am processing large text files (~20MB) containing data delimited by line. Most data entries are duplicated and I want to remove these duplications to only keep one copy.

Also, to make the problem slightly more complicated, some entries are repeated with an extra bit of info appended. In this case I need to keep the entry containing the extra info and delete the older versions.

e.g. I need to go from this:

BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS

to this:

JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS
NB. the final order doesn't matter.

What is an efficient way to do this?

I can use awk, python or any standard linux command line tool.

Thanks.

Asked by Pete W on Feb 09 '11.
People also ask

How do I delete duplicate lines in files?

If you don't need to preserve the order of the lines in the file, the sort and uniq commands will do what you need in a very straightforward way. The sort command sorts the lines in alphanumeric order, and the uniq command reduces runs of sequential identical lines to a single copy.
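As a minimal sketch of that pipeline (the file name sample.txt is illustrative), note that this handles exact duplicates only; it will not collapse the "extra bits" variants from the question, since those lines are not identical:

```shell
# create a sample file with exact duplicate lines
printf 'BOB 123 1DB\nJIM 456 3DB AX\nBOB 123 1DB\n' > sample.txt

# sort groups identical lines together; uniq collapses adjacent duplicates
sort sample.txt | uniq

# equivalently, sort -u does both in one step
sort -u sample.txt
```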

How do I remove duplicates in notepad?

To remove duplicate lines, press Ctrl + F, select the "Replace" tab, and in the "Find" field enter: ^(.*?)


1 Answer

How about the following (in Python):

prev = None
for line in sorted(open('file')):
  line = line.strip()
  # After sorting, an exact duplicate (or a version with extra info
  # appended) immediately follows the line it starts with, so only
  # emit prev when the current line does NOT extend it.
  if prev is not None and not line.startswith(prev):
    print(prev)
  prev = line
if prev is not None:
  print(prev)

If you find memory usage an issue, you can do the sort as a pre-processing step using Unix sort (which is disk-based) and change the script so that it doesn't read the entire file into memory.
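As a sketch of that approach, the same prefix logic can run entirely in the shell, piping a disk-based sort into a short awk script (equivalent to the Python above; the file name data.txt is illustrative, and LC_ALL=C forces plain byte ordering so a line sorts immediately before its extended version):

```shell
# sample input from the question
printf 'BOB 123 1DB\nJIM 456 3DB AX\nDAVE 789 1DB\nBOB 123 1DB\nJIM 456 3DB AX\nDAVE 789 1DB\nBOB 123 1DB EXTRA BITS\n' > data.txt

# after sorting, a duplicate or extended entry immediately follows the
# line it starts with, so only print prev when the current line is not
# a continuation of it (index()==1 means "starts at position 1")
LC_ALL=C sort data.txt | awk '
  NR > 1 && index($0, prev) != 1 { print prev }
  { prev = $0 }
  END { if (NR) print prev }
'
```

This prints one copy of each entry, keeping the version with the extra info appended.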

Answered by NPE on Sep 22 '22.