I am processing large text files (~20MB) of line-delimited data. Most entries are duplicated, and I want to remove the duplicates so that only one copy of each is kept.
Also, to make the problem slightly more complicated, some entries are repeated with an extra bit of info appended. In this case I need to keep the entry containing the extra info and delete the older versions.
e.g. I need to go from this:
BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS

to this:

JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS

NB. the final order doesn't matter.
What is an efficient way to do this?
I can use awk, Python, or any standard Linux command-line tool.
Thanks.
If you don't need to preserve the order of the lines in the file, the sort and uniq commands will do what you need in a very straightforward way. The sort command sorts the lines in alphanumeric order, and the uniq command collapses consecutive identical lines into one.
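For example, assuming the data is in a file called file.txt:

sort file.txt | uniq > deduped.txt

sort -u file.txt does the same thing in a single command.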
How about the following (in Python):
prev = None
# Sorting puts exact duplicates, and entries that are prefixes of longer
# entries, immediately next to each other.
for line in sorted(open('file')):
    line = line.strip()
    # Only print the previous entry if the current line doesn't repeat or
    # extend it; otherwise the previous entry is an older/shorter version.
    if prev is not None and not line.startswith(prev):
        print(prev)
    prev = line
# Don't forget the last entry.
if prev is not None:
    print(prev)
If you find memory usage an issue, you can do the sort as a pre-processing step using Unix sort
(which is disk-based) and change the script so that it doesn't read the entire file into memory.
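For instance, here is a minimal sketch of that variant, assuming the input has already been sorted by Unix sort and arrives on standard input (the script name dedup.py below is just for illustration):

import sys

prev = None
for line in sys.stdin:  # lines are assumed to be already sorted
    line = line.strip()
    if not line:
        continue
    # Only emit the previous entry if the current line doesn't repeat or extend it.
    if prev is not None and not line.startswith(prev):
        print(prev)
    prev = line
# Emit the final entry.
if prev is not None:
    print(prev)

It could then be run as something like sort file | python dedup.py > output.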