The text file is about 22,000 lines and roughly 3.5 MB, and it contains lots of duplicate lines. I simply want to remove the duplicates, plus any lines that contain certain unwanted strings.
My approach is to read the file into a big list with readlines(), then read it again into one big string with read(). I iterate over the list, count each line's occurrences in the string, and replace the repeats with "" (an empty string). It took me 10 minutes to finish the job?!
Is there any fast way to do this?
Thanks a lot!
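(For reference, the approach described above probably looks something like the sketch below; this is a reconstruction, not the asker's actual code. It is roughly quadratic, because count() and replace() each rescan the entire 3.5 MB string for every one of the 22,000 lines, which explains the 10-minute runtime.)

def slow_dedupe(path):  # path is a placeholder name
    with open(path) as f:
        lines = f.readlines()   # the big list
        f.seek(0)
        text = f.read()         # the big string
    for line in lines:
        # Each count()/replace() call scans the whole string, so the
        # loop is O(n_lines * file_size). It also miscounts whenever
        # one line is a substring of another.
        n = text.count(line)
        if n > 1:
            text = text.replace(line, "", n - 1)  # keep one copy
    return text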
On Linux, the uniq command does this kind of thing: it discards all but the first of adjacent repeated lines, so that no output line is repeated. Because it only compares adjacent lines, you must sort the input first (e.g. sort file.txt | uniq, or simply sort -u file.txt), and grep -v can then drop the lines containing the unwanted strings.
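If you want to stay in Python, a rough analogue of sort | uniq is sorted() plus itertools.groupby, which likewise collapses adjacent duplicates (note that, like sort | uniq, this changes the original line order):

from itertools import groupby

def sort_uniq(lines):
    # Sort so that equal lines become adjacent, then keep one line per
    # group of adjacent equals: the same idea as `sort | uniq`.
    return (key for key, _group in groupby(sorted(lines)))

>>> list(sort_uniq(["bb\n", "aa\n", "bb\n"]))
['aa\n', 'bb\n']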
I almost always do file processing using generators. This makes for code that's fast, easy to modify, and easy to test.
First, build a generator that removes duplicates:
def remove_duplicates(seq):
    """Yield each item of seq once, skipping items seen before."""
    found = set()
    for item in seq:
        if item in found:
            continue
        found.add(item)
        yield item
Does it work?
>>> print("\n".join(remove_duplicates(["aa", "bb", "cc", "aa"])))
aa
bb
cc
Apparently so. Next, create a function that tells you whether or not a line is OK:
def is_line_ok(line):
    """Return False for lines containing an unwanted string."""
    if "bad text1" in line:
        return False
    if "bad text2" in line:
        return False
    return True
Does this work?
>>> is_line_ok("this line contains bad text2.")
False
>>> is_line_ok("this line's ok.")
True
>>>
So now we can combine remove_duplicates with the built-in filter() and our predicate:
>>> seq = ["OK", "bad text2", "OK", "Also OK"]
>>> print("\n".join(remove_duplicates(filter(is_line_ok, seq))))
OK
Also OK
This approach works on any iterable of strings, including an open file:

with open(input_file) as f_in, open(output_file, "w") as f_out:
    f_out.writelines(remove_duplicates(filter(is_line_ok, f_in)))

Everything is processed lazily, one line at a time; the only thing held in memory is the set of unique lines, which for a 3.5 MB file is trivial.
Another option is a one-liner with a set comprehension, though note that a set does not preserve the original line order:

unique_lines = list({line for line in open(input_file)
                     if 'badstring' not in line
                     and 'garbage' not in line})
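If you want the one-liner but also need to keep first-seen order, dict.fromkeys works on Python 3.7+, where dicts preserve insertion order (a sketch, reusing the input_file name and placeholder strings from above):

unique_lines = list(dict.fromkeys(
    # dict keys behave like an ordered set: duplicates collapse,
    # first-seen order is kept (guaranteed since Python 3.7)
    line for line in open(input_file)
    if 'badstring' not in line and 'garbage' not in line
))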
Also, a single compiled regex might be faster than several separate not in tests.
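For example (a sketch; 'badstring' and 'garbage' are the placeholder patterns from above, and whether the regex actually wins depends on the data, so time both):

import re

# One alternation replaces several `not in` checks; use re.escape()
# on the patterns if they are not plain literals.
bad = re.compile('badstring|garbage')

def is_line_ok(line):
    return bad.search(line) is None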