
What's the fastest way to remove duplicate lines in a txt file (and also some lines which contain specific strings) using Python?

Tags:

python

file

The txt file is about 22,000 lines (about 3.5 MB). There are lots of duplicate lines in it. I simply want to remove the duplicate lines, plus some lines that contain specific unwanted strings.

My approach is to read the file into a big list with readlines(), then read the file again as one big string with read(). I iterate over the list, count each line's occurrences, and replace the line in the string with "" (an empty string). It took 10 minutes to finish the job?!

Is there any fast way to do this?

Thanks a lot!

Shane asked Oct 22 '10 11:10



2 Answers

I almost always do file processing using generators. This makes for code that's fast, easy to modify, and easy to test.

First, build a generator that removes duplicates:

def remove_duplicates(seq):
    found = set()
    for item in seq:
        if item in found:
            continue
        found.add(item)
        yield item

Does it work?

>>> print "\n".join(remove_duplicates(["aa", "bb", "cc", "aa"]))
aa
bb
cc

Apparently so. Next, create a function that tells you whether or not a line is OK:

def is_line_ok(line):
    if "bad text1" in line:
        return False
    if "bad text2" in line:
        return False
    return True

Does this work?

>>> is_line_ok("this line contains bad text2.")
False
>>> is_line_ok("this line's ok.")
True
>>> 

So now we can use remove_duplicates and itertools.ifilter with our function:

>>> from itertools import ifilter
>>> seq = ["OK", "bad text2", "OK", "Also OK"]
>>> print "\n".join(remove_duplicates(ifilter(is_line_ok, seq)))
OK
Also OK

This method works on any iterable that returns strings, including files:

from itertools import ifilter

with open(input_file, 'r') as f_in:
    with open(output_file, 'w') as f_out:
        f_out.writelines(remove_duplicates(ifilter(is_line_ok, f_in)))
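The code above is Python 2 (print statements, itertools.ifilter). As a minimal sketch, the same pipeline in Python 3 uses the built-in filter, which already returns a lazy iterator (the sample lines here are made up for illustration):

```python
def remove_duplicates(seq):
    # Yield each item the first time it is seen, skipping repeats.
    found = set()
    for item in seq:
        if item in found:
            continue
        found.add(item)
        yield item

def is_line_ok(line):
    # Reject lines containing either unwanted string.
    return "bad text1" not in line and "bad text2" not in line

lines = ["OK\n", "bad text2\n", "OK\n", "Also OK\n"]
result = list(remove_duplicates(filter(is_line_ok, lines)))
print(result)  # ['OK\n', 'Also OK\n']
```

For a file, the final line becomes f_out.writelines(remove_duplicates(filter(is_line_ok, f_in))), exactly as in the Python 2 version but without the itertools import.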
Robert Rossney answered Sep 19 '22 14:09


list(set(line for line in file
         if 'badstring' not in line
         and 'garbage' not in line))
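One caveat: set does not preserve line order. If the output must keep the lines in first-seen order, a dict-based variant works, since Python dicts preserve insertion order (guaranteed from 3.7). This is a sketch using the same placeholder strings as above:

```python
# Filter out unwanted lines, then deduplicate while keeping first-seen order.
lines = ["OK\n", "badstring here\n", "OK\n", "garbage line\n", "Also OK\n"]
deduped = list(dict.fromkeys(
    line for line in lines
    if 'badstring' not in line and 'garbage' not in line))
print(deduped)  # ['OK\n', 'Also OK\n']
```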

Also, a regex might be faster than multiple not in tests.
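A sketch of the regex idea: compile one alternation pattern and call search() once per line instead of running several substring tests. Whether it is actually faster depends on the strings involved, so it is worth timing on real data; the pattern and lines here are illustrative placeholders:

```python
import re

# One compiled pattern scanning for either unwanted string.
bad = re.compile(r'badstring|garbage')

lines = ["keep me\n", "drop badstring\n", "keep too\n", "garbage out\n"]
kept = [line for line in lines if not bad.search(line)]
print(kept)  # ['keep me\n', 'keep too\n']
```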

Marcelo Cantos answered Sep 23 '22 14:09