The text file is about 22,000 lines and roughly 3.5 MB, and it contains lots of duplicate lines. I simply want to remove the duplicates, plus any lines that contain certain unwanted strings.
My approach is to read the file into a big list with readlines(), then read it again into one big string with read(). I iterate over the list, count each line's occurrences in the string, and replace the repeats with "" (an empty string). It took me 10 minutes to finish the job?!
Is there any fast way to do this?
Thanks a lot!
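(For reference, the approach described above probably looks something like the sketch below; this is a reconstruction, not the asker's actual code. It is roughly quadratic, because count() and replace() each rescan the entire 3.5 MB string for every one of the 22,000 lines, which explains the 10-minute runtime.)

def slow_dedupe(path):  # path is a placeholder name
    with open(path) as f:
        lines = f.readlines()   # the big list
        f.seek(0)
        text = f.read()         # the big string
    for line in lines:
        # Each count()/replace() call scans the whole string, so the
        # loop is O(n_lines * file_size). It also miscounts whenever
        # one line is a substring of another.
        n = text.count(line)
        if n > 1:
            text = text.replace(line, "", n - 1)  # keep one copy
    return text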
On Linux, the uniq command does this kind of thing: it discards all but the first of adjacent repeated lines, so that no output line is repeated. Because it only compares adjacent lines, you must sort the input first (e.g. sort file.txt | uniq, or simply sort -u file.txt), and grep -v can then drop the lines containing the unwanted strings.
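If you want to stay in Python, a rough analogue of sort | uniq is sorted() plus itertools.groupby, which likewise collapses adjacent duplicates (note that, like sort | uniq, this changes the original line order):

from itertools import groupby

def sort_uniq(lines):
    # Sort so that equal lines become adjacent, then keep one line per
    # group of adjacent equals: the same idea as `sort | uniq`.
    return (key for key, _group in groupby(sorted(lines)))

>>> list(sort_uniq(["bb\n", "aa\n", "bb\n"]))
['aa\n', 'bb\n']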
I almost always do file processing using generators. This makes for code that's fast, easy to modify, and easy to test.
First, build a generator that removes duplicates:
def remove_duplicates(seq):
    """Yield each item of seq once, skipping items seen before."""
    found = set()
    for item in seq:
        if item in found:
            continue
        found.add(item)
        yield item
Does it work?
>>> print("\n".join(remove_duplicates(["aa", "bb", "cc", "aa"])))
aa
bb
cc
Apparently so. Next, create a function that tells you whether or not a line is OK:
def is_line_ok(line):
    """Return False for lines containing an unwanted string."""
    if "bad text1" in line:
        return False
    if "bad text2" in line:
        return False
    return True
Does this work?
>>> is_line_ok("this line contains bad text2.")
False
>>> is_line_ok("this line's ok.")
True
>>>
So now we can combine remove_duplicates with the built-in filter() and our predicate:
>>> seq = ["OK", "bad text2", "OK", "Also OK"]
>>> print("\n".join(remove_duplicates(filter(is_line_ok, seq))))
OK
Also OK
This approach works on any iterable of strings, including an open file:

with open(input_file) as f_in, open(output_file, "w") as f_out:
    f_out.writelines(remove_duplicates(filter(is_line_ok, f_in)))

Everything is processed lazily, one line at a time; the only thing held in memory is the set of unique lines, which for a 3.5 MB file is trivial.
Another option is a one-liner with a set comprehension, though note that a set does not preserve the original line order:

unique_lines = list({line for line in open(input_file)
                     if 'badstring' not in line
                     and 'garbage' not in line})
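If you want the one-liner but also need to keep first-seen order, dict.fromkeys works on Python 3.7+, where dicts preserve insertion order (a sketch, reusing the input_file name and placeholder strings from above):

unique_lines = list(dict.fromkeys(
    # dict keys behave like an ordered set: duplicates collapse,
    # first-seen order is kept (guaranteed since Python 3.7)
    line for line in open(input_file)
    if 'badstring' not in line and 'garbage' not in line
))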
Also, a single compiled regex might be faster than several separate not in tests.
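For example (a sketch; 'badstring' and 'garbage' are the placeholder patterns from above, and whether the regex actually wins depends on the data, so time both):

import re

# One alternation replaces several `not in` checks; use re.escape()
# on the patterns if they are not plain literals.
bad = re.compile('badstring|garbage')

def is_line_ok(line):
    return bad.search(line) is None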