I have a large several hudred thousand lines text file. I have to extract 30,000 specific lines that are all in the text file in random spots. This is the program I have to extract one line at a time:
big_file = open('C:\\gbigfile.txt', 'r')
small_file3 = open('C:\\small_file3.txt', 'w')
for line in big_file:
if 'S0414' in line:
small_file3.write(line)
gbigfile.close()
small_file3.close()
How can I speed this up for 30,000 lines that I need to look up>?
Aha! So your real problem is how to test many conditions per line and if one of them is satisfied, to output that line. Easiest will be using regular expression, me thinks:
import re
keywords = ['S0414', 'GT213', 'AT3423', 'PR342'] # etc - you probably get those from some source
pattern = re.compile('|'.join(keywords))
for line in inf:
if pattern.search(ln):
outf.write(line)
Testing many conditions per line is generally slow when using a naive algorithm. There are various superior algorithms (e.g. using Tries) which can do much better. I suggest you give the Aho–Corasick string matching algorithm a shot. See here for a python implementation. It should be considerably faster than the naive approach of using a nested loop and testing every string individually.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With