Extract specific text lines?

Question

I have a large text file of several hundred thousand lines. I have to extract 30,000 specific lines that are scattered throughout the file. This is the program I have to extract one line at a time:

# Raw strings keep the backslashes in the Windows paths literal
big_file = open(r'C:\gbigfile.txt', 'r')
small_file3 = open(r'C:\small_file3.txt', 'w')
for line in big_file:
    if 'S0414' in line:
        small_file3.write(line)
big_file.close()
small_file3.close()

How can I speed this up for the 30,000 lines that I need to look up?

Nas Banov · Accepted Answer

Aha! So your real problem is how to test many conditions per line and, if one of them is satisfied, to output that line. The easiest way is probably a regular expression, methinks:

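import re

keywords = ['S0414', 'GT213', 'AT3423', 'PR342']  # etc. - you probably get those from some source
# re.escape guards against keywords that contain regex metacharacters
pattern = re.compile('|'.join(re.escape(k) for k in keywords))

# inf and outf are the already-opened input and output files
for line in inf:
    if pattern.search(line):
        outf.write(line)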

Brian · Answer

Testing many conditions per line is generally slow when using a naive algorithm. There are various superior algorithms (e.g. using tries) that can do much better. I suggest you give the Aho–Corasick string matching algorithm a shot. See here for a Python implementation. It should be considerably faster than the naive approach of using a nested loop and testing every string individually.
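
For illustration, here is a minimal pure-Python sketch of the Aho–Corasick idea. It is not the implementation linked above, and the function names (build_automaton, contains_any, extract_lines) are made up for this example. It builds a trie of the keywords, adds failure links in a breadth-first pass, and then scans each line in a single pass instead of once per keyword.

```python
from collections import deque


def build_automaton(keywords):
    """Build trie transitions, failure links and match flags for the keywords."""
    goto = [{}]      # goto[node] maps a character to the next node
    fail = [0]       # fail[node] is the node to fall back to on a mismatch
    out = [False]    # out[node] is True if some keyword ends at this node

    # Step 1: insert every keyword into the trie.
    for word in keywords:
        node = 0
        for ch in word:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append(False)
            node = nxt
        out[node] = True

    # Step 2: breadth-first pass to compute the failure links.
    queue = deque(goto[0].values())  # depth-1 nodes fall back to the root
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # A keyword ending at the fallback node also ends here.
            out[child] = out[child] or out[fail[child]]
            queue.append(child)

    return goto, fail, out


def contains_any(line, automaton):
    """Return True as soon as any keyword is found in the line."""
    goto, fail, out = automaton
    node = 0
    for ch in line:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        if out[node]:
            return True
    return False


def extract_lines(lines, keywords):
    """Yield only the lines that contain at least one keyword."""
    automaton = build_automaton(keywords)
    for line in lines:
        if contains_any(line, automaton):
            yield line
```

With the files from the question, something like outf.writelines(extract_lines(inf, keywords)) should do, assuming inf and outf are the opened input and output files. This pure-Python loop may well be slower than a compiled regex or a C extension, so treat it as a way to see the mechanics and measure before switching.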
