
Extract specific text lines?

Tags:

python

I have a large text file, several hundred thousand lines long. I have to extract 30,000 specific lines that are scattered throughout the file. This is the program I have to extract one line at a time:

big_file = open('C:\\gbigfile.txt', 'r')
small_file3 = open('C:\\small_file3.txt', 'w')
for line in big_file:
   if 'S0414' in line:
      small_file3.write(line)
big_file.close()
small_file3.close()

How can I speed this up for the 30,000 lines that I need to look up?

novak asked Jul 14 '10 16:07



2 Answers

Aha! So your real problem is how to test many conditions per line and, if any one of them is satisfied, output that line. The easiest way is probably a regular expression:

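import re
keywords = ['S0414', 'GT213', 'AT3423', 'PR342'] # etc - you probably get those from some source
# re.escape keeps any regex metacharacters in a keyword literal
pattern = re.compile('|'.join(re.escape(k) for k in keywords))

# one pass over the big file, testing all keywords at once per line
with open('C:\\gbigfile.txt', 'r') as inf, open('C:\\small_file3.txt', 'w') as outf:
    for line in inf:
        if pattern.search(line):
            outf.write(line)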
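
If you want to check what the combined pattern matches before running it over the real file, here is a small sketch on in-memory data. The helper names build_pattern and filter_lines and the sample lines are made up for illustration; only the keywords come from the snippet above.

```python
import re

def build_pattern(keywords):
    # Join all keywords into one alternation; re.escape keeps
    # characters such as '.' or '+' in a keyword literal.
    return re.compile('|'.join(re.escape(k) for k in keywords))

def filter_lines(lines, pattern):
    # Keep every line in which at least one keyword occurs.
    return [line for line in lines if pattern.search(line)]

keywords = ['S0414', 'GT213', 'AT3423', 'PR342']

# Hypothetical sample lines standing in for the big file.
sample = [
    'row 1 S0414 some data\n',
    'row 2 nothing of interest\n',
    'row 3 xxPR342yy\n',
    'row 4 S041 is only a prefix\n',
]

matched = filter_lines(sample, build_pattern(keywords))
```

The same pattern object works unchanged when the lines come from an open file instead of a list.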
Nas Banov answered Oct 07 '22 19:10



Testing many conditions per line is generally slow with a naive algorithm. There are better algorithms (e.g. ones built on tries) that avoid checking each string separately. I suggest you give the Aho–Corasick string matching algorithm a shot. See here for a Python implementation. It should be considerably faster than the naive approach of using a nested loop and testing every string individually.
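
To give a feel for how this works, here is a rough pure-Python sketch of the idea. It is not the implementation linked above, and the names build_automaton and contains_any are made up for this example. It builds a trie of the keywords, adds failure links with a breadth-first pass, and then scans each line once, character by character, regardless of how many keywords there are.

```python
from collections import deque

def build_automaton(keywords):
    goto = [{}]     # goto[state][char] -> next state (trie edges)
    fail = [0]      # fail[state] -> state to fall back to on a mismatch
    hit = [False]   # hit[state] -> True if some keyword ends here
    for word in keywords:
        if not word:
            continue  # an empty keyword would match every line
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                hit.append(False)
            state = nxt
        hit[state] = True
    # Breadth-first pass: depth-1 states keep fail = 0,
    # deeper states derive their link from their parent's link.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A keyword ending at the fallback state also ends here.
            hit[nxt] = hit[nxt] or hit[fail[nxt]]
            queue.append(nxt)
    return goto, fail, hit

def contains_any(line, automaton):
    # Single left-to-right scan; stops at the first keyword found.
    goto, fail, hit = automaton
    state = 0
    for ch in line:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if hit[state]:
            return True
    return False

automaton = build_automaton(['S0414', 'GT213', 'AT3423', 'PR342'])
lines = ['row 1 S0414 data\n', 'row 2 nothing\n', 'row 3 xxAT3423\n']
wanted = [line for line in lines if contains_any(line, automaton)]
```

In pure Python this character loop may well be slower than the compiled regex for a handful of keywords. The point of the algorithm is that the scan cost per line does not grow with the number of keywords, so a C-backed implementation is what you would want for 30,000 of them.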

Brian answered Oct 07 '22 19:10
