
In Python, is there a way for re.finditer to take a file as input instead of a string?

Let's say I have a really large file foo.txt and I want to iterate through it doing something upon finding a regular expression. Currently I do this:

import re

# Reads the whole file into memory before searching.
with open('foo.txt') as f:
    s = f.read()
for m in re.finditer(regex, s):
    doSomething()

Is there a way to do this without having to store the entire file in memory?

NOTE: Reading the file line by line is not an option because the regex can possibly span multiple lines.

UPDATE: I would also like this to work with stdin if possible.

UPDATE: I am considering somehow emulating a string object with a custom file wrapper but I am not sure if the regex functions would accept a custom string-like object.

Matt asked Dec 26 '22 22:12


1 Answer

If you can limit the number of lines that the regex can span to some reasonable number, then you can use a collections.deque to create a rolling window on the file and keep only that number of lines in memory.

from collections import deque

def textwindow(filename, numlines):
    """Yield overlapping windows of up to numlines consecutive lines."""
    with open(filename) as f:
        # Prime the window with the first numlines lines.
        window   = deque((f.readline() for i in range(numlines)), maxlen=numlines)
        nextline = True
        while nextline:
            text = "".join(window)
            yield text
            # Slide forward one line; the deque drops the oldest line itself.
            nextline = f.readline()
            window.append(nextline)

for text in textwindow("bigfile.txt", 10):
    # test to see whether your regex matches and do something
    pass
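
Because successive windows overlap, a match that fits inside one window will usually be found again in the next few. Below is a minimal sketch of one way to report each match only once, assuming Python 3. The name iter_matches and its (offset, text) result shape are illustrative and not part of the code above. It accepts only matches that start in the oldest line of the window, so every match is tested with the full lookahead, and it takes any iterable of lines, so an open file or sys.stdin should both work.

```python
import re
from collections import deque

def iter_matches(lines, pattern, numlines):
    """Yield (absolute_offset, matched_text) once per match.

    lines    -- any iterable of text lines (open file, sys.stdin, list)
    pattern  -- a compiled regular expression
    numlines -- assumed upper bound on the lines a single match can span
    """
    lines = iter(lines)
    window = deque()
    window_start = 0    # absolute offset of the first character in the window
    consumed = 0        # absolute end offset of the last reported match
    exhausted = False
    while True:
        # Top the window up to numlines lines while input remains.
        while len(window) < numlines and not exhausted:
            line = next(lines, None)
            if line is None:
                exhausted = True
            else:
                window.append(line)
        if not window:
            break
        text = "".join(window)
        first_len = len(window[0])
        # Skip text already covered by a match reported from an earlier window.
        pos = max(0, consumed - window_start)
        for m in pattern.finditer(text, pos):
            if m.start() >= first_len:
                break   # starts in a later line; a later window will report it
            yield window_start + m.start(), m.group()
            consumed = window_start + m.end()
        # Slide the window forward by one line.
        window_start += first_len
        window.popleft()
```

A call might look like `for offset, text in iter_matches(sys.stdin, re.compile(r"start.*?end", re.S), 10)`. A match that needs more than numlines lines is silently missed or cut short, so the bound has to be chosen generously.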
kindall answered Jan 12 '23 00:01
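
If no line limit can be assumed, another option for regular files is to memory-map the file and search the map directly, since the re functions accept bytes-like objects such as mmap.mmap. This is a rough sketch under a few assumptions: Python 3, a bytes pattern (the file is searched as raw bytes, not decoded text), a non-empty file, and a hypothetical helper name find_in_file. The operating system pages the data in on demand instead of Python holding one large string. It does not cover stdin, because a pipe cannot be memory-mapped.

```python
import mmap
import re

def find_in_file(path, pattern):
    """Yield (start_offset, matched_bytes) for each match of a bytes pattern."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in re.finditer(pattern, mm):
                # m.group() copies only the matched bytes out of the map.
                yield m.start(), m.group()
```

The generator should be consumed to the end (for example with a plain for loop) so that the map is closed cleanly. Patterns are written as bytes, such as re.compile(rb"start.*?end", re.S).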