Let's say I have a really large file foo.txt and I want to iterate through it, doing something each time a regular expression matches. Currently I do this:
import re

f = open('foo.txt')
s = f.read()
f.close()
for m in re.finditer(regex, s):
    doSomething()
Is there a way to do this without having to store the entire file in memory?
NOTE: Reading the file line by line is not an option because the regex can possibly span multiple lines.
UPDATE: I would also like this to work with stdin if possible.
UPDATE: I am considering somehow emulating a string object with a custom file wrapper, but I am not sure whether the regex functions would accept a custom string-like object.
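For files on disk, one well-known option that sidesteps the custom-wrapper question is the standard mmap module: re patterns compiled from bytes accept any object exposing the buffer protocol, and a memory-mapped file is such an object, so the file can be searched without reading it all into memory. A minimal sketch, assuming the pattern can be expressed as bytes and the file is non-empty (this does not help with stdin, since pipes cannot be memory-mapped):

import mmap
import re

with open('foo.txt', 'rb') as f:
    # Map the whole file read-only; the OS pages it in on demand.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Bytes patterns can be matched directly against the mmap object.
        for m in re.finditer(rb'some pattern', mm):
            doSomething()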
If you can limit the number of lines that the regex can span to some reasonable number, then you can use a collections.deque to create a rolling window on the file and keep only that number of lines in memory.
from collections import deque

def textwindow(filename, numlines):
    with open(filename) as f:
        # Prime the window with the first numlines lines of the file.
        window = deque((f.readline() for i in range(numlines)), maxlen=numlines)
        nextline = True
        while nextline:
            text = "".join(window)
            yield text
            # Slide the window down one line; readline() returns '' at EOF,
            # which ends the loop after the final window is yielded.
            nextline = f.readline()
            window.append(nextline)
for text in textwindow("bigfile.txt", 10):
    # test to see whether your regex matches and do something
    if re.search(regex, text):
        doSomething()
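To cover the stdin update as well: the same rolling window works over any already-open file object, so a variant that takes a stream instead of a filename handles sys.stdin too. A sketch, where streamwindow is a hypothetical name and regex/doSomething are the question's placeholders:

import re
import sys
from collections import deque

def streamwindow(f, numlines):
    # Same rolling window as textwindow above, but over an open file
    # object rather than a filename, so sys.stdin works as well.
    window = deque((f.readline() for i in range(numlines)), maxlen=numlines)
    nextline = True
    while nextline:
        yield "".join(window)
        nextline = f.readline()
        window.append(nextline)

for text in streamwindow(sys.stdin, 10):
    if re.search(regex, text):
        doSomething()

Note that a match spanning fewer lines than the window will appear in several overlapping windows, so you may need to deduplicate matches, for example by tracking their absolute offsets.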