Let's say I have a really large file foo.txt and I want to iterate through it, doing something each time a regular expression matches. Currently I do this:
import re

f = open('foo.txt')
s = f.read()
f.close()
for m in re.finditer(regex, s):
    doSomething()
Is there a way to do this without having to store the entire file in memory?
NOTE: Reading the file line by line is not an option because the regex can possibly span multiple lines.
UPDATE: I would also like this to work with stdin if possible.
UPDATE: I am considering somehow emulating a string object with a custom file wrapper, but I am not sure whether the regex functions would accept a custom string-like object.
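For files on disk, one well-known option that sidesteps the custom-wrapper question is the standard mmap module: re patterns compiled from bytes accept any object exposing the buffer protocol, and a memory-mapped file is such an object, so the file can be searched without reading it all into memory. A minimal sketch, assuming the pattern can be expressed as bytes and the file is non-empty (this does not help with stdin, since pipes cannot be memory-mapped):

import mmap
import re

with open('foo.txt', 'rb') as f:
    # Map the whole file read-only; the OS pages it in on demand.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Bytes patterns can be matched directly against the mmap object.
        for m in re.finditer(rb'some pattern', mm):
            doSomething()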
If you can limit the number of lines that the regex can span to some reasonable number, then you can use a collections.deque to create a rolling window on the file and keep only that number of lines in memory.
from collections import deque

def textwindow(filename, numlines):
    with open(filename) as f:
        # Prime the window with the first numlines lines of the file.
        window = deque((f.readline() for i in range(numlines)), maxlen=numlines)
        nextline = True
        while nextline:
            text = "".join(window)
            yield text
            # Slide the window down one line; readline() returns '' at EOF,
            # which ends the loop after the final window is yielded.
            nextline = f.readline()
            window.append(nextline)
for text in textwindow("bigfile.txt", 10):
    # test to see whether your regex matches and do something
    if re.search(regex, text):
        doSomething()
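To cover the stdin update as well: the same rolling window works over any already-open file object, so a variant that takes a stream instead of a filename handles sys.stdin too. A sketch, where streamwindow is a hypothetical name and regex/doSomething are the question's placeholders:

import re
import sys
from collections import deque

def streamwindow(f, numlines):
    # Same rolling window as textwindow above, but over an open file
    # object rather than a filename, so sys.stdin works as well.
    window = deque((f.readline() for i in range(numlines)), maxlen=numlines)
    nextline = True
    while nextline:
        yield "".join(window)
        nextline = f.readline()
        window.append(nextline)

for text in streamwindow(sys.stdin, 10):
    if re.search(regex, text):
        doSomething()

Note that a match spanning fewer lines than the window will appear in several overlapping windows, so you may need to deduplicate matches, for example by tracking their absolute offsets.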