I have multiple files, each of which I am searching for a sequence of words.
My regex expression basically searches for a sequence where word1 is followed by word2 followed by word 3 etc.. So the expression looks like:
strings = re.findall('word1.*?word2.*?word3', f.read(), re.DOTALL)
For files below 20kb, the expression executes pretty well. However, the execution time exponentially increases for files > 20 kb and the process completely hangs for files close to 100kb. It appears (after having read previous threads) that the problem is to do with using .* in conjunction with re.DOTALL - leading to "catastrophic backtracking". The recommended solution was to provide the input file line by line instead of reading the whole file into a single memory buffer.
However, my input file is filled with random whitespace and "\n" newline characters. My word sequence is also long and occurs over multiple lines. Therefore, I need to input the whole file together into the regex expression in conjunction with re.DOTALL - otherwise a line by line search will never find my sequence.
Is there any way around it?
If you're literally searching for the occurrence of three words, with no regex patterns in them at all, there's no need to use regexes at all – as suggested by @Bart as I wrote this answer :). Something like this might work (untested, and can probably be prettier):
with open('...') as f:
contents = f.read()
words = ['word1', 'word2', 'word3']
matches = []
start_idx = 0
try:
while True:
cand = []
for word in words:
word_idx = contents.index(word, start_idx)
cand.append(word_idx)
start_idx = word_idx + len(word)
matches.append(cand)
except ValueError: # from index() failing
pass
This puts the indices in matches; if you want an equivalent result to findall, you could do, say,
found = [contents[match[0]:match[-1]+len(words[-1]] for match in matches]
You could also make this kind of approach work without reading the whole file in beforehand by replacing the call to index with an equivalent function on files. I don't think the stdlib includes such a function; you'd probably have to manually use readline() and tell() or similar methods on file objects.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With