My code does the following:
n
words to the left and n
words to the right of the keyword.NOTE: In this context, a "word" is any string of non-space characters. "$cow123" would be a word, but "health care" would be two words.
Here is my problem:
The code takes an extremely long time to run on the 300 pages, and that time tends to increase very quickly as n
increases.
Here is my code:
fileHandle = open('test_pdf.txt', mode='r')
document = fileHandle.read()
def search(searchText, doc, n):
#Searches for text, and retrieves n words either side of the text, which are returned separately
surround = r"\s*(\S*)\s*"
groups = re.search(r'{}{}{}'.format(surround*n, searchText, surround*n), doc).groups()
return groups[:n],groups[n:]
Here is the nasty culprit:
print search("\$27.5 million", document, 10)
Here's how you can test this code: Copy the function definition from the code block above and run the following:
t = "The world is a small place, we $.205% try to take care of it."
print search("\$.205", t, 3)
I suspect that I have a nasty case of catastrophic backtracking, but I'm too new to regex to point my finger on the problem.
How do I speed up my code?
How about using re.search
(or even string.find
if you're only searching for fixed strings) to find the string, without any surrounding capturing groups. Then you use the position and length of the match (.start
and .end
on a re matchobject, or the return value of find
plus the length of the search string). Get the substring before the match and do /\s*(\S*)\s*\z/
etc. on it, and get the substring after the match and do /\A\s*(\S*)\s*/
etc. on it.
Also, for help with your backtracking: you can use a pattern like \s+\S+\s+
instead of \s*\S*\s*
(two chunks of whitespace have to be separated by a non-zero amount of non-whitespace, or else they wouldn't be two chunks), and you shouldn't butt up two consecutive \s*
s like you do. I think r'\S+'.join([[r'\s+']*(n))
would give the right pattern for capturing n
previous words (but my Python is rusty, so check that).
I see several problems here. The First, and probably worst, is that everything in your "surround" regex is, not just optional but independently optional. Given this string:
"Lorem ipsum tritani impedit civibus ei pri"
...when searchText = "tritani"
and n = 1
, this is what it has to go through before it finds the first match:
regex: \s* \S* \s* tritani
offset 0: '' 'Lorem' ' ' FAIL
'' 'Lorem' '' FAIL
'' 'Lore' '' FAIL
'' 'Lor' '' FAIL
'' 'Lo' '' FAIL
'' 'L' '' FAIL
'' '' '' FAIL
...then it bumps ahead one position and starts over:
offset 1: '' 'orem' ' ' FAIL
'' 'orem' '' FAIL
'' 'ore' '' FAIL
'' 'or' '' FAIL
'' 'o' '' FAIL
'' '' '' FAIL
... and so on. According to RegexBuddy's debugger, it takes almost 150 steps to reach the offset where it can make the first match:
position 5: ' ' 'ipsum' ' ' 'tritani'
And that's with just one word to skip over, and with n=1
. If you set n=2
you end up with this:
\s*(\S*)\s*\s*(\S*)\s*tritani\s*(\S*)\s*\s*(\S*)\s*
I sure you can see where this is is going. Note especially that when I change it to this:
(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)tritani(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)
...it finds the first match in a little over 20 steps. This is one of the most common regex anti-patterns: using *
when you should be using +
. In other words, if it's not optional, don't treat it as optional.
Finally, you may have noticed the \s*\s*
the auto-generated regex
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With