Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python regex: re.search() is extremely slow on large text files

My code does the following:

  1. Take a large text file (i.e. a legal document that is 300 pages as a PDF).
  2. Find a certain keyword (e.g. "small").
  3. Return n words to the left and n words to the right of the keyword.

NOTE: In this context, a "word" is any string of non-space characters. "$cow123" would be a word, but "health care" would be two words.

Here is my problem: The code takes an extremely long time to run on the 300 pages, and that time tends to increase very quickly as n increases.

Here is my code:

fileHandle = open('test_pdf.txt', mode='r')
document = fileHandle.read()

def search(searchText, doc, n):
#Searches for text, and retrieves n words either side of the text, which are returned separately

    surround = r"\s*(\S*)\s*"
    groups = re.search(r'{}{}{}'.format(surround*n, searchText, surround*n), doc).groups()
    return groups[:n],groups[n:]

Here is the nasty culprit:

print search("\$27.5 million", document, 10)

Here's how you can test this code: Copy the function definition from the code block above and run the following:

t = "The world is a small place, we $.205% try to take care of it."
print search("\$.205", t, 3)

I suspect that I have a nasty case of catastrophic backtracking, but I'm too new to regex to point my finger on the problem.

How do I speed up my code?

like image 948
user1274740 Avatar asked Oct 02 '22 09:10

user1274740


2 Answers

How about using re.search (or even string.find if you're only searching for fixed strings) to find the string, without any surrounding capturing groups. Then you use the position and length of the match (.start and .end on a re matchobject, or the return value of find plus the length of the search string). Get the substring before the match and do /\s*(\S*)\s*\z/ etc. on it, and get the substring after the match and do /\A\s*(\S*)\s*/ etc. on it.

Also, for help with your backtracking: you can use a pattern like \s+\S+\s+ instead of \s*\S*\s* (two chunks of whitespace have to be separated by a non-zero amount of non-whitespace, or else they wouldn't be two chunks), and you shouldn't butt up two consecutive \s*s like you do. I think r'\S+'.join([[r'\s+']*(n)) would give the right pattern for capturing n previous words (but my Python is rusty, so check that).

like image 152
hobbs Avatar answered Oct 06 '22 00:10

hobbs


I see several problems here. The First, and probably worst, is that everything in your "surround" regex is, not just optional but independently optional. Given this string:

"Lorem ipsum tritani impedit civibus ei pri"

...when searchText = "tritani" and n = 1, this is what it has to go through before it finds the first match:

regex:      \s*    \S*    \s*    tritani

offset 0:   ''   'Lorem'   ' '   FAIL
            ''   'Lorem'   ''    FAIL
            ''   'Lore'    ''    FAIL
            ''   'Lor'     ''    FAIL
            ''   'Lo'      ''    FAIL
            ''   'L'       ''    FAIL
            ''   ''        ''    FAIL

...then it bumps ahead one position and starts over:

offset 1:   ''   'orem'   ' '    FAIL
            ''   'orem'   ''     FAIL
            ''   'ore'    ''     FAIL
            ''   'or'     ''     FAIL
            ''   'o'      ''     FAIL
            ''   ''       ''     FAIL

... and so on. According to RegexBuddy's debugger, it takes almost 150 steps to reach the offset where it can make the first match:

position 5: ' '  'ipsum'  ' '    'tritani'

And that's with just one word to skip over, and with n=1. If you set n=2 you end up with this:

\s*(\S*)\s*\s*(\S*)\s*tritani\s*(\S*)\s*\s*(\S*)\s*

I sure you can see where this is is going. Note especially that when I change it to this:

(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)tritani(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)

...it finds the first match in a little over 20 steps. This is one of the most common regex anti-patterns: using * when you should be using +. In other words, if it's not optional, don't treat it as optional.

Finally, you may have noticed the \s*\s* the auto-generated regex

like image 43
Alan Moore Avatar answered Oct 06 '22 00:10

Alan Moore