
Pythonic and efficient way of defining multiple regexes for use over many iterations

I am presently writing a Python script to process some 10,000 or so input documents. Based on the script's progress output, I notice that the first 400+ documents are processed very quickly, but then the script slows down, even though the input documents are all approximately the same size.

I am assuming this may have to do with the fact that most of the document processing is done with regexes that I do not save as compiled regex objects; instead, I recompile them whenever I need them.

Since my script has about 10 different functions, each of which uses about 10-20 different regex patterns, I am wondering what would be a more efficient way in Python to avoid recompiling the patterns over and over again (in Perl I could simply add the //o modifier).

My assumption is that if I store the regex objects in the individual functions using

pattern = re.compile()

the resulting regex object will not be retained between invocations of the function (each function is called only once per document).
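For illustration, the situation described above might look like this sketch (the pattern and function name are made up):

import re

def find_titles(document):
    # The compiled object is rebuilt (or fetched from re's internal
    # cache) on every call, since the local name goes away on return.
    pattern = re.compile(r"^Title: (.+)$", re.MULTILINE)
    return pattern.findall(document)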

Creating a global list of pre-compiled regexes seems an unattractive option, since I would need to keep the regexes in a different place in my code from where they are actually used.

Any advice here on how to handle this neatly and efficiently?

asked Mar 28 '12 by Pat

2 Answers

The re module caches compiled regex patterns. The cache is emptied once it reaches re._MAXCACHE entries, which by default is 100 (newer Python 3 versions raised the default to 512). Since you have 10 functions with 10-20 regexes each, i.e. 100-200 distinct patterns in total, the slowdown you observe is consistent with the cache being repeatedly filled and cleared.

If you are okay with changing private variables, a quick and dirty fix to your program might be to set re._MAXCACHE to a higher value:

import re
re._MAXCACHE = 1000  # private attribute; not part of the documented API
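If relying on a private attribute feels too fragile, a common alternative is to compile each pattern once at module level, directly above the function that uses it; a minimal sketch (the pattern and function name are made up):

import re

# Compiled once at import time and reused on every call.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def extract_dates(document):
    return DATE_RE.findall(document)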
answered by unutbu


Last time I looked, re.compile maintained a rather small cache, and when it filled up, it simply emptied it. DIY with no size limit:

import re

class MyRECache(object):
    """Dict-backed cache of compiled regexes with no size limit."""
    def __init__(self):
        self.cache = {}

    def compile(self, regex_string):
        # Compile each distinct pattern only once; reuse it thereafter.
        if regex_string not in self.cache:
            self.cache[regex_string] = re.compile(regex_string)
        return self.cache[regex_string]
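A usage sketch, reusing the MyRECache class and import from the snippet above (the pattern and function name are illustrative): create one cache instance at module level and call its compile method wherever a regex is needed:

recache = MyRECache()

def count_words(text):
    # Compiled on the first call; served from the dict cache afterwards.
    word_re = recache.compile(r"\w+")
    return len(word_re.findall(text))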
answered by John Machin