
Compiling regexes within a frequently called function

Let's say I have a function which searches for multiple patterns in a string using regexes:

import re
def get_patterns(string):
    """
    Takes a string and returns found groups
    of numeric and alphabetic characters.

    """
    re_digits = re.compile(r"(\d+)")
    re_alpha = re.compile(r"(?i)([A-Z]+)")
    digits = re_digits.findall(string)
    alpha = re_alpha.findall(string)
    return digits, alpha

get_patterns("99 bottles of beer on the wall")
(['99'], ['bottles', 'of', 'beer', 'on', 'the', 'wall'])

Now suppose this function will be called hundreds of thousands of times, and that the real patterns are less trivial than in this example. (a) Does it matter that the regexes are compiled inside the function, i.e. is there an efficiency cost to calling re.compile on every function call, or is the compiled pattern reused from a cache? (b) If there is a cost, what is the recommended way to avoid that overhead?

One method would be to pass the function a list of compiled regex objects:

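re_digits = re.compile(r"(\d+)")
re_alpha = re.compile(r"(?i)([A-Z]+)")
def get_patterns(string, regexes):
    # regexes: list of compiled patterns, e.g. [re_digits, re_alpha]
    return tuple(regex.findall(string) for regex in regexes)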

but I dislike how such an approach dissociates the regexes from the dependent function.

UPDATE: As per Jens' recommendation, I ran a quick timing check. Compiling the regexes in the function's default arguments is indeed noticeably faster: it took about 25% less time in this run (roughly 1.3x faster):

def get_patterns_defaults(string,
                          re_digits=re.compile(r"(\d+)"),
                          re_alpha=re.compile(r"(?i)([A-Z]+)")
                          ):
    """
    Takes a string and returns found groups
    of numeric and alphabetic characters.

    """
    digits = re_digits.findall(string)
    alpha = re_alpha.findall(string)
    return digits, alpha

from timeit import Timer
test_string = "99 bottles of beer on the wall"
t = Timer(lambda: get_patterns(test_string))
t2 = Timer(lambda: get_patterns_defaults(test_string))
print(t.timeit(number=100000))  # compiled in function body
print(t2.timeit(number=100000))  # compiled in default arguments
0.629958152771
0.474529981613
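
On part (a), the `re` documentation states that compiled versions of the most recently used patterns are cached. If that applies here, the in-body version would not recompile on every call; it would pay for a function call and a cache lookup instead, which fits the modest gap in the timings above. A minimal check, assuming CPython and a hypothetical helper named `compile_in_body`:

```python
import re

def compile_in_body(pattern=r"(\d+)"):
    # Hypothetical helper: calls re.compile with the same pattern string
    # and flags on every call, as get_patterns does in its body.
    return re.compile(pattern)

first = compile_in_body()
second = compile_in_body()

# On CPython the second call is expected to be served from re's internal
# cache, so both names should refer to the same compiled pattern object.
print(first is second)
```

The cache has a limited size, so a program that cycles through many different patterns may still trigger recompilation.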
glarue asked Nov 28 '25 01:11


1 Answer

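One solution is to compile the regexes as default argument values. Python evaluates default values once, when the function is defined, so the patterns are compiled only once: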

import re
def get_patterns(string, re_digits=re.compile(r"(\d+)"), re_alpha=re.compile(r"(?i)([A-Z]+)")):
    """
    Takes a string and returns found groups
    of numeric and alphabetic characters.

    """
    digits = re_digits.findall(string)
    alpha = re_alpha.findall(string)
    return digits, alpha

Now you can call it:

get_patterns(string)
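
To see that the defaults really are built only once, here is a small sketch. The `counting_compile` wrapper and the `compile_calls` counter are hypothetical names added purely for illustration; they are not part of the `re` module.

```python
import re

compile_calls = 0

def counting_compile(pattern):
    """Hypothetical wrapper around re.compile that counts how often it runs."""
    global compile_calls
    compile_calls += 1
    return re.compile(pattern)

def get_patterns(string,
                 re_digits=counting_compile(r"(\d+)"),
                 re_alpha=counting_compile(r"(?i)([A-Z]+)")):
    """Return groups of numeric and alphabetic characters found in string."""
    return re_digits.findall(string), re_alpha.findall(string)

# Call the function many times; the defaults are not re-evaluated.
for _ in range(1000):
    result = get_patterns("99 bottles of beer on the wall")

print(compile_calls)  # 2: one call per default, made when the def statement ran
print(result)
```

One trade-off of this approach is that the compiled patterns become part of the function's signature, so a caller can override them, deliberately or by accident, by passing extra arguments.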
nicolas answered Nov 29 '25 15:11