Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does Python's regex pattern caching work?

From the Python docs for re.compile():

Note The compiled versions of the most recent patterns passed to re.match(), re.search() or re.compile() are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.

However, in my testing, this assertion doesn't seem to hold up. When timing the following snippets that use the same pattern repeatedly, the compiled version is still substantially faster than the uncompiled one (which should supposedly be cached).

Is there something I am missing here that explains the time difference?

import timeit

setup = """
import re
pattern = "p.a.t.t.e.r.n"
target = "p1a2t3t4e5r6n"
r = re.compile(pattern)
"""

print "compiled:", \
    min(timeit.Timer("r.search(target)", setup).repeat(3, 5000000))
print "uncompiled:", \
    min(timeit.Timer("re.search(pattern, target)", setup).repeat(3, 5000000))

Results:

compiled: 2.26673030059
uncompiled: 6.15612802627
like image 221
verdesmarald Avatar asked Sep 20 '12 13:09

verdesmarald


People also ask

How does Python regex work?

Regular Expressions, also known as “regex” or “regexp”, are used to match strings of text such as particular characters, words, or patterns of characters. It means that we can match and extract any string pattern from the text with the help of regular expressions.

How does regex compile work?

compile() method is used to compile a regular expression pattern provided as a string into a regex pattern object ( re. Pattern ). Later we can use this pattern object to search for a match inside different target strings using regex methods such as a re. match() or re.search() .

Is compiled regex faster?

compiled regex's being slower than interpreted ones. There is a lot of work that went into making compiled regex's performant. To get a good comparison between Compiled and not compiled you should test the performance of a single compiled Regex and single non-compiled Regex matching N times.

What flavor of regex does Python use?

Python: The regex flavor supported by Python's built-in re module. Ruby: The regex flavor built into the Ruby programming language.


1 Answers

Here's the (CPython) implementation of re.search:

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

and here is re.compile:

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)

which relies on re._compile:

def _compile(*key):
    # internal: compile pattern
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)            #_cache is a dict.   
    if p is not None:
        return p
    pattern, flags = key
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError('Cannot process flags argument with a compiled pattern')
        return pattern 
    if not sre_compile.isstring(pattern):
        raise TypeError, "first argument must be string or compiled pattern"
    try:
        p = sre_compile.compile(pattern, flags)
    except error, v:
        raise error, v # invalid expression
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    _cache[cachekey] = p
    return p

So you can see that as long as the regex is already in the dictionary, the only extra work involved is the lookup in the dictionary (which involves creating a few temporary tuples, a few extra function calls ...).

Update In the good ole' days (the code copied above), the cache used to be completely invalidated when it got too big. These days, the cache cycles -- dropping the oldest items first. This implementation relies on the ordering of python dictionaries (which was an implementation detail until python3.7). In Cpython before python3.6, this would have dropped an arbitrary value out of the cache (which is arguably still better than invalidating the whole cache)

like image 192
mgilson Avatar answered Oct 07 '22 03:10

mgilson