Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is a compiled python regex slower?

In another SO question, the performance of regexes and Python's in operator were compared. However, the accepted answer uses re.match, which only matches the beginning of a string, and thus behaves completely different to in. Also, I wanted to see the performance gain of not recompiling the RE each time.

Surprisingly, I see that the pre-compiled version seems to be slower.

Any ideas why?

I am aware that there are quite a few other questions here that wonder about a similar issue. Most of them perform the way they do simply because they do not correctly reuse the compiled regex. If that is also my issue, please explain.

from timeit import timeit
import re

pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod' \
       'tempor incididunt ut labore et dolore magna aliqua.'

compiled_pattern = re.compile(pattern)

def find():
    assert text.find(pattern) > -1

def re_search():
    assert re.search(pattern, text)

def re_compiled():
    assert re.search(compiled_pattern, text)

def in_find():
    assert pattern in text

print('str.find     ', timeit(find))
print('re.search    ', timeit(re_search))
print('re (compiled)', timeit(re_compiled))
print('in           ', timeit(in_find))

Output:

str.find      0.36285957560356435
re.search     1.047689160564772
re (compiled) 1.575113873320307
in            0.1907925627077569
like image 448
polwel Avatar asked Nov 24 '17 16:11

polwel


People also ask

Why is Python regex slow?

until the end of the string. It effectively has quadratic complexity O(n2) where n is the length of the string. The problem can be resolved by anchoring your pattern, so it fails right away at positions that your pattern has no chance to match.

Does compiling regex make it faster?

I created a much simpler test that will show you that compiled regular expressions are unquestionably faster than not compiled. Here, the compiled regular expression is 35% faster than the not compiled regular expression.

Should I compile regex python?

compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program. So my conclusion is, if you are going to match the same pattern for many different texts, you better precompile it.

What does it mean to compile a regex?

compile() method is used to compile a regular expression pattern provided as a string into a regex pattern object ( re. Pattern ). Later we can use this pattern object to search for a match inside different target strings using regex methods such as a re. match() or re.search() .


1 Answers

Short answer

If you call compiled_pattern.search(text) directly, it won't call _compile at all, it will be faster than re.search(pattern, text) and much faster than re.search(compiled_pattern, text).

This performance difference is due to KeyErrors in cache and slow hash calculations for compiled patterns.


re functions and SRE_Pattern methods

Any time a re function with a pattern as 1st argument (e.g. re.search(pattern, string) or re.findall(pattern, string)) is called, Python tries to compile the pattern first with _compile and then calls the corresponding method on the compiled pattern. For example:

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

Note that pattern could be either a string or an already compiled pattern (an SRE_Pattern instance).

_compile

Here's a compact version of _compile. I simply removed debug and flags check:

_cache = {}
_pattern_type = type(sre_compile.compile("", 0))
_MAXCACHE = 512

def _compile(pattern, flags):
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    loc = None
    _cache[type(pattern), pattern, flags] = p, loc
    return p

_compile with String pattern

When _compile is called with a string pattern, the compiled pattern is saved in _cache dict. Next time the same function is called (e.g. during the many timeit runs), _compile simply checks in _cache if this string has already been seen and returns the corresponding compiled pattern.

Using ipdb debugger in Spyder, it's easy to dive into re.py during execution.

import re

pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod' \
       'tempor incididunt ut labore et dolore magna aliqua.'

compiled_pattern = re.compile(pattern)

re.search(pattern, text)
re.search(pattern, text)

With a breakpoint at the second re.search(pattern, text), it can be seen that the pair:

{(<class 'str'>, 'sed', 0): (re.compile('sed'), None)}

is saved in _cache. The compiled pattern is returned directly.

_compile with compiled pattern

slow hash

What happens if _compile is called with an already compiled pattern?

First, _compile checks if the pattern is in _cache. To do so, it needs to calculate its hash. This calculation is much slower for a compiled pattern than for a string:

In [1]: import re

In [2]: pattern = "(?:a(?:b(?:b\\é|sorbed)|ccessing|gar|l(?:armists|ternation)|ngels|pparelled|u(?:daciousness's|gust|t(?:horitarianism's|obiographi
   ...: es)))|b(?:aden|e(?:nevolently|velled)|lackheads|ooze(?:'s|s))|c(?:a(?:esura|sts)|entenarians|h(?:eeriness's|lorination)|laudius|o(?:n(?:form
   ...: ist|vertor)|uriers)|reeks)|d(?:aze's|er(?:elicts|matologists)|i(?:nette|s(?:ciplinary|dain's))|u(?:chess's|shanbe))|e(?:lectrifying|x(?:ampl
   ...: ing|perts))|farmhands|g(?:r(?:eased|over)|uyed)|h(?:eft|oneycomb|u(?:g's|skies))|i(?:mperturbably|nterpreting)|j(?:a(?:guars|nitors)|odhpurs
   ...: 's)|kindnesses|m(?:itterrand's|onopoly's|umbled)|n(?:aivet\\é's|udity's)|p(?:a(?:n(?:els|icky|tomimed)|tios)|erpetuating|ointer|resentation|
   ...: yrite)|r(?:agtime|e(?:gret|stless))|s(?:aturated|c(?:apulae|urvy's|ylla's)|inne(?:rs|d)|m(?:irch's|udge's)|o(?:lecism's|utheast)|p(?:inals|o
   ...: onerism's)|tevedore|ung|weetest)|t(?:ailpipe's|easpoon|h(?:ermionic|ighbone)|i(?:biae|entsin)|osca's)|u(?:n(?:accented|earned)|pstaging)|v(?
   ...: :alerie's|onda)|w(?:hirl|ildfowl's|olfram)|zimmerman's)"

In [3]: compiled_pattern = re.compile(pattern)

In [4]: % timeit hash(pattern)
126 ns ± 0.358 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [5]: % timeit hash(compiled_pattern)
7.67 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

hash(compiled_pattern) is 60 times slower than hash(pattern) here.

KeyError

When a pattern is unknown, _cache[type(pattern), pattern, flags] fails with a KeyError.

The KeyError gets handled and ignored. Only then does _compile check if the pattern is already compiled. If it is, it gets returned, without being written in cache.

It means that the next time _compile is called with the same compiled pattern, it will calculate the useless, slow hash again, but will still fail with a KeyError.

Error handling is expensive, and I suppose that's the main reason why re.search(compiled_pattern, text) is slower than re.search(pattern, text).

This weird behaviour might be a choice to speed up calls with string patterns, but it might have been a good idea to write a warning if _compile is called with an already compiled pattern.

like image 85
Eric Duminil Avatar answered Sep 27 '22 22:09

Eric Duminil