In another SO question, the performance of regexes and Python's in operator was compared. However, the accepted answer uses re.match, which only matches at the beginning of a string and thus behaves completely differently from in. Also, I wanted to see the performance gain of not recompiling the RE each time. To my surprise, the precompiled pattern turns out to be the slowest of the regex variants (see the output below). Any ideas why?
I am aware that there are quite a few other questions here that wonder about a similar issue. Most of their benchmarks behave the way they do simply because they do not correctly reuse the compiled regex. If that is also my issue, please explain.
from timeit import timeit
import re

pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod ' \
       'tempor incididunt ut labore et dolore magna aliqua.'
compiled_pattern = re.compile(pattern)

def find():
    assert text.find(pattern) > -1

def re_search():
    assert re.search(pattern, text)

def re_compiled():
    assert re.search(compiled_pattern, text)

def in_find():
    assert pattern in text

print('str.find     ', timeit(find))
print('re.search    ', timeit(re_search))
print('re (compiled)', timeit(re_compiled))
print('in           ', timeit(in_find))
Output:
str.find      0.36285957560356435
re.search     1.047689160564772
re (compiled) 1.575113873320307
in            0.1907925627077569
When the pattern fails to match at a given position, re.search retries it at the next position, and so on until the end of the string. This effectively has quadratic complexity, O(n²), where n is the length of the string. The problem can be resolved by anchoring your pattern, so it fails right away at positions where your pattern has no chance to match.
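For example (a rough sketch of my own, not from the benchmark above: the pattern a+b never matches a long run of a's, so the unanchored search re-scans the remaining characters from every starting position, while the \A-anchored variant only does real work at position 0):

from timeit import timeit
import re

for n in (1_000, 2_000, 4_000):
    haystack = 'a' * n  # contains no 'b', so neither pattern ever matches
    slow = timeit(lambda: re.search(r'a+b', haystack), number=5)    # retried at every position
    fast = timeit(lambda: re.search(r'\Aa+b', haystack), number=5)  # fails immediately after position 0
    print(f'n={n:5d}  unanchored={slow:.4f}s  anchored={fast:.4f}s')

Doubling n should roughly quadruple the unanchored time while only doubling the anchored one.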
I created a much simpler test that shows that compiled regular expressions are unquestionably faster than uncompiled ones. In my test, the compiled regular expression was about 35% faster than the uncompiled one.
Using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program. So my conclusion is: if you are going to match the same pattern against many different texts, you had better precompile it.
The compile() method is used to compile a regular expression pattern, provided as a string, into a regex pattern object (re.Pattern). Later we can use this pattern object to search for a match inside different target strings using regex methods such as re.match() or re.search().
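For instance (a minimal sketch of that workflow; the pattern and the target strings are just placeholders):

import re

word = re.compile(r'\bsed\b')  # compile the pattern once, getting a re.Pattern object
for line in ('sed do eiusmod', 'lorem ipsum', 'sed ut perspiciatis'):
    # reuse the same pattern object against several target strings
    print(bool(word.search(line)), line)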
If you call compiled_pattern.search(text) directly, it won't call _compile at all, so it will be faster than re.search(pattern, text) and much faster than re.search(compiled_pattern, text).
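A quick way to check this claim (a sketch reusing the variable names from the benchmark above; absolute timings will of course differ):

from timeit import timeit
import re

pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod ' \
       'tempor incididunt ut labore et dolore magna aliqua.'
compiled_pattern = re.compile(pattern)

def re_compiled_direct():
    # method call on the pattern object: re._compile is never involved
    assert compiled_pattern.search(text)

def re_compiled_via_module():
    # module-level function: re._compile runs on every single call
    assert re.search(compiled_pattern, text)

print('compiled.search      ', timeit(re_compiled_direct))
print('re.search(compiled)  ', timeit(re_compiled_via_module))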
This performance difference is due to KeyErrors in the cache and to slow hash calculations for compiled patterns.
re functions and SRE_Pattern methods

Any time an re function is called with a pattern as first argument (e.g. re.search(pattern, string) or re.findall(pattern, string)), Python first tries to compile the pattern with _compile and then calls the corresponding method on the compiled pattern. For example:
def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)
Note that pattern could be either a string or an already compiled pattern (an SRE_Pattern instance).
Here's a compact version of _compile. I simply removed the debug and flags checks:
_cache = {}
_pattern_type = type(sre_compile.compile("", 0))
_MAXCACHE = 512

def _compile(pattern, flags):
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    loc = None
    _cache[type(pattern), pattern, flags] = p, loc
    return p
_compile with a string pattern

When _compile is called with a string pattern, the compiled pattern is saved in the _cache dict. The next time the same function is called (e.g. during the many timeit runs), _compile simply checks in _cache whether this string has already been seen and returns the corresponding compiled pattern.
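The effect of that cache can be made visible with re.purge(), which empties it (a rough sketch; the purge call itself adds a little overhead to the cache-miss case):

from timeit import timeit
import re

text = 'Lorem ipsum dolor sit amet, sed do eiusmod tempor'

def cache_hit():
    re.search('sed', text)   # after the first call, _compile finds 'sed' in _cache

def cache_miss():
    re.purge()               # empty the internal cache, forcing a full recompilation
    re.search('sed', text)

print('cache hit ', timeit(cache_hit, number=100_000))
print('cache miss', timeit(cache_miss, number=100_000))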
Using the ipdb debugger in Spyder, it's easy to dive into re.py during execution.
import re
pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod ' \
       'tempor incididunt ut labore et dolore magna aliqua.'
compiled_pattern = re.compile(pattern)
re.search(pattern, text)
re.search(pattern, text)
With a breakpoint at the second re.search(pattern, text), it can be seen that the pair:
{(<class 'str'>, 'sed', 0): (re.compile('sed'), None)}
is saved in _cache. The compiled pattern is returned directly.
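The same pair can be seen without a debugger by peeking at the module-private _cache dict (an implementation detail, so the exact shape of the cached value differs between Python versions):

import re

re.purge()                          # start from an empty cache
re.search('sed', 'sed do eiusmod')
print(re._cache)                    # the entry keyed by (str, 'sed', 0) is now cached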
_compile with a compiled pattern

What happens if _compile is called with an already compiled pattern?

First, _compile checks whether the pattern is in _cache. To do so, it needs to calculate its hash. This calculation is much slower for a compiled pattern than for a string:
In [1]: import re
In [2]: pattern = "(?:a(?:b(?:b\\é|sorbed)|ccessing|gar|l(?:armists|ternation)|ngels|pparelled|u(?:daciousness's|gust|t(?:horitarianism's|obiographi
...: es)))|b(?:aden|e(?:nevolently|velled)|lackheads|ooze(?:'s|s))|c(?:a(?:esura|sts)|entenarians|h(?:eeriness's|lorination)|laudius|o(?:n(?:form
...: ist|vertor)|uriers)|reeks)|d(?:aze's|er(?:elicts|matologists)|i(?:nette|s(?:ciplinary|dain's))|u(?:chess's|shanbe))|e(?:lectrifying|x(?:ampl
...: ing|perts))|farmhands|g(?:r(?:eased|over)|uyed)|h(?:eft|oneycomb|u(?:g's|skies))|i(?:mperturbably|nterpreting)|j(?:a(?:guars|nitors)|odhpurs
...: 's)|kindnesses|m(?:itterrand's|onopoly's|umbled)|n(?:aivet\\é's|udity's)|p(?:a(?:n(?:els|icky|tomimed)|tios)|erpetuating|ointer|resentation|
...: yrite)|r(?:agtime|e(?:gret|stless))|s(?:aturated|c(?:apulae|urvy's|ylla's)|inne(?:rs|d)|m(?:irch's|udge's)|o(?:lecism's|utheast)|p(?:inals|o
...: onerism's)|tevedore|ung|weetest)|t(?:ailpipe's|easpoon|h(?:ermionic|ighbone)|i(?:biae|entsin)|osca's)|u(?:n(?:accented|earned)|pstaging)|v(?
...: :alerie's|onda)|w(?:hirl|ildfowl's|olfram)|zimmerman's)"
In [3]: compiled_pattern = re.compile(pattern)
In [4]: %timeit hash(pattern)
126 ns ± 0.358 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [5]: %timeit hash(compiled_pattern)
7.67 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
hash(compiled_pattern) is 60 times slower than hash(pattern) here.
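Part of that gap is explained by the fact that CPython str objects cache their hash after it has been computed once (an observation about CPython internals, not something measured above), so repeated cache lookups never have to rescan the long pattern string:

from timeit import timeit

s = 'x' * 1_000_000
# the first hash() call walks the whole string; every later call returns the cached value
print(timeit(lambda: hash(s)))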
KeyError

When a pattern is unknown, _cache[type(pattern), pattern, flags] fails with a KeyError.
The KeyError gets handled and ignored. Only then does _compile check whether the pattern is already compiled. If it is, it is returned without being written to the cache.
This means that the next time _compile is called with the same compiled pattern, it will calculate the useless, slow hash again and will still fail with a KeyError.
Error handling is expensive, and I suppose that's the main reason why re.search(compiled_pattern, text) is slower than re.search(pattern, text).
This weird behaviour might be a choice made to speed up calls with string patterns, but it might have been a good idea to emit a warning when _compile is called with an already compiled pattern.