In another SO question, the performance of regexes and Python's in operator was compared. However, the accepted answer uses re.match, which only matches at the beginning of a string and thus behaves completely differently from in. Also, I wanted to see the performance gain of not recompiling the RE each time. To my surprise, the precompiled pattern turns out to be the slowest of the regex variants (see the output below). Any ideas why?
I am aware that there are quite a few other questions here that wonder about a similar issue. Most of their benchmarks behave the way they do simply because they do not correctly reuse the compiled regex. If that is also my issue, please explain.
from timeit import timeit
import re

pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod ' \
       'tempor incididunt ut labore et dolore magna aliqua.'
compiled_pattern = re.compile(pattern)

def find():
    assert text.find(pattern) > -1

def re_search():
    assert re.search(pattern, text)

def re_compiled():
    assert re.search(compiled_pattern, text)

def in_find():
    assert pattern in text

print('str.find     ', timeit(find))
print('re.search    ', timeit(re_search))
print('re (compiled)', timeit(re_compiled))
print('in           ', timeit(in_find))
Output:
str.find      0.36285957560356435
re.search     1.047689160564772
re (compiled) 1.575113873320307
in            0.1907925627077569
When the pattern fails to match at a given position, re.search retries it at the next position, and so on until the end of the string. This effectively has quadratic complexity, O(n²), where n is the length of the string. The problem can be resolved by anchoring your pattern, so it fails right away at positions where your pattern has no chance to match.
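For example (a rough sketch of my own, not from the benchmark above: the pattern a+b never matches a long run of a's, so the unanchored search re-scans the remaining characters from every starting position, while the \A-anchored variant only does real work at position 0):

from timeit import timeit
import re

for n in (1_000, 2_000, 4_000):
    haystack = 'a' * n  # contains no 'b', so neither pattern ever matches
    slow = timeit(lambda: re.search(r'a+b', haystack), number=5)    # retried at every position
    fast = timeit(lambda: re.search(r'\Aa+b', haystack), number=5)  # fails immediately after position 0
    print(f'n={n:5d}  unanchored={slow:.4f}s  anchored={fast:.4f}s')

Doubling n should roughly quadruple the unanchored time while only doubling the anchored one.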
I created a much simpler test that shows that compiled regular expressions are unquestionably faster than uncompiled ones. In my test, the compiled regular expression was about 35% faster than the uncompiled one.
Using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program. So my conclusion is: if you are going to match the same pattern against many different texts, you had better precompile it.
The compile() method is used to compile a regular expression pattern, provided as a string, into a regex pattern object (re.Pattern). Later we can use this pattern object to search for a match inside different target strings using regex methods such as re.match() or re.search().
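For instance (a minimal sketch of that workflow; the pattern and the target strings are just placeholders):

import re

word = re.compile(r'\bsed\b')  # compile the pattern once, getting a re.Pattern object
for line in ('sed do eiusmod', 'lorem ipsum', 'sed ut perspiciatis'):
    # reuse the same pattern object against several target strings
    print(bool(word.search(line)), line)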
If you call compiled_pattern.search(text) directly, it won't call _compile at all, so it will be faster than re.search(pattern, text) and much faster than re.search(compiled_pattern, text).
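A quick way to check this claim (a sketch reusing the variable names from the benchmark above; absolute timings will of course differ):

from timeit import timeit
import re

pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod ' \
       'tempor incididunt ut labore et dolore magna aliqua.'
compiled_pattern = re.compile(pattern)

def re_compiled_direct():
    # method call on the pattern object: re._compile is never involved
    assert compiled_pattern.search(text)

def re_compiled_via_module():
    # module-level function: re._compile runs on every single call
    assert re.search(compiled_pattern, text)

print('compiled.search      ', timeit(re_compiled_direct))
print('re.search(compiled)  ', timeit(re_compiled_via_module))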
This performance difference is due to KeyErrors in the cache and to slow hash calculations for compiled patterns.
re functions and SRE_Pattern methods

Any time an re function is called with a pattern as first argument (e.g. re.search(pattern, string) or re.findall(pattern, string)), Python first tries to compile the pattern with _compile and then calls the corresponding method on the compiled pattern. For example:
def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)
Note that pattern could be either a string or an already compiled pattern (an SRE_Pattern instance).
Here's a compact version of _compile. I simply removed the debug and flags checks:
_cache = {}
_pattern_type = type(sre_compile.compile("", 0))
_MAXCACHE = 512

def _compile(pattern, flags):
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    loc = None
    _cache[type(pattern), pattern, flags] = p, loc
    return p
_compile with a string pattern

When _compile is called with a string pattern, the compiled pattern is saved in the _cache dict. The next time the same function is called (e.g. during the many timeit runs), _compile simply checks in _cache whether this string has already been seen and returns the corresponding compiled pattern.
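The effect of that cache can be made visible with re.purge(), which empties it (a rough sketch; the purge call itself adds a little overhead to the cache-miss case):

from timeit import timeit
import re

text = 'Lorem ipsum dolor sit amet, sed do eiusmod tempor'

def cache_hit():
    re.search('sed', text)   # after the first call, _compile finds 'sed' in _cache

def cache_miss():
    re.purge()               # empty the internal cache, forcing a full recompilation
    re.search('sed', text)

print('cache hit ', timeit(cache_hit, number=100_000))
print('cache miss', timeit(cache_miss, number=100_000))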
Using the ipdb debugger in Spyder, it's easy to dive into re.py during execution.
import re
pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod ' \
       'tempor incididunt ut labore et dolore magna aliqua.'
compiled_pattern = re.compile(pattern)
re.search(pattern, text)
re.search(pattern, text)
With a breakpoint at the second re.search(pattern, text), it can be seen that the pair:
{(<class 'str'>, 'sed', 0): (re.compile('sed'), None)}
is saved in _cache. The compiled pattern is returned directly.
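The same pair can be seen without a debugger by peeking at the module-private _cache dict (an implementation detail, so the exact shape of the cached value differs between Python versions):

import re

re.purge()                          # start from an empty cache
re.search('sed', 'sed do eiusmod')
print(re._cache)                    # the entry keyed by (str, 'sed', 0) is now cached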
_compile with a compiled pattern

What happens if _compile is called with an already compiled pattern?

First, _compile checks whether the pattern is in _cache. To do so, it needs to calculate its hash. This calculation is much slower for a compiled pattern than for a string:
In [1]: import re
In [2]: pattern = "(?:a(?:b(?:b\\é|sorbed)|ccessing|gar|l(?:armists|ternation)|ngels|pparelled|u(?:daciousness's|gust|t(?:horitarianism's|obiographi
...: es)))|b(?:aden|e(?:nevolently|velled)|lackheads|ooze(?:'s|s))|c(?:a(?:esura|sts)|entenarians|h(?:eeriness's|lorination)|laudius|o(?:n(?:form
...: ist|vertor)|uriers)|reeks)|d(?:aze's|er(?:elicts|matologists)|i(?:nette|s(?:ciplinary|dain's))|u(?:chess's|shanbe))|e(?:lectrifying|x(?:ampl
...: ing|perts))|farmhands|g(?:r(?:eased|over)|uyed)|h(?:eft|oneycomb|u(?:g's|skies))|i(?:mperturbably|nterpreting)|j(?:a(?:guars|nitors)|odhpurs
...: 's)|kindnesses|m(?:itterrand's|onopoly's|umbled)|n(?:aivet\\é's|udity's)|p(?:a(?:n(?:els|icky|tomimed)|tios)|erpetuating|ointer|resentation|
...: yrite)|r(?:agtime|e(?:gret|stless))|s(?:aturated|c(?:apulae|urvy's|ylla's)|inne(?:rs|d)|m(?:irch's|udge's)|o(?:lecism's|utheast)|p(?:inals|o
...: onerism's)|tevedore|ung|weetest)|t(?:ailpipe's|easpoon|h(?:ermionic|ighbone)|i(?:biae|entsin)|osca's)|u(?:n(?:accented|earned)|pstaging)|v(?
...: :alerie's|onda)|w(?:hirl|ildfowl's|olfram)|zimmerman's)"
In [3]: compiled_pattern = re.compile(pattern)
In [4]: %timeit hash(pattern)
126 ns ± 0.358 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [5]: %timeit hash(compiled_pattern)
7.67 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
hash(compiled_pattern) is 60 times slower than hash(pattern) here.
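Part of that gap is explained by the fact that CPython str objects cache their hash after it has been computed once (an observation about CPython internals, not something measured above), so repeated cache lookups never have to rescan the long pattern string:

from timeit import timeit

s = 'x' * 1_000_000
# the first hash() call walks the whole string; every later call returns the cached value
print(timeit(lambda: hash(s)))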
KeyError

When a pattern is unknown, _cache[type(pattern), pattern, flags] fails with a KeyError.
The KeyError gets handled and ignored. Only then does _compile check whether the pattern is already compiled. If it is, it is returned without being written to the cache.
This means that the next time _compile is called with the same compiled pattern, it will calculate the useless, slow hash again and will still fail with a KeyError.
Error handling is expensive, and I suppose that's the main reason why re.search(compiled_pattern, text) is slower than re.search(pattern, text).
This weird behaviour might be a choice made to speed up calls with string patterns, but it might have been a good idea to emit a warning when _compile is called with an already compiled pattern.