When answering this question (and having read this answer to a similar question), I thought that I knew how Python caches regexes. But then I thought I'd test it, comparing two scenarios: <ol> <li>a single compilation of a simple regex, then 10 applications of that compiled regex.</li> <li>10 applications of an uncompiled regex (where I would have expected slightly worse performance because the regex would have to be compiled once, then cached, and then looked up in the cache 9 times).</li> </ol> However, the results were staggering (in Python 3.3): <pre class="prettyprint"><code>>>> import timeit >>> timeit.timeit(setup="import re", ... stmt='r=re.compile(r"\w+")\nfor i in range(10):\n r.search(" jkdhf ")') 18.547793477671938 >>> timeit.timeit(setup="import re", ... stmt='for i in range(10):\n re.search(r"\w+"," jkdhf ")') 106.47892003890324 </code></pre> That's over 5.7 times slower! In Python 2.7, there is still an increase by a factor of 2.5, which is also more than I would have expected. Has caching of regexes changed between Python 2 and 3? The docs don't seem to suggest that.

The code has changed. In Python 2.7, the cache is a simple dictionary; if more than <code>_MAXCACHE</code> items are stored in it, the whole the cache is cleared before storing a new item. A cache lookup only takes building a simple key and testing the dictionary, see the 2.7 implementation of <code>_compile()</code> In Python 3.x, the cache has been replaced by the <code>@functools.lru_cache(maxsize=500, typed=True)</code> decorator. This decorator does much more work and includes a thread-lock, adjusting the cache LRU queue and maintaining the cache statistics (accessible via <code>re._compile.cache_info()</code>). See the 3.3.0 implementation of <code>_compile()</code> and of <code>functools.lru_cache()</code>. Others have noticed the same slowdown, and filed issue 16389 in the Python bugtracker. I'd expect 3.4 to be a lot faster again; either the <code>lru_cache</code> implementation is improved or the <code>re</code> module will move to a custom cache again. Update: With revision 4b4dddd670d0 (hg) / 0f606a6 (git) the cache change has been reverted back to the simple version found in 3.1. Python versions 3.2.4 and 3.3.1 include that revision. Since then, in Python 3.7 the pattern cache was updated to a custom FIFO cache implementation based on a regular <code>dict</code> (relying on insertion order, and unlike a LRU, does not take into account how recently items already in the cache were used when evicting).

Why are uncompiled, repeatedly used regexes so much slower in Python 3?

Tags:

python

regex

caching

When answering this question (and having read this answer to a similar question), I thought that I knew how Python caches regexes.

But then I thought I'd test it, comparing two scenarios:

a single compilation of a simple regex, then 10 applications of that compiled regex.
10 applications of an uncompiled regex (where I would have expected slightly worse performance because the regex would have to be compiled once, then cached, and then looked up in the cache 9 times).

However, the results were staggering (in Python 3.3):

>>> import timeit
>>> timeit.timeit(setup="import re", 
... stmt='r=re.compile(r"\w+")\nfor i in range(10):\n r.search("  jkdhf  ")')
18.547793477671938
>>> timeit.timeit(setup="import re", 
... stmt='for i in range(10):\n re.search(r"\w+","  jkdhf  ")')
106.47892003890324

That's over 5.7 times slower! In Python 2.7, there is still an increase by a factor of 2.5, which is also more than I would have expected.

Has caching of regexes changed between Python 2 and 3? The docs don't seem to suggest that.

818

asked Feb 07 '13 17:02

Tim Pietzcker

1 Answers

The code has changed.

In Python 2.7, the cache is a simple dictionary; if more than _MAXCACHE items are stored in it, the whole the cache is cleared before storing a new item. A cache lookup only takes building a simple key and testing the dictionary, see the 2.7 implementation of _compile()

In Python 3.x, the cache has been replaced by the @functools.lru_cache(maxsize=500, typed=True) decorator. This decorator does much more work and includes a thread-lock, adjusting the cache LRU queue and maintaining the cache statistics (accessible via re._compile.cache_info()). See the 3.3.0 implementation of _compile() and of functools.lru_cache().

Others have noticed the same slowdown, and filed issue 16389 in the Python bugtracker. I'd expect 3.4 to be a lot faster again; either the lru_cache implementation is improved or the re module will move to a custom cache again.

Update: With revision 4b4dddd670d0 (hg) / 0f606a6 (git) the cache change has been reverted back to the simple version found in 3.1. Python versions 3.2.4 and 3.3.1 include that revision.

Since then, in Python 3.7 the pattern cache was updated to a custom FIFO cache implementation based on a regular dict (relying on insertion order, and unlike a LRU, does not take into account how recently items already in the cache were used when evicting).

144

answered Oct 01 '22 04:10

Martijn Pieters

Related questions
                            
                                Set PYTHONPATH in Emacs on MacOS?
                            
                                storing unbound python functions in a class object
                            
                                Stop execution of a script called with execfile
                            
                                ftplib checking if a file is a folder?
                            
                                MySQL LOAD DATA LOCAL INFILE example in python?
                            
                                Python Auto Importing [duplicate]
                            
                                How check if a task is already in python Queue?
                            
                                Race-condition creating folder in Python
                            
                                Python multiprocessing process vs. standalone Python VM
                            
                                Is there a multithreaded map() function? [closed]
                            
                                Subsetting data in Python
                            
                                python 3: how to check if an object is a function? [duplicate]
                            
                                Can a python program be run on a computer without Python? What about C/C++?
                            
                                How to use pipe in IPython
                            
                                Jinja2 ignore UndefinedErrors for objects that aren't found
                            
                                How to monkey patch Django?
                            
                                django querysets + memcached: best practices
                            
                                slices to immutable strings by reference and not copy
                            
                                UUID field added after data already in database. Is there any way to populate the UUID field for existing data?
                            
                                Python Opencv SolvePnP yields wrong translation vector

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With