When answering this question (and having read this answer to a similar question), I thought that I knew how Python caches regexes.
But then I thought I'd test it, comparing two scenarios: compiling the pattern once and reusing the compiled object, versus calling re.search with the pattern string on every iteration.
However, the results were staggering (in Python 3.3):
>>> import timeit
>>> timeit.timeit(setup="import re",
... stmt='r=re.compile(r"\w+")\nfor i in range(10):\n r.search(" jkdhf ")')
18.547793477671938
>>> timeit.timeit(setup="import re",
... stmt='for i in range(10):\n re.search(r"\w+"," jkdhf ")')
106.47892003890324
That's over 5.7 times slower! In Python 2.7, there is still an increase by a factor of 2.5, which is also more than I would have expected.
Has caching of regexes changed between Python 2 and 3? The docs don't seem to suggest that.
The code has changed.
In Python 2.7, the cache is a simple dictionary; if more than _MAXCACHE items are stored in it, the whole cache is cleared before a new item is stored. A cache lookup costs only building a simple key and testing the dictionary; see the 2.7 implementation of _compile().
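The 2.7 strategy can be sketched as follows. This is a simplified model, not the stdlib code: the names _cache and _MAXCACHE mirror the real module, but re.compile stands in for the internal compile step (which in CPython bypasses the cache).

```python
import re

_MAXCACHE = 100
_cache = {}

def cached_compile(pattern, flags=0):
    """Model of the 2.7 cache: one dict lookup, clear-all on overflow."""
    key = (type(pattern), pattern, flags)
    try:
        return _cache[key]  # fast path: build a key, test the dictionary
    except KeyError:
        pass
    compiled = re.compile(pattern, flags)  # stand-in for the real compile step
    if len(_cache) >= _MAXCACHE:
        _cache.clear()  # 2.7 drops the entire cache, not just one entry
    _cache[key] = compiled
    return compiled
```

The clear-all policy is crude, but the hit path is about as cheap as a cache can be, which is why it benchmarks so well.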
In Python 3.x, the cache has been replaced by the @functools.lru_cache(maxsize=500, typed=True) decorator. This decorator does much more work: it takes a thread lock, adjusts the LRU queue, and maintains cache statistics (accessible via re._compile.cache_info()). See the 3.3.0 implementations of _compile() and of functools.lru_cache().
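The 3.3-era approach amounts to wrapping the compile step in functools.lru_cache; a minimal sketch (again using re.compile as a stand-in for the internal compile step):

```python
from functools import lru_cache
import re

# Same decorator parameters as the 3.3 re module uses internally.
@lru_cache(maxsize=500, typed=True)
def compile_cached(pattern, flags=0):
    return re.compile(pattern, flags)

compile_cached(r"\w+")   # miss: compiles and caches
compile_cached(r"\w+")   # hit: served from the LRU cache
info = compile_cached.cache_info()  # CacheInfo(hits=1, misses=1, ...)
```

Every hit here still pays for the lock and the LRU bookkeeping, which is the overhead the benchmark above is measuring.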
Others have noticed the same slowdown and filed issue 16389 in the Python bug tracker. I'd expect 3.4 to be a lot faster again: either the lru_cache implementation will be improved, or the re module will move back to a custom cache.
Update: With revision 4b4dddd670d0 (hg) / 0f606a6 (git) the cache change has been reverted back to the simple version found in 3.1. Python versions 3.2.4 and 3.3.1 include that revision.
Since then, in Python 3.7 the pattern cache was updated to a custom FIFO cache implementation based on a regular dict (relying on insertion order; unlike an LRU cache, it does not take into account how recently items already in the cache were used when evicting).
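The FIFO idea can be sketched like this (a simplified model of the 3.7 approach, not the stdlib code; _MAXCACHE and the key shape are illustrative):

```python
import re

_MAXCACHE = 512
_cache = {}

def fifo_compile(pattern, flags=0):
    """Model of a FIFO cache built on dict insertion order."""
    key = (type(pattern), pattern, flags)
    compiled = _cache.get(key)
    if compiled is not None:
        return compiled  # a hit does NOT refresh the entry's position
    compiled = re.compile(pattern, flags)  # stand-in for the real compile step
    if len(_cache) >= _MAXCACHE:
        # Evict the first-inserted key (FIFO), regardless of how
        # recently it was hit -- this is the difference from an LRU.
        del _cache[next(iter(_cache))]
    _cache[key] = compiled
    return compiled
```

This keeps the hit path close to the 2.7 dict lookup while avoiding the clear-everything behaviour on overflow.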