I'm working through Doug Hellman's "The Python Standard Library by Example" and came across this:
"1.3.2 Compiling Expressions re includes module-level functions for working with regular expressions as text strings, but it is more efficient to compile the expressions a program uses frequently."
I couldn't follow his explanation for why this is the case. He says that the "module-level functions maintain a cache of compiled expressions" and that since the "size of the cache" is limited, "using compiled expressions directly avoids the cache lookup overhead."
I'd greatly appreciate it if someone could please explain or direct me to an explanation that I could better understand for why it is more efficient to compile the regular expressions a program uses frequently, and how this process actually works.
Hm. This is strange. My knowledge so far (gained, among other source, from this question) suggested my initial answer:
Python caches the last 100 regexes that you used, so even if you don't compile them explicitly, they don't have to be recompiled at every use.
However, there are two drawbacks: When the limit of 100 regexes is reached, the entire cache is nuked, so if you use 101 different regexes in a row, each one will be recompiled every time. Well, that's rather unlikely, but still.
Second, in order to find out if a regex has been compiled already, the interpreter needs to look up the regex in the cache every time which does take a little extra time (but not much since dictionary lookups are very fast).
So, if you explicitly compile your regexes, you avoid this extra lookup step.
I just did some testing (Python 3.3):
>>> import timeit
>>> timeit.timeit(setup="import re", stmt='''r=re.compile(r"\w+")\nfor i in range(10):\n r.search(" jkdhf ")''')
18.547793477671938
>>> timeit.timeit(setup="import re", stmt='''for i in range(10):\n re.search(r"\w+"," jkdhf ")''')
106.47892003890324
So it would appear that no caching is being done. Perhaps that's a quirk of the special conditions under which timeit.timeit()
runs?
On the other hand, in Python 2.7, the difference is not as noticeable:
>>> import timeit
>>> timeit.timeit(setup="import re", stmt='''r=re.compile(r"\w+")\nfor i in range(10):\n r.search(" jkdhf ")''')
7.248294908492429
>>> timeit.timeit(setup="import re", stmt='''for i in range(10):\n re.search(r"\w+"," jkdhf ")''')
18.26713670282241
I believe what he is trying to say is that you shouldn't compile your regex inside your loop, but outside it. You can then just run the already compiled code inside the loop.
instead of:
while true:
result = re.match('A', str)
You should put:
regex = re.compile('A')
while true:
result = regex.match(str)
Basically re.match(pattern, str)
combines the compilation and matching step. Compiling the same pattern inside the loop is inefficient, and so should be hoisted outside of the loop.
See Tim's answer for the correct reasoning.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With