Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compiling Regular Expressions in Python

I'm working through Doug Hellman's "The Python Standard Library by Example" and came across this:

"1.3.2 Compiling Expressions re includes module-level functions for working with regular expressions as text strings, but it is more efficient to compile the expressions a program uses frequently."

I couldn't follow his explanation for why this is the case. He says that the "module-level functions maintain a cache of compiled expressions" and that since the "size of the cache" is limited, "using compiled expressions directly avoids the cache lookup overhead."

I'd greatly appreciate it if someone could please explain or direct me to an explanation that I could better understand for why it is more efficient to compile the regular expressions a program uses frequently, and how this process actually works.

like image 761
user7186 Avatar asked Dec 12 '22 18:12

user7186


2 Answers

Hm. This is strange. My knowledge so far (gained, among other source, from this question) suggested my initial answer:


First answer

Python caches the last 100 regexes that you used, so even if you don't compile them explicitly, they don't have to be recompiled at every use.

However, there are two drawbacks: When the limit of 100 regexes is reached, the entire cache is nuked, so if you use 101 different regexes in a row, each one will be recompiled every time. Well, that's rather unlikely, but still.

Second, in order to find out if a regex has been compiled already, the interpreter needs to look up the regex in the cache every time which does take a little extra time (but not much since dictionary lookups are very fast).

So, if you explicitly compile your regexes, you avoid this extra lookup step.


Update

I just did some testing (Python 3.3):

>>> import timeit
>>> timeit.timeit(setup="import re", stmt='''r=re.compile(r"\w+")\nfor i in range(10):\n r.search("  jkdhf  ")''')
18.547793477671938
>>> timeit.timeit(setup="import re", stmt='''for i in range(10):\n re.search(r"\w+","  jkdhf  ")''')
106.47892003890324

So it would appear that no caching is being done. Perhaps that's a quirk of the special conditions under which timeit.timeit() runs?

On the other hand, in Python 2.7, the difference is not as noticeable:

>>> import timeit
>>> timeit.timeit(setup="import re", stmt='''r=re.compile(r"\w+")\nfor i in range(10):\n r.search("  jkdhf  ")''')
7.248294908492429
>>> timeit.timeit(setup="import re", stmt='''for i in range(10):\n re.search(r"\w+","  jkdhf  ")''')
18.26713670282241
like image 196
Tim Pietzcker Avatar answered Dec 15 '22 00:12

Tim Pietzcker


I believe what he is trying to say is that you shouldn't compile your regex inside your loop, but outside it. You can then just run the already compiled code inside the loop.

instead of:

while true: 
    result = re.match('A', str)

You should put:

regex = re.compile('A')
while true:
    result = regex.match(str)

Basically re.match(pattern, str) combines the compilation and matching step. Compiling the same pattern inside the loop is inefficient, and so should be hoisted outside of the loop.

See Tim's answer for the correct reasoning.

like image 21
Michael Avatar answered Dec 14 '22 23:12

Michael