
Caching compiled regex objects in Python?

Each time a Python file containing a large number of static regular expressions is imported, CPU cycles are spent compiling the pattern strings into their in-memory state machines.

a = re.compile("a.*b")
b = re.compile("c.*d")
...

Question: Is it possible to store these regular expressions in a cache on disk in a pre-compiled manner to avoid having to execute the regex compilations on each import?

Pickling a compiled pattern stores only a reference to re._compile plus the pattern string and flags, so unpickling compiles the pattern again anyway:

>>> import pickle
>>> import re
>>> x = re.compile(".*")
>>> pickle.dumps(x)
"cre\n_compile\np0\n(S'.*'\np1\nI0\ntp2\nRp3\n."
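The session above is from Python 2. A minimal sketch of the same check on Python 3, assuming the standard `pickle` and `re` modules behave as they do in current CPython; the variable names are illustrative only:

```python
import pickle
import re

pattern = re.compile(".*")
payload = pickle.dumps(pattern)

# The payload names re._compile and carries the pattern source and flags;
# it does not contain the compiled matching program itself.
print(payload)

# Unpickling calls re._compile again, so the pattern is rebuilt from its source.
restored = pickle.loads(payload)
print(restored.pattern == pattern.pattern)
```

The exact bytes depend on the pickle protocol in use, so only the presence of `_compile` and the pattern text is worth relying on here.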

And compiled re objects cannot be marshalled:

>>> import marshal
>>> import re
>>> x = re.compile(".*")
>>> marshal.dumps(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unmarshallable object
Sufian asked Sep 15 '08 18:09


2 Answers

Is it possible to store these regular expressions in a cache on disk in a pre-compiled manner to avoid having to execute the regex compilations on each import?

Not easily. You'd have to write a custom serializer that hooks into the C sre implementation of the Python regex engine. Any performance benefits would be vastly outweighed by the time and effort required.

First, have you actually profiled the code? I doubt that compiling regexes is a significant part of the application's run time. Remember that they are only compiled the first time the module is imported in the current process; thereafter, the module and its attributes are cached in memory.
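A rough way to check this before optimizing anything is to time the compilation directly. The sketch below is only an illustration: `PATTERNS` and `time_compilation` are hypothetical names, and the patterns stand in for whatever the real module defines.

```python
import re
import time

# Hypothetical stand-ins for a module's static patterns.
PATTERNS = [r"a.*b", r"c.*d", r"\d{4}-\d{2}-\d{2}", r"[A-Za-z_]\w*"]


def time_compilation(patterns):
    """Return (seconds spent compiling, list of compiled patterns)."""
    re.purge()  # clear re's internal cache so every pattern is really compiled
    start = time.perf_counter()
    compiled = [re.compile(p) for p in patterns]
    elapsed = time.perf_counter() - start
    return elapsed, compiled


elapsed, compiled = time_compilation(PATTERNS)
print(f"compiled {len(compiled)} patterns in {elapsed * 1e3:.3f} ms")
```

Comparing that figure with the program's total run time shows whether compilation matters at all. On Python 3.7 and later, running the interpreter with `-X importtime` gives a per-module breakdown of import cost as a cross-check.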

If you have a program that basically spawns once, compiles a bunch of regexes, and then exits, you could try re-engineering it to perform multiple tests in one invocation. Then you could re-use the regexes, as above.

Finally, you could compile the regexes into C-based state machines and then link them in with an extension module. While this would likely be more difficult to maintain, it would eliminate regex compilation entirely from your application.

John Millikin answered Sep 24 '22 23:09


Note that each module initializes itself only once during the life of an app, no matter how many times you import it. So if you compile your expressions at the module's global scope (i.e., not in a function), you should be fine.
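A small sketch of that behaviour, assuming a throwaway module named `static_patterns` (a hypothetical name) written to a temporary directory. The module body, including its `re.compile` calls, runs on the first import only; the second import is served from `sys.modules`.

```python
import importlib
import pathlib
import sys
import tempfile

# Hypothetical module that compiles its patterns at global scope.
MODULE_SOURCE = '''
import re
print("compiling patterns")  # executes only when the module body runs
A = re.compile("a.*b")
B = re.compile("c.*d")
'''

with tempfile.TemporaryDirectory() as tmp:
    pathlib.Path(tmp, "static_patterns.py").write_text(MODULE_SOURCE)
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()  # make sure the new file is visible to the import system
    try:
        first = importlib.import_module("static_patterns")   # runs the module body
        second = importlib.import_module("static_patterns")  # returns the cached module
    finally:
        sys.path.remove(tmp)

same_module = first is second
same_pattern = first.A is second.A
print(same_module, same_pattern)
```

The message from the module body appears once, and both imports hand back the same module object and the same compiled pattern objects.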

Toni Ruža answered Sep 25 '22 23:09