Each time a python file is imported that contains a large quantity of static regular expressions, cpu cycles are spent compiling the strings into their representative state machines in memory.
a = re.compile("a.*b")
b = re.compile("c.*d")
...
Question: Is it possible to store these regular expressions in a cache on disk in a pre-compiled manner to avoid having to execute the regex compilations on each import?
Pickling the object simply does the following, causing compilation to happen anyway:
>>> import pickle
>>> import re
>>> x = re.compile(".*")
>>> pickle.dumps(x)
"cre\n_compile\np0\n(S'.*'\np1\nI0\ntp2\nRp3\n."
And re
objects are unmarshallable:
>>> import marshal
>>> import re
>>> x = re.compile(".*")
>>> marshal.dumps(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: unmarshallable object
As you know, Python always internally compiles and caches regexes whenever you use them anyway (including calls to search() or match()), so using compile() method, you're only changing when the regex gets compiled.
If a Regex object is constructed with the RegexOptions. Compiled option, it compiles the regular expression to explicit MSIL code instead of high-level regular expression internal instructions. This allows .
Finditer method finditer() works exactly the same as the re. findall() method except it returns an iterator yielding match objects matching the regex pattern in a string instead of a list. It scans the string from left to right, and matches are returned in the iterator form.
Why are raw strings often used when creating Regex objects? Raw strings are used so that backslashes do not have to be escaped.
Is it possible to store these regular expressions in a cache on disk in a pre-compiled manner to avoid having to execute the regex compilations on each import?
Not easily. You'd have to write a custom serializer that hooks into the C sre
implementation of the Python regex engine. Any performance benefits would be vastly outweighed by the time and effort required.
First, have you actually profiled the code? I doubt that compiling regexes is a significant part of the application's run-time. Remember that they are only compiled the first time the module is imported in the current execution -- thereafter, the module and its attributes are cached in memory.
If you have a program that basically spawns once, compiles a bunch of regexes, and then exits, you could try re-engineering it to perform multiple tests in one invocation. Then you could re-use the regexes, as above.
Finally, you could compile the regexes into C-based state machines and then link them in with an extension module. While this would likely be more difficult to maintain, it would eliminate regex compilation entirely from your application.
Note that each module initializes itself only once during the life of an app, no matter how many times you import it. So if you compile your expressions at the module's global scope (ie. not in a function) you should be fine.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With