Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a way to really pickle compiled regular expressions in python?

I have a python console application that contains 300+ regular expressions. The set of regular expressions is fixed for each release. When users run the app, the entire set of regular expressions will be applied anywhere from once (a very short job) to thousands of times (a long job).

I would like to speed up the shorter jobs by compiling the regular expressions up front, pickle the compiled regular expressions to a file, and then load that file when the application is run.

The python re module is efficient and the regex compilation overhead is quite acceptable for long jobs. For short jobs, however, it is a large proportion of the overall run-time. Some users will want to run many small jobs to fit into their existing workflows. Compiling the regular expressions takes about 80ms. A short job might take 20ms-100ms excluding regular expression compilation. So for short jobs, the overhead can be 100% or more. This is with Python27 under both Windows and Linux.

The regular expressions must be applied with the DOTALL flag, so need to be compiled prior to use. A large compilation cache clearly doesn't help in this instances. As some have pointed out, the default method to serialise the compiled regular expression doesn't actually do much.

The re and sre modules compile the patterns into a little custom language with its own opcodes and some auxiliary data structures (e.g., for charsets used in an expression). The pickle function in re.py takes the easy way out. It is:

def _pickle(p):
    return _compile, (p.pattern, p.flags)

copy_reg.pickle(_pattern_type, _pickle, _compile)

I think that a good solution to the problem would be an update to the definition of _pickle in re.py that actually pickled the compiled pattern object. Unfortunately, this goes beyond my python skills. I bet, however, that someone here knows how to do it.

I realise that I am not the first person to ask this question - but perhaps you can be the first person to give an accurate and useful response to it!

Your advice would be greatly appreciated.

like image 491
Adam Avatar asked Oct 27 '10 20:10

Adam


People also ask

What does\ d+ mean in Python?

What is this doing? Well, \D matches any character except a numeric digit, and + means 1 or more. So \D+ matches one or more characters that are not digits. This is what we're using instead of a literal hyphen, to try to match different separators.

What is Python regex compile?

Python's re. compile() method is used to compile a regular expression pattern provided as a string into a regex pattern object ( re. Pattern ). Later we can use this pattern object to search for a match inside different target strings using regex methods such as a re.

What is the re library in Python?

A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).

How does regex work Python?

Regular Expressions, also known as “regex” or “regexp”, are used to match strings of text such as particular characters, words, or patterns of characters. It means that we can match and extract any string pattern from the text with the help of regular expressions.


2 Answers

As others have mentioned, you can simply pickle the compiled regex. They will pickle and unpickle just fine, and be usable. However, it doesn't look like the pickle actually contains the result of compilation. I suspect you will incur the compilation overhead again when you use the result of the unpickling.

>>> p.dumps(re.compile("a*b+c*"))
"cre\n_compile\np1\n(S'a*b+c*'\np2\nI0\ntRp3\n."
>>> p.dumps(re.compile("a*b+c*x+y*"))
"cre\n_compile\np1\n(S'a*b+c*x+y*'\np2\nI0\ntRp3\n."

In these two tests, you can see the only difference between the two pickles is in the string. Apparently compiled regexes don't pickle the compiled bits, just the string needed to compile it again.

But I'm wondering about your application overall: compiling a regex is a fast operation, how short are your jobs that compiling the regex is significant? One possibility is that you are compiling all 300 regexes, and then only using one for a short job. In that case, don't compile them all up front. The re module is very good at using cached copies of compiled regexes, so you generally don't have to compile them yourself, just use the string form. The re module will lookup the string in a dictionary of compiled regexes, so grabbing the compiled form yourself only saves you a dictionary look up. I may be totally off-base, sorry if so.

like image 164
Ned Batchelder Avatar answered Sep 28 '22 09:09

Ned Batchelder


Just compile as you go - re module will cache the compiled re's even if you dont. Bump the re._MAXCACHE up to 400 or 500, the short jobs will only compile the re's they need, and the long jobs benefit from a big fat cache of compiled expressions - everybody's happy!

like image 42
PaulMcG Avatar answered Sep 28 '22 11:09

PaulMcG