From Perl's documentation:
study
takes extra time to study SCALAR ($_ if unspecified) in anticipation of doing many pattern matches on the string before it is next modified. This may or may not save time, depending on the nature and number of patterns you are searching and the distribution of character frequencies in the string to be searched;
I'm trying to speed up some regular expression-driven parsing that I'm doing in Python, and I remembered this trick from Perl. I realize I'll have to benchmark to determine if there is a speedup, but I can't find an equivalent method in Python.
Perl’s study doesn’t really do much anymore. The regex compiler has gotten a whole lot smarter than it was when study was created.
For example, it compiles alternatives into a trie structure with Aho–Corasick prediction.
Run with perl -Mre=debug to see the sorts of cleverness the regex compiler and execution engine apply.
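To make the trie idea a bit more concrete, here is a rough Python sketch of matching a set of literal alternatives through a trie. It only illustrates the data structure: it is not how Perl implements it, it has no Aho–Corasick failure links, and it returns the longest alternative rather than following Perl's leftmost-first alternation rule. The names build_trie and match_at are made up for this example.

```python
def build_trie(words):
    """Build a nested-dict trie from a list of literal alternatives."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        # End-of-word marker. The empty string can never be a character key.
        node[""] = word
    return root


def match_at(trie, text, start):
    """Return the longest alternative matching text at `start`, or None."""
    node, best = trie, None
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        # Remember the most recent complete alternative seen on this path.
        best = node.get("", best)
    return best


trie = build_trie(["foo", "foobar", "bar"])
print(match_at(trie, "foobarx", 0))
```

The point is that all alternatives sharing a prefix are checked in a single pass over the text, instead of trying each alternative in turn.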
As far as I know there's nothing like this built into Python. But according to the perldoc:
The way study works is this: a linked list of every character in the string to be searched is made, so we know, for example, where all the 'k' characters are. From each search string, the rarest character is selected, based on some static frequency tables constructed from some C programs and English text. Only those places that contain this "rarest" character are examined.
This doesn't sound very sophisticated, and you could probably hack together something equivalent yourself.
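A minimal sketch of what that might look like in Python, for literal search strings only. One assumption to flag: instead of Perl's static frequency tables, this picks the rarest character using the actual counts in the studied string. The names study and find_all are just for illustration.

```python
from collections import defaultdict


def study(text):
    """Map each character to the list of positions where it occurs."""
    positions = defaultdict(list)
    for i, ch in enumerate(text):
        positions[ch].append(i)
    return positions


def find_all(needle, text, positions):
    """Return the start offsets of every occurrence of the literal `needle`."""
    if not needle:
        return []
    # Pick the needle character that is rarest in this particular text
    # (assumption: real counts, not Perl's static frequency tables).
    offset, rare = min(
        enumerate(needle), key=lambda p: len(positions.get(p[1], ()))
    )
    hits = []
    # Only examine the places where the rarest character actually occurs.
    for pos in positions.get(rare, ()):
        start = pos - offset
        if start >= 0 and text.startswith(needle, start):
            hits.append(start)
    return hits


text = "the quick brown fox kicks the keg"
index = study(text)          # pay the indexing cost once
print(find_all("ck", text, index))
print(find_all("the", text, index))
```

Whether this beats plain str.find or a compiled regex is something you'd have to benchmark. In CPython the built-in search runs in C, so a pure-Python index like this may well be slower.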
esmre is vaguely similar. And as @Frg noted, you'll want to use re.compile if you're reusing a single regex (to avoid re-parsing the regex itself over and over).
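For example, something along these lines compiles the pattern once and reuses it inside the loop. The pattern and the parse function are made up for illustration. Note that the re module also keeps an internal cache of recently compiled patterns, so the gain from compiling up front is often modest.

```python
import re

# Compiled once at import time, reused for every line.
TOKEN = re.compile(r"(?P<key>\w+)=(?P<value>\d+)")


def parse(lines):
    """Extract (key, int value) pairs from an iterable of text lines."""
    pairs = []
    for line in lines:
        for m in TOKEN.finditer(line):
            pairs.append((m.group("key"), int(m.group("value"))))
    return pairs


print(parse(["a=1 b=22", "c=3"]))
```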
Or you could use suffix trees (here's one implementation, or here's a C extension with unicode support) or suffix arrays (implementation).