Is there any regular expression engine that does Just-In-Time compiling during regex pattern parsing and use when matching / replacing the texts? Or where can I learn JIT for i386 or x64 architecture?
I was recently trying to benchmark Python’s built-in regex engine compared with normal C code with around 10MB of data.
I found that for a straightforward replacing (for example ab
to zzz
) it’s relatively fast: just 2 to 3 times slower than C.
But for [a-z]c
it took around 5 to 8 times as much time as C.
And with grouping (e.g. ([a-z])(c)
to AA\2\1BB
) it took 20 to 40 times as much time as C.
It’s not Just-In-Time compiling yet, but I think, if I could do Just-In-Time compiling, It could speed up a lot more.
PS: I use profiling for each regex pattern during compiling patterns,
for example, profile 1 for simple one like ab
, profile 2 for range [a-z]c
, profile 3 with grouping ([a-z])(c)
, each profile has separate codes, so no extra cost needed when matching and replacing simple patterns.
I have tried it with psyco, and it doesn’t improve the speed that much. May be because I am doing text replacing against big data, not looping many times.
If I am not wrong, Python’s re.sub
is running it natively already I think, so pysco cannot improve the speed that much.
I have tried with boost regex wrapped into python, but it’s even slower than Python’s regex, so it seems like the bottleneck is in Python’s string processing and Jan Goyvaerts has also pointed me to that in the answer.
I’d like to convert regex pattern ab[a-z]c
to machine code, like the following equivalent C code (*s
points to 10MB long texts):
do{
if(*s=='a' && s[1]=='b' && s[2]>='a' && s[2]<='z' && s[3]=='c') return 1;
}while(*s++);
return 0;
Any ideas?
There are also two types of regular expressions: the "Basic" regular expression, and the "extended" regular expression. A few utilities like awk and egrep use the extended expression. Most use the "basic" regular expression. From now on, if I talk about a "regular expression," it describes a feature in both types.
Regular expressions are useful in any scenario that benefits from full or partial pattern matches on strings. These are some common use cases: verify the structure of strings. extract substrings form structured strings.
Despite being hard to read, hard to validate, hard to document and notoriously hard to master, regexes are still widely used today. Supported by all modern programming languages, text processing programs and advanced text editors, regexes are now used in more than a third of both Python and JavaScript projects.
The compile() function takes as input a simple regular expression and produces a compiled expression that can be used with the step() and advance() functions. The first parameter instring is never used explicitly by compile(). It is a pointer to a character string defining a source regular expression.
PCRE has a JIT compiler since 8.20. You can read about here: http://sljit.sourceforge.net/pcre.html
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With