Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there any regular expression engine that does Just-In-Time compiling? [closed]

Tags:

c++

python

c

regex

jit

My Questions is

Is there any regular expression engine that does Just-In-Time compiling during regex pattern parsing and use when matching / replacing the texts? Or where can I learn JIT for i386 or x64 architecture?

Why I need it

I was recently trying to benchmark Python’s built-in regex engine compared with normal C code with around 10MB of data.

I found that for a straightforward replacing (for example ab to zzz) it’s relatively fast: just 2 to 3 times slower than C.

But for [a-z]c it took around 5 to 8 times as much time as C.

And with grouping (e.g. ([a-z])(c) to AA\2\1BB ) it took 20 to 40 times as much time as C.

It’s not Just-In-Time compiling yet, but I think, if I could do Just-In-Time compiling, It could speed up a lot more.

PS: I use profiling for each regex pattern during compiling patterns, for example, profile 1 for simple one like ab, profile 2 for range [a-z]c, profile 3 with grouping ([a-z])(c), each profile has separate codes, so no extra cost needed when matching and replacing simple patterns.

Update 1

I have tried it with psyco, and it doesn’t improve the speed that much. May be because I am doing text replacing against big data, not looping many times.

If I am not wrong, Python’s re.sub is running it natively already I think, so pysco cannot improve the speed that much.

Update 2

I have tried with boost regex wrapped into python, but it’s even slower than Python’s regex, so it seems like the bottleneck is in Python’s string processing and Jan Goyvaerts has also pointed me to that in the answer.

Update

I’d like to convert regex pattern ab[a-z]c to machine code, like the following equivalent C code (*s points to 10MB long texts):

do{
    if(*s=='a' && s[1]=='b' && s[2]>='a' && s[2]<='z' && s[3]=='c') return 1;
}while(*s++);
return 0;

Any ideas?

like image 934
YOU Avatar asked Nov 20 '09 08:11

YOU


People also ask

What are different types of regular expression?

There are also two types of regular expressions: the "Basic" regular expression, and the "extended" regular expression. A few utilities like awk and egrep use the extended expression. Most use the "basic" regular expression. From now on, if I talk about a "regular expression," it describes a feature in both types.

Which are 3 uses of regular expression?

Regular expressions are useful in any scenario that benefits from full or partial pattern matches on strings. These are some common use cases: verify the structure of strings. extract substrings form structured strings.

Are regular expressions still used?

Despite being hard to read, hard to validate, hard to document and notoriously hard to master, regexes are still widely used today. Supported by all modern programming languages, text processing programs and advanced text editors, regexes are now used in more than a third of both Python and JavaScript projects.

How do you compile in regular expressions?

The compile() function takes as input a simple regular expression and produces a compiled expression that can be used with the step() and advance() functions. The first parameter instring is never used explicitly by compile(). It is a pointer to a character string defining a source regular expression.


1 Answers

PCRE has a JIT compiler since 8.20. You can read about here: http://sljit.sourceforge.net/pcre.html

like image 69
dark100 Avatar answered Oct 20 '22 00:10

dark100