I have heard from time to time from people, who said that regex in Perl is faster than in other languages. Also, some online documents also say Perl has advantages when it comes to regex processing. Can you guys explain if this is true, and why?
I needed to get an measure of the difference in performance between Perl and Java for a simple client application, so I wrote the traditional 'Hello World' app in both and ran a bunch of executions averaging over the time from start to end of execution. The net result: Perl is around 34 times faster than Java.
Regex is faster for large string than an if (perhaps in a for loops) to check if anything matches your requirement. If you are using regex as to match very small text and small pattern and don't do it because the matcher function . find() is slower than a normal if statement of a switch statement. Save this answer.
Conclusion. The regex engine in Perl is much faster than the regex engine of Python. The are both slower than grep as measured when I compares Python with grep.
Regular expressions are not always faster than a simple search. It all depends on context. It depends on the complexity of the expression, the length of the document being searched, and a whole host of factors. What happens is that the regular expression will be compiled into a simple parser (which takes time).
Why would you consider the speed of two engines when one of them (Java's) is notably buggy? (Search for writings by Tom "tchrist" Christiansen on the subject.) For example, \s
fails to match many space characters.
Also, some online documents also say Perl has advantages when it comes to regex processing.
Here are some:
You may have a look at this benchmark. In the table, column patmch:1t
gives the time on matching URL with /([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/?[^ ]*)/
, while column patmch:2t
on matching URL or email with /([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/?[^ ]*)|([^ @]+)@([^ @]+)/
(note the |
operator). For the first pattern, Perl is about 10X faster than Java; for the second, they are about the same.
In general, Perl uses a backtrack regex engine. Such an engine is flexible, easy to implement and very fast on a subset of regex. However, for other types of regex, for example when there is the |
operator, it may become very slow. In the extreme case, its match speed is exponential in the length of the pattern. Another type of regex engine is based on NFA. It is harder to implement but has stable performance (quadratic at the worst IIRC) for all types of input. Russ Cox has several articles about these topics, which I like a lot.
I don't know what types of regex engine Java is using, but from the benchmark, its performance does not seem impressive. You may also be interested in this benchmark which evaluates several C/C++ libraries on regex.
EDIT: In both benchmarks, patterns are tested against an old version of Linux Howto. The vast majority of lines do not have a match.
About DFA vs. NFA: if I am right, a pure DFA cannot capture groups, at least not easily. Only NFA can capture groups. I heard that RE2 transform local NFA to DFA for the part of regex without group captures. I do not know if this is true.
On PCRE: PCRE has the same problem as Perl - inefficient given complex alternations. You may have a look at the regex-dna benchmark from the Computer Language Benchmarks Game. Versions using PCRE are all much slower than the fastest version that is using TCL (maybe PCRE is not using trie?). V8 is clearly the winner in that benchmark because it does not use backtrack. IMO, for C++ programmers, the best regex library is RE2.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With