Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is regex in perl faster than in Java or other languages? [closed]

I have heard from time to time from people, who said that regex in Perl is faster than in other languages. Also, some online documents also say Perl has advantages when it comes to regex processing. Can you guys explain if this is true, and why?

like image 478
Simo Avatar asked Jun 14 '13 23:06

Simo


People also ask

Is Perl faster than Java?

I needed to get an measure of the difference in performance between Perl and Java for a simple client application, so I wrote the traditional 'Hello World' app in both and ran a bunch of executions averaging over the time from start to end of execution. The net result: Perl is around 34 times faster than Java.

Is regex fast in Java?

Regex is faster for large string than an if (perhaps in a for loops) to check if anything matches your requirement. If you are using regex as to match very small text and small pattern and don't do it because the matcher function . find() is slower than a normal if statement of a switch statement. Save this answer.

Is Perl faster than grep?

Conclusion. The regex engine in Perl is much faster than the regex engine of Python. The are both slower than grep as measured when I compares Python with grep.

Are regex fast?

Regular expressions are not always faster than a simple search. It all depends on context. It depends on the complexity of the expression, the length of the document being searched, and a whole host of factors. What happens is that the regular expression will be compiled into a simple parser (which takes time).


2 Answers

Why would you consider the speed of two engines when one of them (Java's) is notably buggy? (Search for writings by Tom "tchrist" Christiansen on the subject.) For example, \s fails to match many space characters.

Also, some online documents also say Perl has advantages when it comes to regex processing.

Here are some:

  • Has many features you cannot find in other engines, either because the other engines haven't copied them yet, or because their design does not permit them to support those features.
  • Highly optimised. Many of these optimisations help to report failed matches sooner, something not covered by many benchmarks.
  • A leader in Unicode support. It's support is so advanced that our developers are finding problems with the Unicode standard itself and have worked to have them resolved!
  • Remarkly bug free.
like image 134
ikegami Avatar answered Oct 06 '22 01:10

ikegami


You may have a look at this benchmark. In the table, column patmch:1t gives the time on matching URL with /([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/?[^ ]*)/, while column patmch:2t on matching URL or email with /([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/?[^ ]*)|([^ @]+)@([^ @]+)/ (note the | operator). For the first pattern, Perl is about 10X faster than Java; for the second, they are about the same.

In general, Perl uses a backtrack regex engine. Such an engine is flexible, easy to implement and very fast on a subset of regex. However, for other types of regex, for example when there is the | operator, it may become very slow. In the extreme case, its match speed is exponential in the length of the pattern. Another type of regex engine is based on NFA. It is harder to implement but has stable performance (quadratic at the worst IIRC) for all types of input. Russ Cox has several articles about these topics, which I like a lot.

I don't know what types of regex engine Java is using, but from the benchmark, its performance does not seem impressive. You may also be interested in this benchmark which evaluates several C/C++ libraries on regex.

EDIT: In both benchmarks, patterns are tested against an old version of Linux Howto. The vast majority of lines do not have a match.

About DFA vs. NFA: if I am right, a pure DFA cannot capture groups, at least not easily. Only NFA can capture groups. I heard that RE2 transform local NFA to DFA for the part of regex without group captures. I do not know if this is true.

On PCRE: PCRE has the same problem as Perl - inefficient given complex alternations. You may have a look at the regex-dna benchmark from the Computer Language Benchmarks Game. Versions using PCRE are all much slower than the fastest version that is using TCL (maybe PCRE is not using trie?). V8 is clearly the winner in that benchmark because it does not use backtrack. IMO, for C++ programmers, the best regex library is RE2.

like image 45
user172818 Avatar answered Oct 05 '22 23:10

user172818