Is regex in perl faster than in Java or other languages? [closed]

Tags:

I have heard from time to time from people, who said that regex in Perl is faster than in other languages. Also, some online documents also say Perl has advantages when it comes to regex processing. Can you guys explain if this is true, and why?

478

asked Jun 14 '13 23:06

Simo

2 Answers

Why would you consider the speed of two engines when one of them (Java's) is notably buggy? (Search for writings by Tom "tchrist" Christiansen on the subject.) For example, \s fails to match many space characters.

Also, some online documents also say Perl has advantages when it comes to regex processing.

Here are some:

Has many features you cannot find in other engines, either because the other engines haven't copied them yet, or because their design does not permit them to support those features.
Highly optimised. Many of these optimisations help to report failed matches sooner, something not covered by many benchmarks.
A leader in Unicode support. It's support is so advanced that our developers are finding problems with the Unicode standard itself and have worked to have them resolved!
Remarkly bug free.

134

answered Oct 06 '22 01:10

ikegami

You may have a look at this benchmark. In the table, column patmch:1t gives the time on matching URL with /([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/?[^ ]*)/, while column patmch:2t on matching URL or email with /([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/?[^ ]*)|([^ @]+)@([^ @]+)/ (note the | operator). For the first pattern, Perl is about 10X faster than Java; for the second, they are about the same.

In general, Perl uses a backtrack regex engine. Such an engine is flexible, easy to implement and very fast on a subset of regex. However, for other types of regex, for example when there is the | operator, it may become very slow. In the extreme case, its match speed is exponential in the length of the pattern. Another type of regex engine is based on NFA. It is harder to implement but has stable performance (quadratic at the worst IIRC) for all types of input. Russ Cox has several articles about these topics, which I like a lot.

I don't know what types of regex engine Java is using, but from the benchmark, its performance does not seem impressive. You may also be interested in this benchmark which evaluates several C/C++ libraries on regex.

EDIT: In both benchmarks, patterns are tested against an old version of Linux Howto. The vast majority of lines do not have a match.

About DFA vs. NFA: if I am right, a pure DFA cannot capture groups, at least not easily. Only NFA can capture groups. I heard that RE2 transform local NFA to DFA for the part of regex without group captures. I do not know if this is true.

On PCRE: PCRE has the same problem as Perl - inefficient given complex alternations. You may have a look at the regex-dna benchmark from the Computer Language Benchmarks Game. Versions using PCRE are all much slower than the fastest version that is using TCL (maybe PCRE is not using trie?). V8 is clearly the winner in that benchmark because it does not use backtrack. IMO, for C++ programmers, the best regex library is RE2.

answered Oct 05 '22 23:10

user172818

Related questions
                            
                                Get only digits using regexp
                            
                                Regex for hour, minute and second format HH:MM:SS
                            
                                Remove trailing numbers from string js regexp
                            
                                filter DataFrame with Regex with Spark in Scala
                            
                                Python regex: how to replace each instance of an occurrence with a different value?
                            
                                Javascript regular expression that matches two strings
                            
                                regex for time validation
                            
                                Regular expression to allow backslash in C#
                            
                                Splitting a string in C#
                            
                                Regular Expression for UpperCase Letters In A String
                            
                                regex for email validation [duplicate]
                            
                                Using Vim, how can I make CSS rules into one liners?
                            
                                Regex: only alphanumeric but not if this is pure numeric
                            
                                How to convert a vector of strings to Title Case
                            
                                Regular Expression for domain from email address
                            
                                Replace non-numeric characters
                            
                                Regex to extract attribute value
                            
                                Bash Script Regular Expressions...How to find and replace all matches?
                            
                                Regex to match a specific group of digits of certain length?
                            
                                TextField Validation With Regular Expression

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Is regex in perl faster than in Java or other languages? [closed]

Tags:

performance

regex

perl

Simo

People also ask

2 Answers

ikegami

user172818

Recent Activity

Donate For Us