How to order regular expression alternatives to get longest match?

Tags: language-agnostic, regex

I have a number of regular expressions regex1, regex2, ..., regexN combined into a single regex as regex1|regex2|...|regexN. I would like to reorder the component expressions so that the combined expression gives the longest possible match at the beginning of a given string.

I believe this means reordering the regular expressions such that "if regexK matches a prefix of regexL, then L < K". If this is correct, is it possible to find out, in general, whether regexK can match a prefix of regexL?
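For the special case where every component is a plain literal string (an assumption; the question does not say so), the ordering condition above can be met by sorting the literals longest-first, since a literal can only be a proper prefix of a longer literal. A minimal Python sketch, where build_literal_alternation is a hypothetical helper name and the result relies on Python's re module trying alternatives from left to right:

```python
import re

def build_literal_alternation(words):
    """Join literal strings into one alternation, longest literal first."""
    # A literal can only be a proper prefix of a longer literal, so sorting
    # by length (descending) puts every "L" before any "K" it extends.
    ordered = sorted(words, key=len, reverse=True)
    # re.escape keeps characters such as "." or "+" literal.
    return "|".join(re.escape(w) for w in ordered)

pattern = build_literal_alternation(["Set", "SetValue", "SetValueAt"])
print(pattern)                                      # SetValueAt|SetValue|Set
print(re.match(pattern, "SetValueAt(3)").group(0))  # SetValueAt
```

This does not cover the general case, where the components are arbitrary regular expressions rather than literals.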


asked Mar 14 '16 20:03

user200783

1 Answer

Use the right regex flavor!

In some regex flavors, the alternation providing the longest match is the one that is used ("greedy alternation"). Note that most of these regex flavors are old (yet still used today), and thus lack some modern constructs such as back references.

Perl6 is modern (and has many features), yet defaults to the POSIX-style longest alternation. (You can even switch styles, as || creates an alternation that short-circuits to the first match.) Note that the :Perl5/:P5 modifier is needed in order to use the "traditional" regex style.

Also, PCRE and the newer PCRE2 have functions that do the same. In PCRE2, it's pcre2_dfa_match. (See the section "Relevant information about regex engine design" below for more information about DFAs.)

This means you can list the alternatives in ANY order, and the result will always be the longest match.

(This is different from the "absolute longest" match, as no amount of rearranging the terms in an alternation will change the fact that all regex engines traverse the string left-to-right. With the exception of .NET, apparently, which can go right-to-left. But traversing the string backwards wouldn't guarantee the "absolute longest" match either.) If you really want to find matches at (only) the beginning of a string, you should anchor the expression: ^(regex1|regex2|...).

According to this page*:

The POSIX standard, however, mandates that the longest match be returned. When applying Set|SetValue to SetValue, a POSIX-compliant regex engine will match SetValue entirely.

* Note: I do not have the ability to test every POSIX flavor. Also, some regex flavors (Perl6) have this behavior without being POSIX compliant overall.
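For contrast, here is the same Set|SetValue example in an engine that does not follow the POSIX rule. This is a small sketch using Python's re module, which tries alternatives from left to right and accepts the first one that lets the overall match succeed, so the order of the alternatives changes the result:

```python
import re

text = "SetValue"

# Alternatives are tried left to right; the first one that matches wins.
print(re.match(r"Set|SetValue", text).group(0))  # Set
print(re.match(r"SetValue|Set", text).group(0))  # SetValue
```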

Let me give you one specific example that I have verified on my own computer:

echo "ab c a" | sed -E 's/(a|ab)/replacement/'

The regex is (a|ab). When it runs on the string ab c a, you get: replacement c a, meaning that you do, in fact, get the longest match that the alternation can provide.

For a more complex example, the regex (a|ab.*c|.{0,2}c*d) applied to abcccd will return abcccd.
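To see why abcccd is the expected result, it may help to look at what each alternative can match on its own at the start of the string. The sketch below uses Python's re module purely as a way to run the three alternatives separately. Python is not a POSIX engine, so the combined pattern stops at the first alternative there.

```python
import re

text = "abcccd"
alternatives = [r"a", r"ab.*c", r".{0,2}c*d"]

# What each alternative matches by itself, anchored at the start of the text.
prefixes = {alt: re.match(alt, text).group(0) for alt in alternatives}
for alt, matched in prefixes.items():
    print(repr(alt), "->", repr(matched))

# A POSIX-style engine reports the longest of these candidates.
longest = max(prefixes.values(), key=len)
print(longest)  # abcccd

# Python's ordered alternation stops at the first alternative that matches.
ordered = re.match("|".join(alternatives), text).group(0)
print(ordered)  # a
```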


More clarification: the regex engine will not go forward (in the search string) to see if there is an even longer match once it can match something. It will only look through the current list of alternatives to see if another one will match a longer string (from the position where the initial match starts).

In other words, no matter the order of choices in an alternation, POSIX compliant regexes use the one that matches the most characters.
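The "leftmost first, then longest" behavior described above can be approximated in an engine that lacks it. The following is a rough Python sketch, not how a POSIX engine is actually implemented: leftmost_longest_search is a hypothetical helper that finds the earliest position where any alternative matches, then keeps the longest match at that position. It only emulates the top-level alternation; anything inside each component still follows Python's own backtracking rules.

```python
import re

def leftmost_longest_search(alternatives, text):
    """Return the longest match among alternatives at the earliest start position."""
    compiled = [re.compile(alt) for alt in alternatives]
    for start in range(len(text) + 1):
        # Try every alternative at this position, regardless of list order.
        candidates = [m for m in (p.match(text, start) for p in compiled)
                      if m is not None]
        if candidates:
            # Stop at the first position that matches; never look further right.
            return max(candidates, key=lambda m: m.end())
    return None

# The order of the alternatives no longer matters.
print(leftmost_longest_search(["a", "ab"], "ab c a").group(0))  # ab
print(leftmost_longest_search(["ab", "a"], "ab c a").group(0))  # ab

# A longer match that starts later is not considered.
print(leftmost_longest_search(["a", "bcdef"], "abcdef").group(0))  # a
```

Restricting the loop to start position 0 gives the "beginning of the string only" variant that the question asks for.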

Other examples of flavors with this behavior:

Tcl ARE
POSIX ERE
GNU BRE
GNU ERE

Relevant information about regex engine design

This question asks about designing an engine, but the answers may be helpful for understanding how these engines work. Essentially, DFA-based algorithms determine the common overlap of different expressions, especially those within an alternation. It might be worth checking out this page, which explains how alternatives can be combined into a single path: Thompson algorithm for alternation.
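As a loose illustration of the "single path" idea, the sketch below advances all alternatives in lockstep over the input, keeping a set of live states and remembering the longest position at which any alternative finished. It is deliberately restricted to literal alternatives (an assumption made to keep it short; a real Thompson construction handles full regular expressions), and longest_literal_match is a hypothetical name.

```python
def longest_literal_match(alternatives, text):
    """Longest prefix of text equal to one of the literal alternatives, or None."""
    # A state is (alternative index, number of characters matched so far).
    states = {(i, 0) for i in range(len(alternatives))}
    # An empty alternative matches the empty prefix immediately.
    longest = 0 if "" in alternatives else -1
    for pos, ch in enumerate(text):
        next_states = set()
        for i, k in states:
            alt = alternatives[i]
            if k < len(alt) and alt[k] == ch:
                next_states.add((i, k + 1))
                if k + 1 == len(alt):
                    # This alternative just completed; positions only grow,
                    # so the last completion seen is the longest one.
                    longest = pos + 1
        if not next_states:
            break  # every alternative has failed or finished
        states = next_states
    return None if longest < 0 else text[:longest]

print(longest_literal_match(["Set", "SetValue"], "SetValue = 1"))  # SetValue
print(longest_literal_match(["SetValue", "Set"], "SetValue = 1"))  # SetValue
```

Each input character is examined once and nothing is retried, which is why the order of the alternatives has no effect on the result.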

Note: at some point, you might just want to consider using an actual programming language. Regexes aren't everything.


answered Oct 11 '22 12:10

Laurel
