Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does non-backtracking subexpression work "(?>exp)"

Tags:

c#

.net

regex

I am trying to become better at regular expressions. I am having a hard time trying to understand what does (?> expression ) means. Where can I find more info on non-backtacking subexpressoins? The description of THIS link says:

Greedy subexpression, also known as a non-backtracking subexpression. This is matched only once and then does not participate in backtracking.

this other link: http://msdn.microsoft.com/en-us/library/bs2twtah(v=vs.71).aspx has also a definition of non-backtracking subexpression but I still am having a hard time understanding what it means plus I cannot think of an example where I will use (?>exp)

like image 308
Tono Nam Avatar asked Jul 24 '12 13:07

Tono Nam


3 Answers

As always, regular-expressions.info is a good place to start.

Use an atomic group if you want to make sure that whatever has once been matched will stay part of the match.

For example, to match a number of "words" that may or may not be separated by spaces, then followed by a colon, a user tried the regex:

(?:[A-Za-z0-9_.&,-]+\s*)+:

When there was a match, everything was fine. But when there wasn't, his PC would become non-responsive with 100% CPU load because of catastrophic backtracking because the regex engine would vainly try to find a matching combination of words that would allow the following colon to match. Which was of course impossible.

By using an atomic group, this could have been prevented:

(?>[A-Za-z0-9_.&,-]+\s*)+:

Now whatever has been matched stays matched - no backtracking and therefore fast failing times.

Another good example I recently came across:

If you want to match all numbers that are not followed by an ASCII letter, you might want to use the regex \d+(?![A-Za-z]). However, this will fail with inputs like 123a because the regex engine will happily return the match 12 by backtracking until the following character is no longer a letter. If you use (?>\d+)(?![A-Za-z]), this won't happen. (Of course, \d+(?![\dA-Za-z]) would also work)

like image 180
Tim Pietzcker Avatar answered Oct 04 '22 15:10

Tim Pietzcker


The Regex Tutorial has a page on it here: http://www.regular-expressions.info/atomic.html

Basically what it does is discards backtracking information, meaning that a(?>bc|b)c matches abcc but not abc.

The reason it doesn't match the second string is because it finds a match with bc, and discards backtracking information about the bc|b alternation. It essentially forgets the |b part of it. Therefore, there is no c after the bc, and the match fails.

The most useful method of using atomic groups, as they are called, is to optimize slow regexes. You can find more information on the aforementioned page.

like image 28
Kendall Frey Avatar answered Oct 04 '22 17:10

Kendall Frey


Read up on possessive quantifiers [a-z]*+ make the backtracking engine remember only the previous step that matched not all of the previous steps that matched.

This is useful when a lot of acceptable steps are probable and they will eat up memory if each step is stored for any possible backtracking regression.

Possessive quantifiers are a shorthand for atomic groups.

like image 25
Mihai Stancu Avatar answered Oct 04 '22 15:10

Mihai Stancu