Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What does it mean that there is faster failure with atomic grouping

Tags:

regex

pcre

NOTE :- The question is bit long as it includes a section from book.

I was reading about atomic groups from Mastering Regular Expression.

It is given that atomic groups leads to faster failure. Quoting that particular section from the book

Faster failures with atomic grouping. Consider ^\w+: applied to Subject. We can see, just by looking at it, that it will fail because the text doesn’t have a colon in it, but the regex engine won’t reach that conclusion until it actually goes through the motions of checking.

So, by the time : is first checked, the \w+ will have marched to the end of the string. This results in a lot of states — one skip me state for each match of \w by the plus (except the first, since plus requires one match). When then checked at the end of the string, : fails, so the regex engine backtracks to the most recently saved state:

enter image description here

at which point the : fails again, this time trying to match t. This backtrack-test fail cycle happens all the way back to the oldest state:

enter image description here

After the attempt from the final state fails, overall failure can finally be announced. All that backtracking is a lot of work that after just a glance we know to be unnecessary. If the colon can’t match after the last letter, it certainly can’t match one of the letters the + is forced to give up!

So, knowing that none of the states left by \w+, once it’s finished, could possibly lead to a match, we can save the regex engine the trouble of checking them: ^(?>\w+):. By adding the atomic grouping, we use our global knowledge of the regex to enhance the local working of \w+ by having its saved states (which we know to be useless) thrown away. If there is a match, the atomic grouping won’t have mattered, but if there’s not to be a match, having thrown away the useless states lets the regex come to that conclusion more quickly.


I tried these regex here. It took 4 steps for ^\w+: and 6 steps for ^(?>\w+): (with internal engine optimization disabled)


My Questions

  1. In the second paragraph from above section, it is mentioned that

So, by the time : is first checked, the \w+ will have marched to the end of the string. This results in a lot of states — one skip me state for each match of \w by the plus (except the first, since plus requires one match).When then checked at the end of the string, : fails, so the regex engine backtracks to the most recently saved state:

enter image description here

at which point the : fails again, this time trying to match t. This backtrack-test fail cycle happens all the way back to the oldest state:

enter image description here

but on this site, I see no backtracking. Why?

Is there some optimization going on inside(even after it is disabled)?

  1. Can the number of steps taken by a regex decide whether one regex is having good performance over other regex?
like image 795
rock321987 Avatar asked May 16 '16 10:05

rock321987


People also ask

What is the relationship between fail often and fail fast?

It closely links to Fail Fast, Fail Often because there are so many benefits that can help you deal with quickly expanding changes. It allows fast scaling to customer requirements by removing the bottleneck that occurs during peak times.

What is an atomic group in regex?

An atomic group is a group that, when the regex engine exits from it, automatically throws away all backtracking positions remembered by any tokens inside the group.

What is agile fail fast fail often?

Fail Fast, Fail Often is a highly significant area of the Agile philosophy. Let’s take a look at its individual components and how to utilize them: Fail Early | The belief is that if it is possible to learn from failure then the sooner the failure occurs, the sooner the learning begins.

Which regular expression flavors support atomic grouping?

Atomic grouping is supported by most modern regular expression flavors, including the JGsoft flavor, Java, PCRE, .NET, Perl, Boost, and Ruby. Most of these also support possessive quantifiers, which are essentially a notational convenience for atomic grouping. An example will make the behavior of atomic groups clear.


1 Answers

The debugger on that site seems to gloss over the details of backtracking. RegexBuddy does a better job. Here's what it shows for ^\w+:

Normal (greedy)

After \w+ consumes all the letters, it tries to match : and fails. Then it gives back one character, tries the : again, and fails again. And so on, until there's nothing left to give back. Fifteen steps total. Now look at the atomic version (^(?>\w+):):

Atomic

After failing to match the : the first time, it gives back all the letters at once, as if they were one character. A total of five steps, and two of those are entering and leaving the group. And using a possessive quantifier (^\w++:) eliminates even those:

Possessive

As for your second question, yes, the number-of-steps metric from regex debuggers is useful, especially if you're just learning regexes. Every regex flavor has at least a few optimizations that allow even badly written regexes to perform adequately, but a debugger (especially a flavor-neutral one like RegexBuddy's) makes it obvious when you're doing something wrong.

like image 137
Alan Moore Avatar answered Sep 25 '22 14:09

Alan Moore