NOTE :- The question is bit long as it includes a section from book.
I was reading about atomic groups
from Mastering Regular Expression.
It is given that atomic groups
leads to faster failure. Quoting that particular section from the book
Faster failures with atomic grouping. Consider
^\w+:
applied toSubject
. We can see, just by looking at it, that it will fail because the text doesn’t have a colon in it, but the regex engine won’t reach that conclusion until it actually goes through the motions of checking.So, by the time
:
is first checked, the\w+
will have marched to the end of the string. This results in a lot of states — oneskip me
state for each match of\w
by the plus (except the first, since plus requires one match). When then checked at the end of the string,:
fails, so the regex engine backtracks to the most recently saved state:
at which point the
:
fails again, this time trying to matcht
. This backtrack-test fail cycle happens all the way back to the oldest state:After the attempt from the final state fails, overall failure can finally be announced. All that backtracking is a lot of work that after just a glance we know to be unnecessary. If the colon can’t match after the last letter, it certainly can’t match one of the letters the
+
is forced to give up!So, knowing that none of the states left by
\w+
, once it’s finished, could possibly lead to a match, we can save the regex engine the trouble of checking them:^(?>\w+):
. By adding the atomic grouping, we use our global knowledge of the regex to enhance the local working of\w+
by having its saved states (which we know to be useless) thrown away. If there is a match, the atomic grouping won’t have mattered, but if there’s not to be a match, having thrown away the useless states lets the regex come to that conclusion more quickly.
I tried these regex here. It took 4 steps for ^\w+:
and 6 steps for ^(?>\w+):
(with internal engine optimization disabled)
My Questions
So, by the time
:
is first checked, the\w+
will have marched to the end of the string. This results in a lot of states — one skip me state for each match of\w
by the plus (except the first, since plus requires one match).When then checked at the end of the string,:
fails, so the regex engine backtracks to the most recently saved state:
at which point the
:
fails again, this time trying to matcht
. This backtrack-test fail cycle happens all the way back to the oldest state:
but on this site, I see no backtracking. Why?
Is there some optimization going on inside(even after it is disabled)?
It closely links to Fail Fast, Fail Often because there are so many benefits that can help you deal with quickly expanding changes. It allows fast scaling to customer requirements by removing the bottleneck that occurs during peak times.
An atomic group is a group that, when the regex engine exits from it, automatically throws away all backtracking positions remembered by any tokens inside the group.
Fail Fast, Fail Often is a highly significant area of the Agile philosophy. Let’s take a look at its individual components and how to utilize them: Fail Early | The belief is that if it is possible to learn from failure then the sooner the failure occurs, the sooner the learning begins.
Atomic grouping is supported by most modern regular expression flavors, including the JGsoft flavor, Java, PCRE, .NET, Perl, Boost, and Ruby. Most of these also support possessive quantifiers, which are essentially a notational convenience for atomic grouping. An example will make the behavior of atomic groups clear.
The debugger on that site seems to gloss over the details of backtracking. RegexBuddy does a better job. Here's what it shows for ^\w+:
After \w+
consumes all the letters, it tries to match :
and fails. Then it gives back one character, tries the :
again, and fails again. And so on, until there's nothing left to give back. Fifteen steps total. Now look at the atomic version (^(?>\w+):
):
After failing to match the :
the first time, it gives back all the letters at once, as if they were one character. A total of five steps, and two of those are entering and leaving the group. And using a possessive quantifier (^\w++:
) eliminates even those:
As for your second question, yes, the number-of-steps metric from regex debuggers is useful, especially if you're just learning regexes. Every regex flavor has at least a few optimizations that allow even badly written regexes to perform adequately, but a debugger (especially a flavor-neutral one like RegexBuddy's) makes it obvious when you're doing something wrong.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With