What does it mean that there is faster failure with atomic grouping?

Tags:

regex

pcre

Note: The question is a bit long because it includes a section from the book.

I was reading about atomic groups in Mastering Regular Expressions.

The book says that atomic groups lead to faster failure. Quoting the relevant section:

Faster failures with atomic grouping. Consider ^\w+: applied to Subject. We can see, just by looking at it, that it will fail because the text doesn’t have a colon in it, but the regex engine won’t reach that conclusion until it actually goes through the motions of checking.

So, by the time : is first checked, the \w+ will have marched to the end of the string. This results in a lot of states — one skip me state for each match of \w by the plus (except the first, since plus requires one match). When then checked at the end of the string, : fails, so the regex engine backtracks to the most recently saved state:

at which point the : fails again, this time trying to match t. This backtrack-test fail cycle happens all the way back to the oldest state:

After the attempt from the final state fails, overall failure can finally be announced. All that backtracking is a lot of work that after just a glance we know to be unnecessary. If the colon can’t match after the last letter, it certainly can’t match one of the letters the + is forced to give up!

So, knowing that none of the states left by \w+, once it’s finished, could possibly lead to a match, we can save the regex engine the trouble of checking them: ^(?>\w+):. By adding the atomic grouping, we use our global knowledge of the regex to enhance the local working of \w+ by having its saved states (which we know to be useless) thrown away. If there is a match, the atomic grouping won’t have mattered, but if there’s not to be a match, having thrown away the useless states lets the regex come to that conclusion more quickly.
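The claim that the atomic group does not change the outcome for this particular regex can be checked with a short Python sketch. It assumes only the standard re module. Because (?>...) is accepted by Python only from version 3.11 on, the sketch uses a common stand-in, (?=(\w+))\1, which is an assumption of this example and not something from the book: a lookahead never gives characters back once it has matched, so capturing inside it and consuming the capture behaves like an atomic group. The helper name matched_text and the extra ^\w+t pattern are illustrative only.

```python
import re

# Plain greedy form from the quoted passage.
plain = re.compile(r"^\w+:")

# Stand-in for ^(?>\w+): (assumption: emulation via lookahead + backreference).
atomic_like = re.compile(r"^(?=(\w+))\1:")

# Contrast case where the saved states ARE needed: \w+ must give back the "t".
plain_t = re.compile(r"^\w+t")
atomic_t = re.compile(r"^(?=(\w+))\1t")


def matched_text(pattern, subject):
    """Return the matched text, or None when the pattern fails."""
    m = pattern.search(subject)
    return m.group(0) if m else None


# Same outcome with and without atomic behaviour for the colon pattern.
print(matched_text(plain, "Subject"))               # None
print(matched_text(atomic_like, "Subject"))         # None
print(matched_text(plain, "Subject: hello"))        # Subject:
print(matched_text(atomic_like, "Subject: hello"))  # Subject:

# Different outcome when backtracking into \w+ is required.
print(matched_text(plain_t, "Subject"))   # Subject
print(matched_text(atomic_t, "Subject"))  # None
```

The last two lines show why the book speaks of "global knowledge": throwing away the saved states is only safe when we know none of them could lead to a match, as is the case with the colon.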

I tried both regexes in an online regex debugger. It took 4 steps for ^\w+: and 6 steps for ^(?>\w+): (with internal engine optimizations disabled).

My Questions

In the second paragraph of the section above, the book says:

So, by the time : is first checked, the \w+ will have marched to the end of the string. This results in a lot of states — one skip me state for each match of \w by the plus (except the first, since plus requires one match).When then checked at the end of the string, : fails, so the regex engine backtracks to the most recently saved state:

at which point the : fails again, this time trying to match t. This backtrack-test fail cycle happens all the way back to the oldest state:

But on this site I see no backtracking. Why?

Is there some optimization going on inside (even after it is disabled)?

Can the number of steps a regex takes tell us whether it performs better than another regex?

795

asked May 16 '16 10:05

rock321987

1 Answer

The debugger on that site seems to gloss over the details of backtracking. RegexBuddy does a better job. Here's what it shows for ^\w+:

[Screenshot: RegexBuddy debugger trace, normal (greedy) quantifier]

After \w+ consumes all the letters, it tries to match : and fails. Then it gives back one character, tries the : again, and fails again. And so on, until there's nothing left to give back. Fifteen steps total. Now look at the atomic version (^(?>\w+):):

[Screenshot: RegexBuddy debugger trace, atomic group]

After failing to match the : the first time, it gives back all the letters at once, as if they were one character. A total of five steps, and two of those are entering and leaving the group. And using a possessive quantifier (^\w++:) eliminates even those:

[Screenshot: RegexBuddy debugger trace, possessive quantifier]
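The difference between the greedy and the atomic behaviour can be sketched with a toy simulation in Python. This is not how RegexBuddy or PCRE count steps: it only counts how many times the colon is tested, so its numbers (7 and 1 for "Subject") will not equal the fifteen and five steps mentioned above. The function names and the approximation of \w are assumptions made for illustration.

```python
def is_word_char(ch):
    # Rough approximation of \w: letters, digits, underscore.
    return ch.isalnum() or ch == "_"


def simulate(subject, atomic):
    """Toy model of ^\\w+: (or its atomic form) tried at position 0.

    Returns (matched, colon_tests).
    """
    # Greedy phase: the plus consumes as many word characters as it can.
    end = 0
    while end < len(subject) and is_word_char(subject[end]):
        end += 1
    if end == 0:
        return False, 0  # the plus requires at least one character

    # Saved states: every shorter length the plus could fall back to,
    # longest first, never fewer than one character.
    # An atomic group throws these fallback states away when it exits.
    positions = [end] if atomic else list(range(end, 0, -1))

    colon_tests = 0
    for pos in positions:
        colon_tests += 1
        if pos < len(subject) and subject[pos] == ":":
            return True, colon_tests
    return False, colon_tests


print(simulate("Subject", atomic=False))     # (False, 7)
print(simulate("Subject", atomic=True))      # (False, 1)
print(simulate("Subject: hi", atomic=False)) # (True, 1)
print(simulate("Subject: hi", atomic=True))  # (True, 1)
```

In this model the greedy version tests the colon once per saved state before giving up, while the atomic version tests it once; when the colon is present, both succeed on the first test.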

As for your second question, yes, the number-of-steps metric from regex debuggers is useful, especially if you're just learning regexes. Every regex flavor has at least a few optimizations that allow even badly written regexes to perform adequately, but a debugger (especially a flavor-neutral one like RegexBuddy's) makes it obvious when you're doing something wrong.

137

answered Sep 25 '22 14:09

Alan Moore
