
Confusion about atomic grouping: how does it differ from ordinary grouping in Ruby regular expressions?

Tags: regex, ruby, ruby-1.9.3

I have gone through the docs for Atomic Grouping and rubyinfo, and some questions came to mind:

1. Why the name "atomic grouping"? What "atomicity" does it have that general grouping doesn't?
2. How does atomic grouping differ from general grouping?
3. Why are atomic groups called non-capturing groups?

I tried the code below to understand it, but I am confused by the output and by how differently the two patterns work on the same string.

irb(main):001:0> /a(?>bc|b)c/ =~ "abbcdabcc"
=> 5
irb(main):004:0> $~
=> #<MatchData "abcc">
irb(main):005:0> /a(bc|b)c/ =~ "abcdabcc"
=> 0
irb(main):006:0> $~
=> #<MatchData "abc" 1:"b">

asked Jan 19 '13 06:01

Arup Rakshit

2 Answers

Every ( ) construct, whether the plain (pattern) or a special form such as (?!pattern) or (?=pattern), has its own properties, but the one they all share is grouping: the enclosed pattern is treated as a single unit (unit is my own terminology), which is useful for repetition.

The normal capturing group (pattern) both captures and groups. Capturing means that the text matched by the pattern inside is recorded, so that you can refer to it with a back-reference, either during matching or in a replacement. The non-capturing group (?:pattern) doesn't have the capturing property, so it saves a little space and time compared to (pattern), because it doesn't store the start and end index of the text matched by the pattern inside.

Atomic grouping (?>pattern) is also non-capturing, so the position of the text matched inside is not recorded.

Atomic grouping adds the property of atomicity on top of what a capturing or non-capturing group does. Atomic here means: at the current position, find the first sequence (first as determined by how the engine matches the given pattern) that matches the pattern inside the atomic group and hold on to it (so backtracking into the group is disallowed).

A group without atomicity allows backtracking: it still finds the first sequence, but if the matching that follows fails, it backtracks and finds the next sequence, until a match for the entire regular expression is found or all possibilities are exhausted.
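
As a rough illustration of these properties, here is a small sketch that runs the two patterns from the question against one and the same input string (the variable names are only for illustration):

```ruby
input = "abcdabcc"

# Plain capturing group: "bc" matches first, but the following "c" fails
# against "d", so the engine backtracks into the group and tries "b".
plain = /a(bc|b)c/.match(input)
puts plain[0]          # abc
puts plain.begin(0)    # 0
p plain.captures       # ["b"]

# Atomic group: once "bc" has matched at index 0, the alternative "b" is
# discarded, so that attempt fails and the match is found further along.
atomic = /a(?>bc|b)c/.match(input)
puts atomic[0]         # abcc
puts atomic.begin(0)   # 4
p atomic.captures      # [] because the atomic group does not capture
```

The only difference between the two patterns is the `?>`, which is what moves the match from index 0 to index 4 and removes the capture.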

Example

Input string: bbabbbabbbbc
Pattern: /(?>.*)c/

The first match by .* is bbabbbabbbbc, the whole string, because the * quantifier is greedy. The atomic group holds on to this match, which leaves nothing for c to match. The matcher then retries at each later starting position up to the end of the string, and the same thing happens every time. So nothing matches the regex at all.
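
This can be checked with a minimal sketch; the non-atomic pattern `/.*c/` is added here only for comparison and is not part of the example above:

```ruby
input = "bbabbbabbbbc"

# Greedy .* inside an atomic group keeps everything it consumed,
# so the trailing "c" never gets a character to match.
atomic_result = /(?>.*)c/.match(input)
p atomic_result           # nil

# Without the atomic group, .* gives back one character and "c" matches.
greedy_result = /.*c/.match(input)
puts greedy_result[0]     # bbabbbabbbbc
```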

Input string: bbabbbabbbbc
Pattern: /((?>.*)|b*)[ac]/, for testing /(((?>.*))|(b*))[ac]/

There are 3 matches for this regex: bba, bbba, and bbbbc. If you use the second regex, which is the same but with capturing groups added for debugging purposes, you can see that all the matches are the result of matching b* inside.
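
A small sketch to reproduce this with `String#scan`. In the first call the outer group is written as non-capturing, a change made here only so that `scan` returns whole matches; the second call uses the debugging regex as given:

```ruby
input = "bbabbbabbbbc"

# Outer group made non-capturing so that scan returns the full matches.
matches = input.scan(/(?:(?>.*)|b*)[ac]/)
p matches        # ["bba", "bbba", "bbbbc"]

# Debugging form: with groups present, scan returns one array of captures
# per match, and the last capture in each array belongs to (b*).
b_star = input.scan(/(((?>.*))|(b*))[ac]/).map(&:last)
p b_star         # ["bb", "bbb", "bbbb"]
```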

You can see the backtracking behavior here.

Without the atomic grouping, /(.*|b*)[ac]/, the string has a single match, which is the whole string, because .* backtracks at the end to let [ac] match. Note that the engine goes back into .* and backtracks by 1 character, since .* still has other possibilities.

Pattern: /(.*|b*)[ac]/

bbabbbabbbbc
^             -- Start matching. Look at first item in alternation: .*
bbabbbabbbbc
            ^ -- First match of .*, due to greedy quantifier
bbabbbabbbbc
            X -- [ac] cannot match
              -- Backtrack to ()
bbabbbabbbbc
           ^  -- Continue explore other possibility with .*
              -- Step back 1 character
bbabbbabbbbc
            ^ -- [ac] matches, end of regex, a match is found

With the atomic grouping, all other possibilities for .* are cut off and it is limited to its first match. So after greedily eating the whole string and failing to match, the engine has to move on to the b* pattern, where it successfully finds a match for the regex.

Pattern: /((?>.*)|b*)[ac]/

bbabbbabbbbc
^             -- Start matching. Look at first item in alternation: (?>.*)
bbabbbabbbbc
            ^ -- First match of .*, due to greedy quantifier
              -- The atomic grouping will disallow .* to be backtracked and rematched
bbabbbabbbbc
            X -- [ac] cannot match
              -- Backtrack to ()
              -- (?>.*) is atomic, check the next possibility by alternation: b*
bbabbbabbbbc
^             -- Starting to rematch with b*
bbabbbabbbbc
  ^           -- First match with b*, due to greedy quantifier
bbabbbabbbbc
   ^          -- [ac] matches, end of regex, a match is found

The subsequent matches will continue on from here.
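
To compare the two traces side by side, a sketch along these lines should do. The `scan` calls use a non-capturing outer group, which is an adjustment made here so that whole matches are counted:

```ruby
input = "bbabbbabbbbc"

plain  = /(.*|b*)[ac]/.match(input)
atomic = /((?>.*)|b*)[ac]/.match(input)

puts plain[0]     # bbabbbabbbbc (.* gave back the final "c")
puts plain[1]     # bbabbbabbbb
puts atomic[0]    # bba (matched through the b* alternative)
puts atomic[1]    # bb

# Counting all matches shows the same difference.
plain_count  = input.scan(/(?:.*|b*)[ac]/).length
atomic_count = input.scan(/(?:(?>.*)|b*)[ac]/).length
puts plain_count     # 1
puts atomic_count    # 3
```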

answered Sep 18 '22 05:09

nhahtdh

I recently had to explain Atomic Groups to someone else and I thought I'd tweak and share the example here.

Consider /the (big|small|biggest) (cat|dog|bird)/

All four of these lines match:

the big dog
the small bird
the biggest dog
the small cat

For the first line, the regex engine finds "the " (including the trailing space). It then moves on to our adjectives (big, small, biggest) and finds big. Having matched big, it proceeds and finds the space. It then looks at our pets (cat, dog, bird): it tries cat, which doesn't match, skips it, and finds dog.

For the second line, the engine finds "the ". It looks at big, skips it, then looks at and finds small. It finds the space, skips cat and dog because they don't match, and finds bird.

For the third line, the engine finds "the ". It continues and finds big, which matches the immediate requirement, and proceeds. It can't find the space, so it backtracks (rewinds the position to the last choice it made). It skips big, skips small, and finds biggest, which also matches the immediate requirement. It then finds the space. It skips cat and matches dog.

For the fourth line, the engine finds "the ". It looks at big, skips it, then looks at and finds small. It then finds the space. It looks at and matches cat.
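
The walk-through above can be checked with a short sketch (the `lines` array simply holds the four sample lines):

```ruby
pattern = /the (big|small|biggest) (cat|dog|bird)/
lines = ["the big dog", "the small bird", "the biggest dog", "the small cat"]

lines.each do |line|
  m = pattern.match(line)
  # For "the biggest dog" the first capture is "biggest": the engine
  # tried "big" first, failed on the space, and backtracked.
  puts "#{line}: adjective=#{m[1]}, pet=#{m[2]}"
end
```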

Consider /the (?>big|small|biggest) (cat|dog|bird)/

Note the ?> atomic group around the adjectives.

Every line except the third matches:

the big dog
the small bird
the biggest dog
the small cat

For the first line, second line, and fourth line, we'll get the same result.

For the third line, the engine finds "the ". It continues and finds big, which matches the immediate requirement, and proceeds. It can't find the space, but the atomic group, being the last choice the engine made, won't allow that choice to be re-examined (it prohibits backtracking). Since the engine can't make a new choice and our simple expression has no other choices, the match has to fail.
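
Here is a minimal sketch of the atomic version against the same four lines. The `reordered` pattern at the end is not part of the example above; it is included only to show that the order of alternatives matters once the group is atomic:

```ruby
atomic = /the (?>big|small|biggest) (cat|dog|bird)/
lines = ["the big dog", "the small bird", "the biggest dog", "the small cat"]

results = lines.map { |line| [line, !atomic.match(line).nil?] }
results.each do |line, matched|
  puts "#{line}: #{matched ? 'match' : 'no match'}"
end
# Only "the biggest dog" prints "no match".

# Listing the longer alternative first lets the atomic group commit to
# "biggest" before "big" is ever tried.
reordered = /the (?>biggest|big|small) (cat|dog|bird)/
all_match = lines.all? { |line| reordered.match(line) }
p all_match   # true
```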

This is only a basic summary. An engine wouldn't need to look at the entirety of cat to know that it doesn't match dog; merely looking at the c is enough. When trying to match bird, the c in cat and the d in dog are enough to tell the engine to examine other options.

However, if you had ...((cat|snake)|dog|bird), the engine would also, of course, need to examine snake before it dropped back to the previous group and examined dog and bird.

There are also plenty of choices an engine can't decide without going past what may not seem like a match, which is what results in backtracking. If you have ((red)?cat|dog|bird), the engine will look at r, back out, notice the ? quantifier, ignore the subgroup (red), and look for a match.
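
Both nested variants can be tried on their own with a sketch like the following. The test inputs are made up for illustration, and only the group fragments shown above are used, without the part the "..." stands for:

```ruby
nested   = /((cat|snake)|dog|bird)/
optional = /((red)?cat|dog|bird)/

# The inner group only records a capture when its branch was used.
snake_caps = nested.match("snake").captures
bird_caps  = nested.match("bird").captures
p snake_caps    # ["snake", "snake"]
p bird_caps     # ["bird", nil]

# (red)? is skipped when the input does not start with "red".
redcat_caps = optional.match("redcat").captures
cat_caps    = optional.match("cat").captures
p redcat_caps   # ["redcat", "red"]
p cat_caps      # ["cat", nil]
```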

answered Sep 18 '22 05:09

Regular Jo
