Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are regex atomic groups distributive?

Tags:

regex

Are regex atomic groups distributive?

I.e. is (?>A?B?) always equivalent to (?>A?)(?>B?)?

If not please provide a counter example.

like image 436
o17t H1H' S'k Avatar asked May 30 '12 20:05

o17t H1H' S'k


2 Answers

Atomic groups in general

  1. The atomic group (?>regex1|regex2|regex3) takes only the first successful match within it. In other words, it doesn't allow backtracking.

  2. Regexes are evaluated left-to-right, so you express the order you intend things to match. The engine starts at the first position, trying to make a successful match, backtracking if necessary. If any path through the expression would lead to a successful match, then it will match at that position.

  3. Atomic groups are not distributive. Consider these patterns evaluated over ABC: (?>(AB?))(?>(BC)) (no match) and (?>(AB?)(BC)) (matches ABC).

Atomic Groups with all optional components

But, your scenario where both parts are optional may be different.

Considering an atomic group with 2 greedy optional parts A and B ((A)? and (B)?). At any position, if A matches, it can move on to evaluate the optional B. Otherwise, if A doesn't match, that's fine, too because it's optional. Therefore, (A)? matches at any position. The same logic applies for the optional B. The question remaining is whether there can be any difference in backtracking.

In the case of all optional parts ((?>A?B?)), since each part always matches, there's no reason to backtrack within the atomic group, so it will always match. Then, since it is in an atomic group, it is prohibited from backtracking.

In the case of separate atomic groups ((?>A?)(?>B?)), each part always matches, and the engine is prohibited from backtracking in either case. This means the results will be the same.

To reiterate, the engine can only use the first possible match in (?>A?)(?>B?), which will always be the same match as the first possible match in (?>A?B?). Thus, if my reasoning is correct,for this special case, the matches will be the same for multiple optional atomic groups as a single atomic group with both optional components.

like image 178
agent-j Avatar answered Nov 20 '22 22:11

agent-j


Since you didn't specify, I'll assume you're referring to Perl regexes, since I haven't seen the (?>) grouping operator in any other language.

Consider the following:

ra = 'A?'
rb = 'B?'

/(?>${ra} ${rb})/x is the same as/(?>${ra})(?>${rb})/x.

In this case, yes, it works either way; however, because (?>) disables backtracking, this is not the case with some other values of ra and rb.

For example, given:

ra = 'A*'
rb = 'AB*'

/(?>${ra} ${rb})/x != /(?>${ra})(?>${rb})/x.

In the latter, rb could never match, since ra would consume an entire sequence of A's, and would not allow backtracking. Note that this would work if we used (?:) as the grouping operator. Note also, that if we used capture groups (), then the match would be the same, but the side effects (assignment to \1, \2, ...) would be different.

like image 41
jpaugh Avatar answered Nov 20 '22 21:11

jpaugh