
How are nested capturing groups numbered in regular expressions?

Is there a defined behavior for how regular expressions should handle the capturing behavior of nested parentheses? More specifically, can you reasonably expect that different engines will capture the outer parentheses in the first position, and nested parentheses in subsequent positions?

Consider the following PHP code (using PCRE regular expressions):

    <?php
      $test_string = 'I want to test sub patterns';
      preg_match('{(I (want) (to) test) sub (patterns)}', $test_string, $matches);
      print_r($matches);
    ?>

    Array
    (
        [0] => I want to test sub patterns  //entire pattern
        [1] => I want to test               //entire outer parenthesis
        [2] => want                         //first inner
        [3] => to                           //second inner
        [4] => patterns                     //next parentheses set
    )

The entire parenthesized expression is captured first (I want to test), and then the inner parenthesized patterns are captured next ("want" and "to"). This makes logical sense, but I could see an equally logical case being made for first capturing the sub parentheses, and THEN capturing the entire pattern.

So, is this "capture the entire thing first" defined behavior in regular expression engines, or is it going to depend on the context of the pattern and/or the behavior of the engine (PCRE being different than C#'s being different than Java's being different than etc.)?

Alan Storm asked Aug 21 '09 19:08





1 Answer

From perlrequick:

If the groupings in a regex are nested, $1 gets the group with the leftmost opening parenthesis, $2 the next opening parenthesis, etc.

Caveat: opening parentheses that belong to non-capturing constructs, such as (?:...) groups and (?=...) lookaheads, are not counted.
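As a rough cross-engine check, here is a minimal sketch using Python's `re` module (a different engine from Perl and PCRE, chosen here only for illustration) on the pattern from the question. The variable names are made up for this example. It shows the "count opening parentheses from the left" rule, and how non-capturing constructs drop out of the count.

```python
import re

text = "I want to test sub patterns"

# Same pattern as in the question: groups are numbered by the position
# of their opening parenthesis, counted left to right.
m = re.search(r"(I (want) (to) test) sub (patterns)", text)
groups = m.groups()

# Making the outer group non-capturing with (?:...) removes it from the
# numbering, so the inner groups shift down by one.
m_nc = re.search(r"(?:I (want) (to) test) sub (patterns)", text)
groups_nc = m_nc.groups()

# A lookahead (?=...) also opens a parenthesis, but it is not counted.
m_la = re.search(r"(?=I)(I (want))", text)
groups_la = m_la.groups()

print(groups)     # outer group first, then the inner ones
print(groups_nc)  # only the three capturing groups remain
print(groups_la)  # the lookahead does not take a number
```

This only demonstrates one more engine agreeing with Perl; it is not proof that every engine behaves the same way.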

Update

I don't use PCRE much, as I generally use the real thing ;), but PCRE's docs show the same as Perl's:

SUBPATTERNS

2. It sets up the subpattern as a capturing subpattern. This means that, when the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the ovector argument of pcre_exec(). Opening parentheses are counted from left to right (starting from 1) to obtain numbers for the capturing subpatterns.

For example, if the string "the red king" is matched against the pattern

the ((red|white) (king|queen)) 

the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively.
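The PCRE documentation example can be reproduced in other engines as well. The following is a small sketch, assuming Python's `re` module as a stand-in (it is not PCRE itself), that lists each numbered group for the same pattern and subject string.

```python
import re

# Pattern and subject string taken from the PCRE documentation example.
pattern = re.compile(r"the ((red|white) (king|queen))")
match = pattern.search("the red king")

# group(0) is the whole match; groups 1..pattern.groups follow the order
# in which their opening parentheses appear in the pattern.
numbered = {n: match.group(n) for n in range(1, pattern.groups + 1)}

print(match.group(0))
print(numbered)
```

Here `pattern.groups` is the number of capturing groups in the compiled pattern, so the loop covers exactly the groups the documentation numbers 1, 2, and 3.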

If PCRE is drifting away from Perl regex compatibility, perhaps the acronym should be redefined--"Perl Cognate Regular Expressions", "Perl Comparable Regular Expressions" or something. Or just divest the letters of meaning.

daotoad answered Sep 24 '22 08:09
