How are nested capturing groups numbered in regular expressions?

Tags: java, language-agnostic, .net, regex, perl

Is there a defined behavior for how regular expressions should handle the capturing behavior of nested parentheses? More specifically, can you reasonably expect that different engines will capture the outer parentheses in the first position, and nested parentheses in subsequent positions?

Consider the following PHP code (using PCRE regular expressions)

<?php
$test_string = 'I want to test sub patterns';
preg_match('{(I (want) (to) test) sub (patterns)}', $test_string, $matches);
print_r($matches);
?>

Array
(
    [0] => I want to test sub patterns  //entire pattern
    [1] => I want to test               //entire outer parenthesis
    [2] => want                         //first inner
    [3] => to                           //second inner
    [4] => patterns                     //next parentheses set
)

The entire parenthesized expression is captured first (I want to test), and then the inner parenthesized patterns are captured next ("want" and "to"). This makes logical sense, but I could see an equally logical case being made for first capturing the sub parentheses, and THEN capturing the entire pattern.

So, is this "capture the entire thing first" defined behavior in regular expression engines, or is it going to depend on the context of the pattern and/or the behavior of the engine (PCRE being different than C#'s being different than Java's being different than etc.)?

871

asked Aug 21 '09 19:08

Alan Storm

1 Answer

From perlrequick

If the groupings in a regex are nested, $1 gets the group with the leftmost opening parenthesis, $2 the next opening parenthesis, etc.
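As a quick cross-engine check, here is a minimal sketch that runs the question's pattern through Python's re module instead of PCRE. The choice of Python and the variable names are assumptions made for illustration; the point is only that a second engine numbers groups by the position of the opening parenthesis as well.

```python
import re

# Same pattern and subject as the PHP example in the question.
pattern = r'(I (want) (to) test) sub (patterns)'
subject = 'I want to test sub patterns'

match = re.match(pattern, subject)

# Group 0 is the whole match; groups 1..N follow the order of the
# opening parentheses, read left to right in the pattern.
captures = [match.group(i) for i in range(match.re.groups + 1)]
for number, text in enumerate(captures):
    print(number, repr(text))
```

The outer group comes first because its opening parenthesis is the leftmost one, not because it is the largest.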

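Caveat: opening parentheses that do not start a capturing group are not counted. This covers non-capturing groups, (?:...), and lookaround assertions such as (?=...).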
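The caveat can be seen in a small sketch, again using Python's re module as a stand-in engine (an assumption for illustration; the pattern below is made up and does not come from the question). A non-capturing group and a lookahead both open with a parenthesis, yet neither takes a group number.

```python
import re

subject = 'I want to test'

# (?:...) groups without capturing; (?=...) is a zero-width lookahead.
# Neither opening parenthesis is counted when the groups are numbered,
# so (want) is group 1 and (to) is group 2.
match = re.match(r'(?:I (want)) (?=to)(to)', subject)

numbered = match.groups()
print(numbered)
```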

Update

I don't use PCRE much, as I generally use the real thing ;), but PCRE's docs show the same as Perl's:

SUBPATTERNS

2. It sets up the subpattern as a capturing subpattern. This means that, when the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the ovector argument of pcre_exec(). Opening parentheses are counted from left to right (starting from 1) to obtain numbers for the capturing subpatterns.

For example, if the string "the red king" is matched against the pattern
the ((red|white) (king|queen)) 
the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively.
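The PCRE documentation's example can be tried in another engine as a sanity check. The sketch below assumes Python's re module (not PCRE itself) and records each group's text together with its offsets in the subject, which makes the nesting visible: group 1 spans the text of groups 2 and 3.

```python
import re

regex = re.compile(r'the ((red|white) (king|queen))')
match = regex.match('the red king')

# Map each group number to (captured text, (start, end)) in the subject.
spans = {
    n: (match.group(n), match.span(n))
    for n in range(1, regex.groups + 1)
}
for n, (text, span) in spans.items():
    print(n, repr(text), span)
```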

If PCRE is drifting away from Perl regex compatibility, perhaps the acronym should be redefined--"Perl Cognate Regular Expressions", "Perl Comparable Regular Expressions" or something. Or just divest the letters of meaning.

117

answered Sep 24 '22 08:09

daotoad
