I am experimenting with the named subpattern/'subroutine' regex features in PHP's PCRE and I'm hoping someone can explain the following strange output:
$re = "/
(?(DEFINE)
(?<a> a )
)
^(?&a)$
/x";
var_dump(preg_match($re, 'a', $match)); // (int) 1 as expected
var_dump($match); // Array( [0] => 'a' ) <-- Why?
I can't understand why the named group "a" is not in the result (with the contents "a"). Changing preg_match to preg_match_all puts "a" and "1" in the match data, but both contain only an empty string.
I really like the idea of writing regular expressions this way, as you can make them incredibly powerful whilst keeping them very maintainable (see this answer for a good example of this). However, if the subpatterns are not available in the match data, then it's not much use really.
Am I missing something here or should I just mourn what could have been and move on?
It makes perfect sense that these subpatterns do not capture a group: their main purpose is to be used more than once, so you can't really capture them all. In addition, if the default were to capture all subpatterns, you would have no way to avoid capturing a group where you don't want it, which is not the best default behavior. The opposite is trivial: you can capture by adding another group around the (?&a) call.
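As a minimal sketch of that wrapping approach, the pattern from the question can be adjusted like this. The group name "result" is a hypothetical name chosen for illustration; any name that is not already used in the pattern would do.

```php
<?php
// Sketch: wrap the subroutine call in its own named group so the text is captured.
// "result" is an illustrative name, not something PCRE requires.
$re = '/
    (?(DEFINE)
        (?<a> a )          # definition only, never matched directly
    )
    ^(?<result>(?&a))$     # the outer group captures what the call matched
/x';

preg_match($re, 'a', $match);
var_dump($match['result']); // the text matched by the call
var_dump($match['a']);      // still empty: the defining group itself is not set
```

The defining group "a" still shows up in the match array because it is numbered before "result", but it only holds an empty string. The wrapper group is the one to read.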
I couldn't find a reference to this on PCRE.org. The closest is this, which is relevant because you don't match (?<a>...) directly (though you might expect an empty group):
Any capturing parentheses that are set during the subroutine call revert to their previous values afterwards.
It is clearer in the Perl manual:
An example of how this might be used is as follows:
/(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
 (?(DEFINE)
   (?<NAME_PAT>....)
   (?<ADDRESS_PAT>....)
 )/x
Note that capture buffers matched inside of recursion are not accessible after the recursion returns, so the extra layer of capturing buffers is necessary.
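To make the Perl manual's layout concrete in PHP, here is a hedged sketch of the same structure. The manual leaves the bodies of NAME_PAT and ADDRESS_PAT elided, so the two subpatterns below, the comma separator, and the sample subject string are assumptions made up purely for illustration.

```php
<?php
// Sketch of the "extra layer of capturing groups" idea from the Perl manual.
// The bodies of NAME_PAT and ADDRESS_PAT are invented for this example.
$re = '/
    (?(DEFINE)
        (?<NAME_PAT>    [A-Z][a-z]+ )         # assumed: one capitalised word
        (?<ADDRESS_PAT> \d+ (?: \s \w+ )+ )   # assumed: a number, then one or more words
    )
    ^ (?<NAME>(?&NAME_PAT)) , \s (?<ADDR>(?&ADDRESS_PAT)) $
/x';

$subject = 'Alice, 12 Main Street'; // hypothetical input

if (preg_match($re, $subject, $m)) {
    var_dump($m['NAME']);     // text captured by the wrapper around NAME_PAT
    var_dump($m['ADDR']);     // text captured by the wrapper around ADDRESS_PAT
    var_dump($m['NAME_PAT']); // empty: captures set inside the call are reverted
}
```

The DEFINE block keeps the reusable pieces in one place, while the wrapper groups NAME and ADDR decide which calls should actually appear in the match data.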