EDIT: I selected ridgerunner's answer as it contained the information needed to solve the problem. But I also felt like adding a fully fleshed-out solution to the specific question in case someone else wants to fully understand the example too. You will find it somewhere below.
This question is about clarifying the behavior of php's regex engine for recursive expressions. (If you ideas for how to properly match the strings below without using recursive php regex, that's very cool, but that's not the question.)
a(?:(?R)|a?)a
This is a simple expression that aims to match the character "a" or nothing, nested in one or multiple nests of the character "a". For instance, aa, aaa, aaaa, aaaaa. You don't need to use recursion for this:
aa*a
would work great. But the point is to use recursion.
Here is a piece of code you can run to test my failing pattern:
<?php $tries=array('a','aa','aaa','aaaa','aaaaa','aaaaaa'); $regex='#a(?:(?R)|a?)a#'; foreach ($tries as $try) { echo $try." : "; if (preg_match($regex,$try,$hit)) echo $hit[0]."<br />"; else echo 'no match<br />'; } ?>
In the pattern, two "a"s are framing an alternation. In the alternation, we either match a recursion of the whole pattern (two "a"s framing an alternation), or the character "a", optionally empty.
In my mind, for "aaaa", this should match "aaaa".
But here is the output:
a : no match aa : aa aaa : aaa aaaa : aaa aaaaa : aaaaa aaaaaa : aaa
Can someone explain what is happening on the third and fifth lines of output? I have tried tracing the path that I imagine the engine must be taking, but I must be imagining it wrong. Why is the engine returning "aaa" as a match for "aaaa"? What makes it so eager? I must be imagining the matching tree in the wrong order.
I realise that
#(?:a|a(?R)a)*#
kind of works, but my question is why the other pattern doesn't.
Thanks heaps!
Excellent (and difficult) question!
First, with the PCRE regex engine, the (?R)
behaves like an atomic group (unlike Perl?). Once it matches (or doesn't match), the matching that happened inside the recursive call is final (and all backtracking breadcrumbs saved within the recursive call are discarded). However, the regex engine does save what was matched by the whole (?R)
expression, and can give it back and try the other alternative to achieve an overall match. To describe what is happening, lets change your example slightly so that it will be easier to talk about and keep track of what is being matched at each step. Instead of: aaaa
as the subject text, lets use: abcd
. And lets change the regex from '#a(?:(?R)|a?)a#'
to: '#.(?:(?R)|.?).#'
. The regex engine matching behavior is the same.
/.(?:(?R)|.?)./
to: "abcd"
answer = r''' Step Depth Regex Subject Comment 1 0 .(?:(?R)|.?). abcd Dot matches "a". Advance pointers. ^ ^ 2 0 .(?:(?R)|.?). abcd Try 1st alt. Recursive call (to depth 1). ^ ^ 3 1 .(?:(?R)|.?). abcd Dot matches "b". Advance pointers. ^ ^ 4 1 .(?:(?R)|.?). abcd Try 1st alt. Recursive call (to depth 2). ^ ^ 5 2 .(?:(?R)|.?). abcd Dot matches "c". Advance pointers. ^ ^ 6 2 .(?:(?R)|.?). abcd Try 1st alt. Recursive call (to depth 3). ^ ^ 7 3 .(?:(?R)|.?). abcd Dot matches "d". Advance pointers. ^ ^ 8 3 .(?:(?R)|.?). abcd Try 1st alt. Recursive call (to depth 4). ^ ^ 9 4 .(?:(?R)|.?). abcd Dot fails to match end of string. ^ ^ DEPTH 4 (?R) FAILS. Return to step 8 depth 3. Give back text consumed by depth 4 (?R) = "" 10 3 .(?:(?R)|.?). abcd Try 2nd alt. Optional dot matches EOS. ^ ^ Advance regex pointer. 11 3 .(?:(?R)|.?). abcd Required dot fails to match end of string. ^ ^ DEPTH 3 (?R) FAILS. Return to step 6 depth 2 Give back text consumed by depth3 (?R) = "d" 12 2 .(?:(?R)|.?). abcd Try 2nd alt. Optional dot matches "d". ^ ^ Advance pointers. 13 2 .(?:(?R)|.?). abcd Required dot fails to match end of string. ^ ^ Backtrack to step 12 depth 2 14 2 .(?:(?R)|.?). abcd Match zero "d" (give it back). ^ ^ Advance regex pointer. 15 2 .(?:(?R)|.?). abcd Dot matches "d". Advance pointers. ^ ^ DEPTH 2 (?R) SUCCEEDS. Return to step 4 depth 1 16 1 .(?:(?R)|.?). abcd Required dot fails to match end of string. ^ ^ Backtrack to try other alternative. Give back text consumed by depth 2 (?R) = "cd" 17 1 .(?:(?R)|.?). abcd Optional dot matches "c". Advance pointers. ^ ^ 18 1 .(?:(?R)|.?). abcd Required dot matches "d". Advance pointers. ^ ^ DEPTH 1 (?R) SUCCEEDS. Return to step 2 depth 0 19 0 .(?:(?R)|.?). abcd Required dot fails to match end of string. ^ ^ Backtrack to try other alternative. Give back text consumed by depth 1 (?R) = "bcd" 20 0 .(?:(?R)|.?). abcd Try 2nd alt. Optional dot matches "b". ^ ^ Advance pointers. 21 0 .(?:(?R)|.?). abcd Dot matches "c". Advance pointers. ^ ^ SUCCESSFUL MATCH of "abc" '''
There is nothing wrong with the regex engine. The correct match is abc
(or aaa
for the original question.) A similar (albeit much longer) sequence of steps can be made for the other longer result string in question.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With