This is a followup to this question.
Have a look at this pattern:
(o(?1)?o)
It matches any sequence of o
with a length of 2n, with n ≥ 1.
It works, see regex101.com (word boundaries added for better demonstration).
The question is: Why?
In the following, the description of a string (match or not) will simply be a bolded number or a bolded term that describes the length, like 2n.
Broken down (with added whitespaces):
( o (?1)? o )
( ) # Capture group 1
o o # Matches an o each at the start and the end of the group
# -> the pattern matches from the outside to the inside.
(?1)? # Again the regex of group 1, or nothing.
# -> Again one 'o' at the start and one at the end. Or nothing.
I don't understand why this doesn't match 2n, but 2n, because I would describe the pattern as *an undefined number of o o
, stacked into each other.
Visualization:
No recursion, 2 is a match:
oo
One recursion, 4 is a match:
o o
oo
So far, so easy.
Two recursions. Obviously wrong because the pattern does not match 6:
o o
o o
oo
But why? It seems to fit the pattern.
I conclude that it's not simply the plain pattern that is repeated because otherwise 6 would have to match.
But according to regular-expressions.info:
(?P<name>[abc])(?1)(?P>name)
matches three letters like(?P<name>[abc])[abc][abc]
does.
and
[abc])(?1){3}
[...] is equivalent to([abc])[abc]{3}
So it does seem to simply rematch the regex code without an information about the previous match of the capture group.
Can someone explain and maybe visualize why this pattern matches 2n and nothing else?
Edit:
It was mentioned in the comments:
I doubt that referencing a capture group inside of itself is actually a supported case.
regular-expressions.info does mention the technique:
If you place a call inside the group that it calls, you'll have a recursive capturing group.
A regex pattern matches a target string. The pattern is composed of a sequence of atoms. An atom is a single point within the regex pattern which it tries to match to the target string. The simplest atom is a literal, but grouping parts of the pattern to match an atom will require using ( ) as metacharacters.
Recursion is mostly used to match balanced constructs. The regex \([^()]*+(?:(? R)[^()]*+)*+\) matches a pair of parentheses with all parentheses between them correctly nested, regardless of how many nested pairs there are or how deeply they are nested. This regex satisfies both requirements for valid recursion.
Basically (0+1)* mathes any sequence of ones and zeroes. So, in your example (0+1)*1(0+1)* should match any sequence that has 1. It would not match 000 , but it would match 010 , 1 , 111 etc. (0+1) means 0 OR 1. 1* means any number of ones.
A Regular Expression can be recursively defined as follows − ε is a Regular Expression indicates the language containing an empty string. ( L (ε) = {ε}) φ is a Regular Expression denoting an empty language. (
You understand recursion correctly. Word boundaries baffle you here. The \b
around the pattern require the regex engine to only match the string if it is not preceded and followed with word chars.
See how the recursion goes here:
( o (?1)? o ) => oo
(?1)
is then replaced with (o(?1)?o)
:
( o (?>o(?1)?o)? o ) => oo or oooo
Then again:
(o (?>o(?>o(?1)?o)?o)? o) => oo, oooo, oooooo
See the regex demo without word boundaries.
Why adding (?>...)
in the example above? Each recursion level in PHP recursive regexes is atomic, unlike Perl, and once a preceding level fails, the engine does not go back to the following one.
When you add word boundaries, the first o
and last o
matched cannot have any other word chars before/after. So, ooo
won't match then.
See Recursive Regular Expressions explained step by step and Word Boundary: \b
at rexegg.com, too.
Why does
oooooo
not get matched as a whole but asoooo
andoo
?
Again, each recursion level is atomic. oooooo
is matched like this:
(o(?1)?o)
matches the first o
(?1)?
gets expanded and the pattern is now (o(?>o(?1)?o)?o)
and it matches the second o
in the input(o(?>o(?>o(?>o(?>o(?>o(?>o(?1)?o)?o)?o)?o)?o)?o)?o)
that does not match the input any longer, backtracking happens, we go to the 6th level, o
so
s.See the regex debugger:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With