Can anyone explain the process that a regular expression engine matches (aa)+\1
against aaaaaa
? I know there is a process called backtracking when you use +
or *
but I'm not sure how it works in this example.
When you put the quantifier outside the capture group, it doesn't capture the entire string matched with that pattern with quantifier. It rather matches just the last repetition that pattern matched.
So, (aa)+
won't capture aaaa
in capture group, but just the last pair - aa
, so that it can satisfy the rest of the regex pattern.
So, with (aa)+\1
, the pattern matches first - aaaa
, and then the backreference \1
matches the captured group - aa
. Thus is matches the string - aaaaaa
. Not (aa)+
won't match all the a's
, because then there will be nothing left over to be matched by \1
.
Here's the break up of the regex (aa)+\1
:
(aa)+
matches the first two aa
in the string. Remaining string - aaaa
.(aa)+
, so it goes on to match next aa
. Remaining string - aa
.(aa)+
can match the remaining string. So it matches the next aa
. Remaining string - ""
. Remember, the quantifiers by default act greedy. They will match as much as they can.(aa)+
can't match any further. \1
. But there is nothing left to match.(aa)+
. Remaining string - "aa"
.\1
again tries to match, and it successfully matches aa
, since that is what currently in the 1st capture group.References:
The +
quantifier means "1 or more". The \1
refers to the captured group, which is the same thing the quantifier is referring to. So effectively, it's saying "group aa, 1 or more times, and then one more time". Which is the same as "2 or more times".
So the regex might be clearer as this: /(aa){2,}/
Since aaaaaa
is three sets of the aa
group, the regex matches the string.
Scenario:
aa # the group is matched
aaaa # the group is repeated once, cause the + quantifier
aaaaaa # the group is repeated once again, always cause
# the + quantifier (and because it is greedy and take all it can.)
# But since all the characters are eaten, and there is \1
# the pattern will fail.
aaaa # the regex engine must backtrack to try another way because of \1
aaaaaa # you are arrived! (the 2 last "a" are for the \1
You can verify this behaviour using a possessive quantifier (++) that forbid backtracks:
(aa)++\1 # will never match
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With