Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does the regular expression "(aa)+\1" match "aaaaaa"?

Can anyone explain the process that a regular expression engine matches (aa)+\1 against aaaaaa? I know there is a process called backtracking when you use + or * but I'm not sure how it works in this example.

like image 734
bodacydo Avatar asked Aug 24 '13 14:08

bodacydo


3 Answers

When you put the quantifier outside the capture group, it doesn't capture the entire string matched with that pattern with quantifier. It rather matches just the last repetition that pattern matched.

So, (aa)+ won't capture aaaa in capture group, but just the last pair - aa, so that it can satisfy the rest of the regex pattern.

So, with (aa)+\1, the pattern matches first - aaaa, and then the backreference \1 matches the captured group - aa. Thus is matches the string - aaaaaa. Not (aa)+ won't match all the a's, because then there will be nothing left over to be matched by \1.

Here's the break up of the regex (aa)+\1:

  • (aa)+ matches the first two aa in the string. Remaining string - aaaa.
  • There is more to be matched by (aa)+, so it goes on to match next aa. Remaining string - aa.
  • Again (aa)+ can match the remaining string. So it matches the next aa. Remaining string - "". Remember, the quantifiers by default act greedy. They will match as much as they can.
  • Now, (aa)+ can't match any further.
  • Next in the pattern is \1. But there is nothing left to match.
  • Backtrack the last pattern matched by (aa)+. Remaining string - "aa".
  • Now \1 again tries to match, and it successfully matches aa, since that is what currently in the 1st capture group.

References:

  • Regular-Expressions.info - Catastrophic Backtracking
like image 75
Rohit Jain Avatar answered Nov 10 '22 15:11

Rohit Jain


The + quantifier means "1 or more". The \1 refers to the captured group, which is the same thing the quantifier is referring to. So effectively, it's saying "group aa, 1 or more times, and then one more time". Which is the same as "2 or more times".

So the regex might be clearer as this: /(aa){2,}/

Since aaaaaa is three sets of the aa group, the regex matches the string.

like image 21
Niet the Dark Absol Avatar answered Nov 10 '22 15:11

Niet the Dark Absol


Scenario:

aa           # the group is matched
aaaa         # the group is repeated once, cause the + quantifier
aaaaaa       # the group is repeated once again, always cause 
             # the + quantifier (and because it is greedy and take all it can.)
             # But since all the characters are eaten, and there is \1
             # the pattern will fail.
aaaa         # the regex engine must backtrack to try another way because of \1
aaaaaa       # you are arrived! (the 2 last "a" are for the \1

You can verify this behaviour using a possessive quantifier (++) that forbid backtracks:

(aa)++\1            # will never match
like image 30
Casimir et Hippolyte Avatar answered Nov 10 '22 16:11

Casimir et Hippolyte