Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why will this recursive regex only match when a character repeats 2^n - 1 times?

After reading polygenelubricants's series of articles on advanced regular expressions techniques (particularly How does this Java regex detect palindromes?), I decided to attempt to create my own PCRE regex to parse a palindrome, using recursion (in PHP).

What I came up with was:

^(([a-z])(?1)\2|[a-z]?)$

My understanding of this expression is that it should either match zero or one characters (every string of less than 2 characters is implicitly a palindrome, as well as to account for palindromes with odd lengths in the recursion), or two of the same character separated by a recursion of the pattern.

Unfortunately, it does not seem to be working that way, as you can see at www.ideone.com/a9T3F. Instead, only the strings of 2n - 1 (ie. empty string, a, aaa, aaaaaaa, a15) repeated characters match the regular expression.

Oddly, if I modify my pattern so that the recursion is optional (ie. ^(([a-z])(?1)?\2|[a-z]?)$, see www.ideone.com/D6lJR, it only matches strings with a character repeated 2n times (ie. empty string, a, aa, aaaa, aaaaaaaa, a16).

Why is my regex not working the way I expect it to?

Note for the people who are itching to suggest not to use regex:
The point of this question is to learn how to use recursive regular expressions properly. I know that this is not an effective way to determine if a string is a palindrome, and I wouldn't use a recursive regex if I for some reason had to determine palindromes in production code; I am just interested in learning more about the advanced aspects of regex.

like image 545
Daniel Vandersluis Avatar asked Sep 17 '10 20:09

Daniel Vandersluis


People also ask

How do you repeat in regex?

A repeat is an expression that is repeated an arbitrary number of times. An expression followed by '*' can be repeated any number of times, including zero. An expression followed by '+' can be repeated any number of times, but at least once.

Which regex matches one or more digits?

+: one or more ( 1+ ), e.g., [0-9]+ matches one or more digits such as '123' , '000' . *: zero or more ( 0+ ), e.g., [0-9]* matches zero or more digits. It accepts all those in [0-9]+ plus the empty string.

Why * is used in regex?

*. * , returns strings beginning with any combination and any amount of characters (the first asterisk), and can end with any combination and any amount of characters (the last asterisk). This selects every single string available.

Which is faster for loop or regex?

Regex is faster for large string than an if (perhaps in a for loops) to check if anything matches your requirement. If you are using regex as to match very small text and small pattern and don't do it because the matcher function .


1 Answers

The phenomenon you're observing is due to the fact that PCRE subpattern recursion is atomic, unlike Perl. The man page actually covers this problem in great detail:

In PCRE (like Python, but unlike Perl), a recursive subpattern call is always treated as an atomic group. That is, once it has matched some of the subject string, it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure.

This can be illustrated by the following pattern, which purports to match a palindromic string that contains an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):

    ^(.|(.)(?1)\2)$

The idea is that it either matches a single character, or two identical characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE it does not if the pattern is longer than three characters.

Consider the subject string "abcba":

At the top level, the first character is matched, but as it is not at the end of the string, the first alternative fails; the second alternative is taken and the recursion kicks in. The recursive call to subpattern 1 successfully matches the next character ("b"). (Note that the beginning and end of line tests are not part of the recursion).

Back at the top level, the next character ("c") is compared with what subpattern 2 matched, which was "a". This fails. Because the recursion is treated as an atomic group, there are now no backtracking points, and so the entire match fails. (Perl is able, at this point, to re- enter the recursion and try the second alternative.) However, if the pattern is written with the alternatives in the other order, things are different:

    ^((.)(?1)\2|.)$

This time, the recursing alternative is tried first, and continues to recurse until it runs out of characters, at which point the recursion fails. But this time we do have another alternative to try at the higher level. That is the big difference: in the previous case the remaining alternative is at a deeper recursion level, which PCRE cannot use.

To change the pattern so that matches all palindromic strings, not just those with an odd number of characters, it is tempting to change the pattern to this:

    ^((.)(?1)\2|.?)$

Again, this works in Perl, but not in PCRE, and for the same reason. When a deeper recursion has matched a single character, it cannot be entered again in order to match an empty string. The solution is to separate the two cases, and write out the odd and even cases as alternatives at the higher level:

    ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$

WARNING!!!

The palindrome-matching patterns above work only if the subject string does not start with a palindrome that is shorter than the entire string. For example, although "abcba" is correctly matched, if the subject is "ababa", PCRE finds the palindrome "aba" at the start, then fails at top level because the end of the string does not follow. Once again, it cannot jump back into the recursion to try other alternatives, so the entire match fails.

Additional references

  • regular-expressions.info/Atomic grouping
    • (?>…) in some flavor is atomic grouping syntax
    • Lookarounds (?=…), (?!…), (?<=…), (?<!…), are all atomic
    • Possessive quantifier (e.g. a*+) is also atomic
    • PCRE recursive subpattern and subroutine calls are also atomic

A closer look at the pattern

The atomicity argument is correct, but perhaps it's not obvious how it explains why the pattern behaves as observed. Let's take a closer look and see how this all fits:

We will use the first pattern:

^(([a-z])(?1)\2|[a-z]?)$

I will use the following notation to denote the recursion:

  • 1 means the character was captured into group 2 in the first alternate
  • 2 means the character was matched by the second alternate
    • Or if the 2 is not above a character, the zero repetition option of ? is exercised
  • \ means the character was matched by the backreference to group 2 in first alternate
  • _ denotes the bottom of a recursive branch
    • This branch will NOT be reentered even if there are other alternatives!

Now let's consider "aaa" as input:

      _
1 1 1 2 
a a a   # This is the first bottom of the recursion,
        # now we go back to the third 1 and try to match \.
        # This fails, so the third 1 becomes 2.
    _
1 1 2
a a a   # Now we go back to the second 1 and try to match \.
        # This fails, so the second 1 becomes 2.
  _
1 2
a a a   # The second level matched! now we go back to the first level...

_____
1 2 \
a a a   # Now the first 1 can match \, and entire pattern matches!!

Now consider "aaaaa":

          _
1 1 1 1 1 2
a a a a a  # Fifth 1 can't match \, so it becomes 2. 
        _
1 1 1 1 2
a a a a a  # Fourth 1 can't match \, so it becomes 2.
    _____
1 1 1 2 /
a a a a a  # Here's a crucial point. The third 1 successfully matched.
           # Now we're back to the second 1 and try to match \, but this fails.
           # However, since PCRE recursion is atomic, the third 1 will NOT be
           # reentered to try 2. Instead, we try 2 on the second 1.
_____
1 2 \
a a a a a  # Anchors don't match, so the first 1 becomes 2, and then also the
           # anchors don't match, so the pattern fails to match.

Note that once a recursion level matches on the first alternative, the second alternative will not be attempted in the future (even if doing so may result in a may match), because PCRE subpattern recursion is atomic.


Now consider "aa":

    _
1 1 2 
a a
  _
1 2
a a  # The second level matched by taking the one repetition option on ?.
     # We now go back to the first level, and we can't match \.
     # Since PCRE recursion is atomic, we can't go back to the second level
     # to try the zero repetition option on ?.
_    
2
a a  # Anchors don't match, trying zero option on ? also doesn't help,
     # so the pattern fails to match!

Note that once a recursion level matches on the one repetition of the ? on the second alternative, the zero repetition option will not be attempted in the future (even if doing so may result in a may match), because PCRE subpattern recursion is atomic.


Now let's consider aaaaaaa

              _
1 1 1 1 1 1 1 2  
a a a a a a a 
            _
1 1 1 1 1 1 2  
a a a a a a a 
        _____
1 1 1 1 1 2 \  
a a a a a a a  # A crucial point: the fifth level matched and now the fourth
               # level can't match \, but it does NOT reenter the fifth level to
               # try 2. Instead, the fourth level tries 2.
    _____    
1 1 1 2 \  
a a a a a a a 
  _________    
1 1 1 2 \ \ 
a a a a a a a 
_____________    
1 1 1 2 \ \ \
a a a a a a a  # Entire pattern is a match! 

Note that even though PCRE subpattern recursion is atomic, it can still successfully match a palindrome consisting of a character repeating 2n-1 times.


Now, just for fun, let's try "abcba":

          _
1 1 1 1 1 2
a b c b a
        _
1 1 1 1 2
a b c b a

1 1 1 2 
a b c b a   # Third level attempts \, but c does not match a!
            # So we go back to third 1 and try 2.
  _____
1 1 2 \ 
a b c b a 
_________
1 1 2 \ \
a b c b a   # Entire pattern is a match!

That is, the pattern doesn't just match "only when a character repeats 2n-1 times". It can indeed match "abcba" (as seen on ideone.com). It can NOT, however, match "ababa", nor can it match "aaaaa" (see the WARNING on the man page!), because subpattern recursion in PCRE is atomic.

You can apply this same tracing process to explain the behavior of the pattern on any input.

like image 144
polygenelubricants Avatar answered Nov 02 '22 16:11

polygenelubricants