Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does this PCRE pattern detect palindromes?

Tags:

This question is an educational demonstration of the usage of lookahead, nested reference, and conditionals in a PCRE pattern to match ALL palindromes, including the ones that can't be matched by the recursive pattern given in the PCRE man page.

Examine this PCRE pattern in PHP snippet:

$palindrome = '/(?x)
^
  (?:
      (.) (?=
              .*
              (
                \1
                (?(2) \2 | )
              )
              $
          )
  )*
  .?
  \2?
$


/';

This pattern seems to detect palindromes, as seen in this test cases (see also on ideone.com):

$tests = array(
  # palindromes
  '',
  'a',
  'aa',
  'aaa',
  'aba',
  'aaaa',
  'abba',
  'aaaaa',
  'abcba',
  'ababa',

  # non-palindromes
  'aab',
  'abab',
  'xyz',
);

foreach ($tests as $test) {
  echo sprintf("%s '%s'\n", preg_match($palindrome, $test), $test);  
}

So how does this pattern work?


Notes

This pattern uses a nested reference, which is a similar technique used in How does this Java regex detect palindromes?, but unlike that Java pattern, there's no lookbehind (but it does use a conditional).

Also, note that the PCRE man page presents a recursive pattern to match some palindromes:

# the recursive pattern to detect some palindromes from PCRE man page
^(?:((.)(?1)\2|)|((.)(?3)\4|.))$

The man page warns that this recursive pattern can NOT detect all palindromes (see: Why will this recursive regex only match when a character repeats 2n - 1 times? and also on ideone.com), but the nested reference/positive lookahead pattern presented in this question can.

like image 948
polygenelubricants Avatar asked Sep 19 '10 16:09

polygenelubricants


2 Answers

Let's try to understand the regex by constructing it. Firstly, a palindrome must start and end with the same sequence of character in the opposite direction:

^(.)(.)(.) ... \3\2\1$

we want to rewrite this such that the ... is only followed by a finite length of patterns, so that it could be possible for us to convert it into a *. This is possible with a lookahead:

^(.)(?=.*\1$)
 (.)(?=.*\2\1$)
 (.)(?=.*\3\2\1$) ...

but there are still uncommon parts. What if we can "record" the previously captured groups? If it is possible we could rewrite it as:

^(.)(?=.*(?<record>\1\k<record>)$)   # \1     = \1 + (empty)
 (.)(?=.*(?<record>\2\k<record>)$)   # \2\1   = \2 + \1
 (.)(?=.*(?<record>\3\k<record>)$)   # \3\2\1 = \3 + \2\1
 ...

which could be converted into

^(?: 
    (.)(?=.*(\1\2)$)
 )*

Almost good, except that \2 (the recorded capture) is not empty initially. It will just fail to match anything. We need it to match empty if the recorded capture doesn't exist. This is how the conditional expression creeps in.

(?(2)\2|)   # matches \2 if it exist, empty otherwise.

so our expression becomes

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*

Now it matches the first half of the palindrome. How about the 2nd half? Well, after the 1st half is matched, the recorded capture \2 will contain the 2nd half. So let's just put it in the end.

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*\2$

We want to take care of odd-length palindrome as well. There would be a free character between the 1st and 2nd half.

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*.?\2$

This works good except in one case — when there is only 1 character. This is again due to \2 matches nothing. So

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*.?\2?$
#      ^ since \2 must be at the end in the look-ahead anyway.
like image 130
kennytm Avatar answered Oct 02 '22 13:10

kennytm


I want to bring my very own solution to the table. This is a regex that I've written a while ago to solve matching palindromes using PCRE/PCRE2

^((\w)(((\w)(?5)\5?)*|(?1)|\w?)\2)$

Example: https://regex101.com/r/xvZ1H0/1

like image 36
horsefez Avatar answered Oct 02 '22 14:10

horsefez