Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does regular expression engine parse regex with recursive subpatterns?

This regular expression matches palindromes: ^((.)(?1)\2|.?)$

Can't wrap my head around how it works. When does the recursion end, and when regex breaks from the recursive subpattern and goes to "|.?" part?

Thanks.

edit: sorry I didn't explain \2 and (?1)

(?1) - refers to first subpattern (to itself)

\2 - back-reference to a match of second subpattern, which is (.)

Above example written in PHP. Matches both "abba" (no mid palindrome character) and "abcba" - has a middle, non-reflected character

like image 802
alexy2k Avatar asked Jul 26 '12 13:07

alexy2k


People also ask

How does regex parsing work?

Regular expression parsing makes finding matches of a regular expression even more useful by allowing us to directly extract subpatterns of the match, e.g., for extracting IP-addresses from internet traffic analysis or extracting subparts of genomes from genetic data bases.

What is regex engine?

A regex engine executes the regex one character at a time in left-to-right order. This input string itself is parsed one character at a time, in left-to-right order. Once a character is matched, it's said to be consumed from the input, and the engine moves to the next input character. The engine is by default greedy.

What is the use of given statement in regular expression a za Z?

Using character sets For example, the regular expression "[ A-Za-z] " specifies to match any single uppercase or lowercase letter. In the character set, a hyphen indicates a range of characters, for example [A-Z] will match any one capital letter.

What will the regular expression match?

By default, regular expressions will match any part of a string. It's often useful to anchor the regular expression so that it matches from the start or end of the string: ^ matches the start of string. $ matches the end of the string.


Video Answer


2 Answers

^((.)(?1)\2|.?)$

The ^ and $ asserts the beginning and the end of the string respectively. Let us look at the content in between, which is more interesting:

((.)(?1)\2|.?)
1------------1 // Capturing group 1
 2-2           // Capturing group 2

Look at the first part (.)(?1)\2, we can see that it will try to match any character, and that same character at the end (back reference \2, which refers to the character matched by (.)). In the middle, it will recursively match for the whole capturing group 1. Note that there is an implicit assertion (caused by (.) matching one character at the beginning and \2 matching the same character at the end) that requires the string to be at least 2 characters. The purpose of the first part is chopping the identical ends of the string, recursively.

Look at second part .?, we can see that it will match one or 0 character. This will only be matched if the string initially has length 0 or 1, or after the leftover from the recursive match is 0 or 1 character. The purpose of the second part is to match the empty string or the single lonely character after the string is chopped from both ends.

The recursive matching works:

  • The whole string must be palindrome to pass, asserted by ^ and $. We cannot start matching from any random position.
  • If the string is <= 1 character, it passes.
  • If the string is > 2 characters, whether it is accepted is decided by the first part only. And it will be chopped by 2 ends if matches.
  • The leftover if matches, can only be chopped by the 2 ends, or passes if its length is <= 1.
like image 114
nhahtdh Avatar answered Sep 19 '22 17:09

nhahtdh


The regex is essentially equivalent to the following pseudo-code:

palin(str) {
    if (length(str) >= 2) {
      first = str[0];
      last = str[length(str)-1];
      return first == last && palin(substr(str, 1, length(str)-2));
    } else
      // empty and single-char trivially palindromes
      return true;
}
like image 38
Barmar Avatar answered Sep 20 '22 17:09

Barmar