Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Matching a^n b^n c^n for n > 0 with PCRE

Tags:

regex

pcre

How would you match a^n b^n c^n for n > 0 with PCRE?

The following cases should match:

abc
aabbcc
aaabbbccc

The following cases should not match:

abbc
aabbc
aabbbccc

Here's what I've "tried"; /^(a(?1)?b)$/gmx but this matches a^n b^n for n > 0:

ab
aabb
aaabbb

Online demo

Note: This question is the same as this one with the change in language.

like image 393
HamZa Avatar asked Apr 25 '15 14:04

HamZa


People also ask

What does PCRE matching do?

PCRE tries to match Perl syntax and semantics as closely as it can. PCRE also supports some alternative regular expression syntax (which does not conflict with the Perl syntax) in order to provide some compatibility with regular expressions in Python, . NET, and Oniguruma.

What PCRE Perl Compatible regular expressions matching does?

The PCRE library is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5. PCRE has its own native API, as well as a set of wrapper functions that correspond to the POSIX regular expression API.

What is the use of \\ in regex?

You also need to use regex \\ to match "\" (back-slash). Regex recognizes common escape sequences such as \n for newline, \t for tab, \r for carriage-return, \nnn for a up to 3-digit octal number, \xhh for a two-digit hex code, \uhhhh for a 4-digit Unicode, \uhhhhhhhh for a 8-digit Unicode.

What is matching regexp?

The regexp-match-positions function takes a regexp pattern and a text string, and it returns a match if the regexp matches (some part of) the text string, or #f if the regexp did not match the string. A successful match produces a list of index pairs.


1 Answers

Qtax trick

(The mighty self-referencing capturing group)

^(?:a(?=a*(\1?+b)b*(\2?+c)))+\1\2$

This solution is also named "The Qtax trick" because it uses the same technique as from "vertical" regex matching in an ASCII "image" by Qtax.


The problem in question burns down to a need to assert that three groups are matched of the same length. As a simplified version, to match:

xyz

Where x, y and z are really just subpatterns with a variable with matching length n of a, b and c. With an expression that uses lookaheads with self-referencing capturing groups, a character we specify is added to each repetition of the lookahead, which can effectively be used to "count":

aaabbbccc
 ^  ^  ^

This is achieved by the following:

  • (?:a…)+ A character of subpattern a is matched. With (?=a*, we skip directly to the "counter".
  • (\1?+b) Capturing group (\1) effectively consumes whatever has previously been matched, if it is there, and uses a possessive match which does not permit backtracking, and the match fails if the counter goes out of sync - That is, there has been more of subpatterns b than subpattern a. On the first iteration, this is absent, and nothing is matched. Then, a character of subpattern b is matched. It is added to the capturing group, effectively "counting" one of b in the group. With b*, we skip directly to the next "counter".
  • (\2?+c) Capturing group (\2) effectively consumes whatever has previously been matched just like the above. Because this additional character capture works just like the previous group, characters are allowed to sync up in length within these character groups. Assuming continuous sequences of a..b..c..:

(Excuse my art.)

First iteration:

| The first 'a' is matched by the 'a' in '^(?:a…)'.
| The pointer is stuck after it as we begin the lookahead.
v,- Matcher pointer
aaaa...bbbbbbbb...cccc...
 ^^^   |^^^       ^
skipped| skipped  Matched by c in (\2?+c);
by a*  | by b*         \2 was "nothing",
       |               now it is "c".
       Matched by b
       in (\1?+b).
     \1 was "nothing", now it is "b".

Second iteration:

 | The second 'a' is matched by the 'a' in '^(?:a…)'.
 | The pointer is stuck after it as we begin the lookahead.
 v,- Matcher pointer
aaaa...bbbbbbbb...cccc...
       /|^^^      |^
eaten by| skipped |Matched by c in (\2?+c);
\1?+    | by b*   |     '\2' was "nothing",
  ^^    |      \2?+     now it is "cc".
 skipped|
 by a*  \ Matched by b
          in (\1?+b).
          '\1' was "nothing", now it is "bb".

As the three groups discussed above "consumes" one of each of a, b, c respectively, they are matched in round-robin style and "counted" by the (?:a…)+, (\1?+b) and (\2?+c) groups respectively. With the additional anchoring and capturing what we started, we can assert that we match xyz (Representing each group above) where x, y and z are an, bn and cn respectively.


As a bonus, to "count" more, one can do this:

Pattern: ^(?:a(?=a*(\1?+b)b*(\2?+c)))+\1{3}\2$
Matches: abbbc
aabbbbbbcc
aaabbbbbbbbbccc
Pattern: ^(?:a(?=a*(\1?+bbb)b*(\2?+c)))+\1\2$
Matches: abbbc
aabbbbbbcc
aaabbbbbbbbbccc
like image 124
Unihedron Avatar answered Sep 24 '22 12:09

Unihedron