Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Match a^n b^n c^n (e.g. "aaabbbccc") using regular expressions (PCRE)

Tags:

regex

php

pcre

It is a well known fact that modern regular expression implementations (most notably PCRE) have little in common with the original notion of regular grammars. For example you can parse the classical example of a context-free grammar {anbn; n>0} (e.g. aaabbb) using this regex (demo):

~^(a(?1)?b)$~ 

My question is: How far can you go? Is it also possible to parse the context-sensitive grammar {anbncn;n>0} (e.g. aaabbbccc) using PCRE?

like image 383
NikiC Avatar asked Sep 15 '11 16:09

NikiC


People also ask

How do you match a regular expression?

To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" .

What is PCRE format?

PCRE StandardMatch a single character of any value, except end of line. Match the start of the line, as in ^A. Match the end of the line, as in A$. Match either the regular expression preceding it or the regular expression following it.

What does regex 0 * 1 * 0 * 1 * Mean?

Basically (0+1)* mathes any sequence of ones and zeroes. So, in your example (0+1)*1(0+1)* should match any sequence that has 1. It would not match 000 , but it would match 010 , 1 , 111 etc. (0+1) means 0 OR 1.

What does '$' mean in regex?

$ means "Match the end of the string" (the position after the last character in the string).


2 Answers

Inspired by NullUserExceptions answer (which he already deleted as it failed for one case) I think I have found a solution myself:

$regex = '~^     (?=(a(?-1)?b)c)      a+(b(?-1)?c) $~x';  var_dump(preg_match($regex, 'aabbcc'));    // 1 var_dump(preg_match($regex, 'aaabbbccc')); // 1 var_dump(preg_match($regex, 'aaabbbcc'));  // 0 var_dump(preg_match($regex, 'aaaccc'));    // 0 var_dump(preg_match($regex, 'aabcc'));     // 0 var_dump(preg_match($regex, 'abbcc'));     // 0 

Try it yourself: http://codepad.viper-7.com/1erq9v


Explanation

If you consider the regex without the positive lookahead assertion (the (?=...) part), you have this:

~^a+(b(?-1)?c)$~ 

This does nothing more than check that there's an arbitrary number of as, followed by an equal number of bs and cs.

This doesn't yet satisfy our grammar, because the number of as must be the same, too. We can ensure that by checking that the number of as equals the number of bs. And this is what the expression in the lookahead assertion does: (a(?-1)?b)c. The c is necessary so we don't only match a part of the bs.


Conclusion

I think this impressively shows that modern regex is not only capable of parsing non-regular grammars, but can even parse non-context-free grammars. Hopefully this will lay to rest the endless parroting of "you can't do X with regex because X isn't regular"

like image 160
NikiC Avatar answered Oct 01 '22 20:10

NikiC


Here is an alternative solution using balancing groups with .NET regex:

^(?'a'a)+(?'b-a'b)+(?(a)(?!))(?'c-b'c)+(?(b)(?!))$ 

Not PCRE, but may be of interest.

Example at: http://ideone.com/szhuE

Edit: Added the missing balancing check for the group a, and an online example.

like image 25
Qtax Avatar answered Oct 01 '22 21:10

Qtax