Match a^n b^n c^n (e.g. "aaabbbccc") using regular expressions (PCRE)

Tags:

regex

php

pcre

It is a well-known fact that modern regular expression implementations (most notably PCRE) have little in common with the original notion of regular grammars. For example, you can parse the classical example of a context-free grammar {aⁿbⁿ; n>0} (e.g. aaabbb) using this regex:

~^(a(?1)?b)$~
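As a quick check of the pattern above, here is a minimal PHP sketch. The helper name matchesAnBn is made up for illustration and is not part of the original post; the only thing it relies on is that (?1) recurses into capture group 1.

```php
<?php
// Hypothetical helper (name is an assumption, not from the post):
// tests membership in {a^n b^n; n > 0}.
function matchesAnBn(string $s): bool
{
    // (?1) recurses into group 1, so every "a" has to be closed by a "b".
    return preg_match('~^(a(?1)?b)$~', $s) === 1;
}

var_dump(matchesAnBn('aaabbb')); // bool(true)
var_dump(matchesAnBn('aabbb'));  // bool(false): one "b" too many
var_dump(matchesAnBn(''));       // bool(false): n must be > 0
```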

My question is: How far can you go? Is it also possible to parse the context-sensitive grammar {aⁿbⁿcⁿ; n>0} (e.g. aaabbbccc) using PCRE?

383

asked Sep 15 '11 16:09

NikiC

2 Answers

Inspired by NullUserException's answer (which he has already deleted, as it failed for one case), I think I have found a solution myself:

$regex = '~^
    (?=(a(?-1)?b)c)
    a+(b(?-1)?c)
$~x';

var_dump(preg_match($regex, 'aabbcc'));    // 1
var_dump(preg_match($regex, 'aaabbbccc')); // 1
var_dump(preg_match($regex, 'aaabbbcc'));  // 0
var_dump(preg_match($regex, 'aaaccc'));    // 0
var_dump(preg_match($regex, 'aabcc'));     // 0
var_dump(preg_match($regex, 'abbcc'));     // 0

Try it yourself: http://codepad.viper-7.com/1erq9v

Explanation

If you consider the regex without the positive lookahead assertion (the (?=...) part), you have this:

~^a+(b(?-1)?c)$~

This does nothing more than check that there are one or more a's, followed by b's and c's in equal numbers.
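To see why that is not enough on its own, here is a small sketch that runs the lookahead-free pattern against a few inputs. The variable name $withoutLookahead is just for illustration; the pattern itself is the one quoted above.

```php
<?php
// The pattern from above, without the lookahead.
// (?-1) is a relative reference; inside group 1 it recurses into group 1 itself.
$withoutLookahead = '~^a+(b(?-1)?c)$~';

var_dump(preg_match($withoutLookahead, 'aabbcc')); // int(1)
var_dump(preg_match($withoutLookahead, 'abbcc'));  // int(1): too few a's, still accepted
var_dump(preg_match($withoutLookahead, 'aaaabc')); // int(1): too many a's, still accepted
var_dump(preg_match($withoutLookahead, 'aabbc'));  // int(0): b's and c's differ
```

So the b/c balance is enforced, but the count of a's is not tied to anything yet.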

This doesn't yet satisfy our grammar, because the number of a's must be the same, too. We can ensure that by checking that the number of a's equals the number of b's. And this is what the expression in the lookahead assertion does: (a(?-1)?b)c. The c is necessary so that the lookahead cannot match just a part of the b's.
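The role of the trailing c in the lookahead is easiest to see by leaving it out. The following sketch compares the full pattern with a variant that lacks the c; the variable names $full and $noTrailingC are assumptions made for this example only.

```php
<?php
// Full pattern from the answer (written without the x modifier).
$full = '~^(?=(a(?-1)?b)c)a+(b(?-1)?c)$~';
// Same pattern, but the lookahead no longer requires a "c" right after the balanced a/b run.
$noTrailingC = '~^(?=(a(?-1)?b))a+(b(?-1)?c)$~';

// Two a's, three b's, three c's: not in the language.
var_dump(preg_match($full, 'aabbbccc'));        // int(0)
var_dump(preg_match($noTrailingC, 'aabbbccc')); // int(1): lookahead is satisfied by "aabb" alone

// A valid string is accepted by both.
var_dump(preg_match($full, 'aaabbbccc'));        // int(1)
var_dump(preg_match($noTrailingC, 'aaabbbccc')); // int(1)
```

Without the c, the lookahead is happy as soon as it finds a balanced prefix like aabb, even if more b's follow.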

Conclusion

I think this impressively shows that modern regex is not only capable of parsing non-regular grammars, but can even parse non-context-free grammars. Hopefully this will lay to rest the endless parroting of "you can't do X with regex because X isn't regular".

160

answered Oct 01 '22 20:10

NikiC

Here is an alternative solution using balancing groups with .NET regex:

^(?'a'a)+(?'b-a'b)+(?(a)(?!))(?'c-b'c)+(?(b)(?!))$

Not PCRE, but may be of interest.

Example at: http://ideone.com/szhuE

Edit: Added the missing balancing check for the group a, and an online example.

answered Oct 01 '22 21:10

Qtax
