Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

(PHP) Parsing RegEx string - balancing brackets

Tags:

regex

php

I'm trying to parse string in the following format (EBNF, I hope this is right) in PHP:

<exp>      ::= <base>[{<modifier>["!"]"("<exp>")"}]
<base>     ::= <role>[{<modifier><role>}]
<modifier> ::= "&" | "|"
<role>     ::= ["!"]<str>[","<str>]

Where <str> is any string that would pass [a-zA-Z0-9\-]+

The following are example of patterns that would have to be parsed:

token1
token1&token2
token1|(token2&!token3)
(token1&token2)|(token3&(token4|(!token5,12&token6)))
!(token1&token2|(token3&!token4))|token5,12

I am trying to write a RegEx pattern that would always give me four groups:

  1. The left-most <expression>. From the above example this would be:
    • token1
    • token1
    • token1
    • token1&token2
    • token1&token2|(token3&!token4)
  2. If ["!"] was present. I.e.
    • null
    • null
    • null
    • null
    • !
  3. The <modifier> for the next <expression> (if any). This would be:
    • null
    • &
    • |
    • |
    • |
  4. The remaining of the pattern.
    • null
    • token2
    • token2&!token3
    • token3&(token4|(!token5,12&token6))
    • token5,12

I can parse this provided that the first expression doesn't contain any <modifier>s.

^\(?(!?)([a-zA-Z0-9\-]+)\)?([&|]?)(.*)$

I am stuck at this point. I have tried using lookarounds, however I can't figure out how to ensure that the group is captured when all brackets are balanced. Is this achievable with RegEx or do I need to write code using loops etc. to do this?

like image 424
Bart Platak Avatar asked Jul 07 '12 20:07

Bart Platak


1 Answers

As far as I know, it is impossible.

You have a context-free grammar (EBNF is for this type of grammars - Type-2 grammars), which cannot be parsed with regular expressions (which are for regular grammars - Type-3 grammars).

http://en.wikipedia.org/wiki/Chomsky_hierarchy

As an example of the thing you cannot handle here: number of opening parantheses - you can only write one regexp for each number of these (but there can be infinite, right?), otherwise there is no way to tell if the number of matching closing parantheses is the same. There is no way to count how many chars mathed by the specific part of regexp with quantifiers (+, *, etc.)

like image 143
scriptin Avatar answered Oct 13 '22 21:10

scriptin