I'm trying to parse string in the following format (EBNF, I hope this is right) in PHP:
<exp> ::= <base>[{<modifier>["!"]"("<exp>")"}]
<base> ::= <role>[{<modifier><role>}]
<modifier> ::= "&" | "|"
<role> ::= ["!"]<str>[","<str>]
Where <str>
is any string that would pass [a-zA-Z0-9\-]+
The following are example of patterns that would have to be parsed:
token1
token1&token2
token1|(token2&!token3)
(token1&token2)|(token3&(token4|(!token5,12&token6)))
!(token1&token2|(token3&!token4))|token5,12
I am trying to write a RegEx pattern that would always give me four groups:
<expression>
. From the above example this would be:
token1
token1
token1
token1&token2
token1&token2|(token3&!token4)
["!"]
was present. I.e.
null
null
null
null
!
<modifier>
for the next <expression>
(if any). This would be:
null
&
|
|
|
null
token2
token2&!token3
token3&(token4|(!token5,12&token6))
token5,12
I can parse this provided that the first expression doesn't contain any <modifier>
s.
^\(?(!?)([a-zA-Z0-9\-]+)\)?([&|]?)(.*)$
I am stuck at this point. I have tried using lookarounds, however I can't figure out how to ensure that the group is captured when all brackets are balanced. Is this achievable with RegEx or do I need to write code using loops etc. to do this?
As far as I know, it is impossible.
You have a context-free grammar (EBNF is for this type of grammars - Type-2 grammars), which cannot be parsed with regular expressions (which are for regular grammars - Type-3 grammars).
http://en.wikipedia.org/wiki/Chomsky_hierarchy
As an example of the thing you cannot handle here: number of opening parantheses - you can only write one regexp for each number of these (but there can be infinite, right?), otherwise there is no way to tell if the number of matching closing parantheses is the same. There is no way to count how many chars mathed by the specific part of regexp with quantifiers (+
, *
, etc.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With