PCRE has a feature called recursive pattern, which can be used to match nested subgroups. For example, consider the "grammar"
Q -> \w | '[' A ';' Q* ','? Q* ']' | '<' A '>'
A -> (Q | ',')*
// to match ^A$.
It can be done in PCRE with the pattern
^((?:,|(\w|\[(?1);(?2)*,?(?2)*\]|<(?1)>))*)$
(Example test case: http://www.ideone.com/L4lHE)
abcdefg abc,def,ghi abc,,,def ,,,,,, [abc;] [a,bc;] sss[abc;d] as[abc;d,e] [abc;d,e][fgh;j,k]
<abc> [<a>b;<c,d>,<e,f>] <a,b,c> <a,bb,c> <,,,> <> <><> <>,<> a<<<<>>><a>> <<<<<>>>><><<<>>>>
<z>[a;b] <z[a;b]> [[;];] [,;,] [;[;]] [<[;]>;<[;][;,<[;,]>]>]
<a bc> <abc<de> [a<b;c>;d,e] [a] <<<<<>>>><><<<>>>>> <<<<<>>>><><<<>>> [abc;def;] [[;],] [;,,] [abc;d,e,f]
[<[;]>;<[;][;,<[;,]>]]> <z[a;b>]
There is no recursive pattern in .NET. Instead, it provides balancing groups for stack-based manipulation for matching simple nested patterns.
Is it possible to convert the above PCRE pattern into .NET Regex style?
(Yes I know it's better not to use regex in for this. It's just a theoretical question.)
The answer is (probably) Yes.
The technique is much more complex than the (?1) recursive call, but the result is  almost 1-to-1 with the rules of the grammar - I worked in a such methodical way I can easily see it scripted. Basically, you match block-by-block, and use the stack to keep track of where you are. This is an almost working solution:
^(?:
    (\w(?<Q>)) # Q1
    |
    (<(?<Angle>))  #Q2 - start <
    |
    (\>(?<-Angle>)(?<-A>)?(?<Q>))  #Q2 - end >, match Q
    |
    (\[(?<Block>))  # Q3 start - [
    |
    (;(?<Semi-Block>)(?<-A>)?)  #Q3 - ; after [
    |
    (\](?<-Semi>)(?<-Q>)*(?<Q>))  #Q3 ] after ;, match Q
    |
    ((,|(?<-Q>))*(?<A>))   #Match an A group
)*$
# Post Conditions
(?(Angle)(?!))
(?(Block)(?!))
(?(Semi)(?!))
It is missing the part of allowing commas in Q->[A;Q*,?Q*], and for some reason allows [A;A], so it matches [;,,] and [abc;d,e,f]. Rest of the strings match the same as the test cases.
Another minor point is an issue with pushing to the stack with an empty capture - it doesn't. A accepts Ø, so I had to use (?<-A>)? to check if it captured.
The whole regex should look like this, but again, it is useless with the bug there.
There is not way of synchronizing the stacks: if I push (?<A>) and (?<B>), I can pop them in any order. That is why this pattern cannot differentiate <z[a;b>] from <z[a;b]>... we need one stack for all.
This can be solved for simple cases, but here we have something much more complicate - A whole Q or A, not just "<" or "[".
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With