PCRE has a feature called recursive pattern, which can be used to match nested subgroups. For example, consider the "grammar"
Q -> \w | '[' A ';' Q* ','? Q* ']' | '<' A '>'
A -> (Q | ',')*
// to match ^A$.
It can be done in PCRE with the pattern
^((?:,|(\w|\[(?1);(?2)*,?(?2)*\]|<(?1)>))*)$
(Example test case: http://www.ideone.com/L4lHE)
abcdefg
abc,def,ghi
abc,,,def
,,,,,,
[abc;]
[a,bc;]
sss[abc;d]
as[abc;d,e]
[abc;d,e][fgh;j,k]
<abc>
[<a>b;<c,d>,<e,f>]
<a,b,c>
<a,bb,c>
<,,,>
<>
<><>
<>,<>
a<<<<>>><a>>
<<<<<>>>><><<<>>>>
<z>[a;b]
<z[a;b]>
[[;];]
[,;,]
[;[;]]
[<[;]>;<[;][;,<[;,]>]>]
<a
bc>
<abc<de>
[a<b;c>;d,e]
[a]
<<<<<>>>><><<<>>>>>
<<<<<>>>><><<<>>>
[abc;def;]
[[;],]
[;,,]
[abc;d,e,f]
[<[;]>;<[;][;,<[;,]>]]>
<z[a;b>]
There is no recursive pattern in .NET. Instead, it provides balancing groups for stack-based manipulation for matching simple nested patterns.
Is it possible to convert the above PCRE pattern into .NET Regex style?
(Yes I know it's better not to use regex in for this. It's just a theoretical question.)
The answer is (probably) Yes.
The technique is much more complex than the (?1)
recursive call, but the result is almost 1-to-1 with the rules of the grammar - I worked in a such methodical way I can easily see it scripted. Basically, you match block-by-block, and use the stack to keep track of where you are. This is an almost working solution:
^(?:
(\w(?<Q>)) # Q1
|
(<(?<Angle>)) #Q2 - start <
|
(\>(?<-Angle>)(?<-A>)?(?<Q>)) #Q2 - end >, match Q
|
(\[(?<Block>)) # Q3 start - [
|
(;(?<Semi-Block>)(?<-A>)?) #Q3 - ; after [
|
(\](?<-Semi>)(?<-Q>)*(?<Q>)) #Q3 ] after ;, match Q
|
((,|(?<-Q>))*(?<A>)) #Match an A group
)*$
# Post Conditions
(?(Angle)(?!))
(?(Block)(?!))
(?(Semi)(?!))
It is missing the part of allowing commas in Q->[A;Q*,?Q*]
, and for some reason allows [A;A]
, so it matches [;,,]
and [abc;d,e,f]
. Rest of the strings match the same as the test cases.
Another minor point is an issue with pushing to the stack with an empty capture - it doesn't. A
accepts Ø, so I had to use (?<-A>)?
to check if it captured.
The whole regex should look like this, but again, it is useless with the bug there.
There is not way of synchronizing the stacks: if I push (?<A>)
and (?<B>)
, I can pop them in any order. That is why this pattern cannot differentiate <z[a;b>]
from <z[a;b]>
... we need one stack for all.
This can be solved for simple cases, but here we have something much more complicate - A whole Q
or A
, not just "<" or "[".
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With