Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Capturing inner items using .net Regex Balanced Matching

Tags:

.net

regex

I have found the following resources on Balanced Matching for .net Regexes:

  • http://weblogs.asp.net/whaggard/archive/2005/02/20/377025.aspx
  • http://blogs.msdn.com/bclteam/archive/2005/03/15/396452.aspx
  • http://msdn.microsoft.com/en-us/library/bs2twtah(VS.85).aspx#BalancingGroupDefinitionExample

From what I have read in these, the following example should work:

This regex should find an "a" anywhere within an angle-bracket group, no matter how deep. It should match "<a>", "<<a>>", "<a<>>", "<<>a>", "<<><a>>", etc.

(?<=
    ^
    (
        (
            <(?<Depth>)
            |
            >(?<-Depth>)
        )
        [^<>]*?
    )+?
)
(?(Depth)a|(?!))

matching on the "a" in the string "<<>a>"

While it will work for strings "<a<>>" and "<<a>>", I can't get it to match an "a" that is following a ">".

According to the explanations I have read, the first two "<"s should increment Depth twice, then the first ">" should decrement it once. At this point, (?(Depth)a|(?!)) should perform the "yes" option, but the regex never even makes it here.

Consider the following regex, which makes no such check and still fails to match the string in question:

(?<=
    ^
    (
        (
            <(?<Depth>)
            |
            >(?<-Depth>)
        )
        [^<>]*?
    )+?
)
a

Am I missing something, or is the regex engine working incorrectly?

like image 447
Paul Williams Avatar asked Dec 22 '22 09:12

Paul Williams


1 Answers

You need to remember that a lookbehind only scans as far back as it has to in order to satisfy its embedded regex. The expression in your lookbehind is only required to match an angle bracket, so it only looks as far back as the latest one. If it's a left angle bracket, (?<Depth>) pushes an empty string onto the stack represented by that capture group. But if it's a right angle bracket...

It is worth mentioning that if no named group N exists when trying to pop (<-N>) then it will fail... *

In other words it's not the conditional expression -- (?(Depth)a|(?!)) -- that's making your regex fail (as you observed), it's the attempt to "decrement" the "counter". As far as I can tell, your regex is exactly equivalent to

(?<=<[^<>]*)a

So, to answer your question, .NET's balanced-construct matching is not broken. Byzantine yes, but broken, no. :D

like image 158
Alan Moore Avatar answered Feb 12 '23 13:02

Alan Moore