
How do conditionals in lookaround groups work in .NET regex?

Tags: c#, regex, theory

Playing around with regular expressions, especially the balanced matching of the .NET flavor, I came to a point where I realized that I do not understand the inner workings of the engine as well as I thought I did. I'd appreciate any input on why my patterns behave the way they do! But first...

Disclaimer: This question is purely theoretical, and any result obtained here will never be used, or modified and used in production code to parse HTML. Ever. I promise. I do fear the pony. =)

Now to my problem. I'll try to match the letter A if it is not preceded by a # anywhere earlier in the string. To demonstrate, I'll always use the string "..A..#..A..". Here, only the first A should be matched. Of course, this is quite an easy task using "A(?<!^.*#.*)", but I wish to use conditionals here, since they can be used for balanced matching and other cool things.

What I tried is

"A(?<=^(#(?<q>)|[^#])*(?(q)(?!)))"

The way I interpret it is: when the engine encounters an "A", it goes back to the start of the string and, for every character, adds an empty match to the capturing group q if the character is a #. Then it should fail if q contains a match. What I don't understand is why this expression matches both As in my sample string.

When I simply remove the lookbehind and match the whole string, this works:

"^(#(?<q>)|[^#])*(?(q)(?!))A"

matches the whole string up to the first A, even though the first group's quantifier is greedy. Inserting a '#' at the beginning will also cause the match to fail (as desired).
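To make the three patterns above easy to compare side by side, here is a minimal sketch that runs each of them against the sample string. The class and helper names (LookbehindDemo, MatchIndices) are illustrative assumptions and are not part of the question; the results in the comments are the ones described in the text above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Start index of every match of `pattern` in `input`.
    // (Hypothetical helper, introduced only for this illustration.)
    public static int[] MatchIndices(string pattern, string input) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Index).ToArray();

    public static void Main()
    {
        const string sample = "..A..#..A..";

        // Plain negative lookbehind: only the first A (index 2) qualifies.
        Console.WriteLine(string.Join(",", MatchIndices(@"A(?<!^.*#.*)", sample)));

        // Conditional inside the lookbehind: both As (indices 2 and 8) match,
        // which is the behavior the question is asking about.
        Console.WriteLine(string.Join(",",
            MatchIndices(@"A(?<=^(#(?<q>)|[^#])*(?(q)(?!)))", sample)));

        // Same subpattern without the lookbehind: one match starting at index 0.
        Console.WriteLine(string.Join(",",
            MatchIndices(@"^(#(?<q>)|[^#])*(?(q)(?!))A", sample)));

        // A leading '#' makes the anchored pattern fail: no indices are printed.
        Console.WriteLine(string.Join(",",
            MatchIndices(@"^(#(?<q>)|[^#])*(?(q)(?!))A", "#" + sample)));
    }
}
```

Printing match indices rather than matched text keeps the comparison simple, since the first two patterns always match a single "A".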

So: how do lookaround groups, named capturing groups within them, and conditionals play together?

Thanks!

Edit: This problem can be seen more easily in "(?<=(?<q>)(?(q)(?!))).", which should not match any character, but matches everything.
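The reduced pattern can be checked with a few lines of C#. This is only a sketch: the names MinimalRepro and CountMatches are made up for illustration. It also runs a variant with the two pieces inside the lookbehind swapped. That variant is not from the question. It is included on the assumption, commonly given as the explanation for this behavior, that .NET evaluates the contents of a lookbehind from right to left, so the conditional would be tested before the empty capture into q has happened.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MinimalRepro
{
    // Number of matches of `pattern` in `input` (hypothetical helper).
    public static int CountMatches(string pattern, string input) =>
        Regex.Matches(input, pattern).Count;

    public static void Main()
    {
        const string input = "abc";

        // Pattern from the question: matches every character of the input.
        Console.WriteLine(CountMatches(@"(?<=(?<q>)(?(q)(?!))).", input)); // 3

        // Variant with the capture and the conditional swapped (an assumption
        // added for comparison). If the lookbehind is evaluated right to left,
        // the capture runs first here and the conditional then sees q as set.
        Console.WriteLine(CountMatches(@"(?<=(?(q)(?!))(?<q>)).", input));
    }
}
```

Comparing the two counts is a quick way to see whether the order of the elements inside the lookbehind, rather than the conditional itself, is what changes the result.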

asked Jul 14 '10 13:07 by Jens


1 Answer

Conditionals aren't really that useful in balanced matching, or anywhere else for that matter. ;) Balanced matching works by using a named capture group as a stack: every time that group matches something, the matched text is pushed onto the stack. There's also special syntax for popping the stack. Here's a good introduction:

http://blog.stevenlevithan.com/archives/balancing-groups
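As a rough illustration of the stack idea described above, the sketch below checks whether parentheses are balanced. The group name "open" and the names BalanceDemo and IsBalanced are assumptions chosen for this example and do not come from the linked article. (?<open>\() pushes a capture, (?<-open>\)) pops one and fails if the stack is empty, and the leftover captures are inspected in code after the match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalanceDemo
{
    // Each '(' pushes onto the "open" stack; each ')' pops from it.
    // A ')' with nothing to pop cannot be consumed, so the anchored match fails.
    private static readonly Regex Parens =
        new Regex(@"^(?:[^()]|(?<open>\()|(?<-open>\)))*$");

    // Balanced means the whole input matched and no '(' is left on the stack.
    public static bool IsBalanced(string input)
    {
        Match m = Parens.Match(input);
        return m.Success && m.Groups["open"].Captures.Count == 0;
    }

    public static void Main()
    {
        Console.WriteLine(IsBalanced("(a(b)c)")); // True
        Console.WriteLine(IsBalanced("(a(b)c"));  // False: one '(' never popped
        Console.WriteLine(IsBalanced("a)b"));     // False: pop on an empty stack
    }
}
```

Checking the remaining captures in code is one design choice; the same check is often written inside the pattern itself as a trailing (?(open)(?!)).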

answered Sep 22 '22 06:09 by Alan Moore