I was just reading a question about how to get data inside double curly braces (this question), and then someone brought up balancing groups. I'm still not quite sure what they are and how to use them. I read through Balancing Group Definition, but the explanation is hard to follow, and I'm still quite confused on the questions that I mentioned. Could someone simply explain what balancing groups are and how they are useful?

Just a small addition to M. Buettner's excellent answer: <h3>What's the deal with the <code>(?<A-B>)</code> syntax?</h3> <code>(?<A-B>x)</code> is subtly different from <code>(?<-A>(?x))</code>. They result in the same control flow*, but they capture differently. For example, let's look at a pattern for balanced braces: <pre class="prettyprint"><code>(?:[^{}]|(?{)|(?<-B>}))+(?(B)(?!)) </code></pre> At the end of the match we do have a balanced string, but that is all we have - we don't know where the braces are because the <code>B</code> stack is empty. The hard work the engine did for us is gone. (example on Regex Storm) <code>(?<A-B>x)</code> is the solution for that problem. How? It doesn't capture <code>x</code> into <code>$A</code>: it captures the content between the previous capture of <code>B</code> and the current position. Let's use it in our pattern: <pre class="prettyprint"><code>(?:[^{}]|(?<Open>{)|(?<Content-Open>}))+(?(Open)(?!)) </code></pre> This would capture into <code>$Content</code> the strings between the braces (and their positions), for each pair along the way. For the string <code>{1 2 {3} {4 5 {6}} 7}</code> there'd be four captures: <code>3</code>, <code>6</code> ,<code>4 5 {6}</code> , and <code>1 2 {3} {4 5 {6}} 7</code> - much better than nothing or <code>}</code> <code>}</code> <code>}</code> <code>}</code>. (example - click the <code>table</code> tab and look at <code>${Content}</code>, captures) In fact, it can be used without balancing at all: <code>(?<A>).(.(?<Content-A>).)</code> captures the first two characters, even though they are separated by groups. (a lookahead is more commonly used here but it doesn't always scale: it may duplicate your logic.) <code>(?<A-B>)</code> is a strong feature - it gives you exact control over your captures. Keep that in mind when you're trying to get more out of your pattern.

What are regular expression Balancing Groups?

Tags:

c#

.net

regex

balancing-groups

I was just reading a question about how to get data inside double curly braces (this question), and then someone brought up balancing groups. I'm still not quite sure what they are and how to use them.

I read through Balancing Group Definition, but the explanation is hard to follow, and I'm still quite confused on the questions that I mentioned.

Could someone simply explain what balancing groups are and how they are useful?

356

asked Jun 08 '13 21:06

It'sNotALie.

2 Answers

As far as I know, balancing groups are unique to .NET's regex flavor.

Aside: Repeated Groups

First, you need to know that .NET is (again, as far as I know) the only regex flavor that lets you access multiple captures of a single capturing group (not in backreferences but after the match has completed).

To illustrate this with an example, consider the pattern

(.)+

and the string "abcd".

in all other regex flavors, capturing group 1 will simply yield one result: d (note, the full match will of course be abcd as expected). This is because every new use of the capturing group overwrites the previous capture.

.NET on the other hand remembers them all. And it does so in a stack. After matching the above regex like

Match m = new Regex(@"(.)+").Match("abcd");

you will find that

m.Groups[1].Captures

Is a CaptureCollection whose elements correspond to the four captures

0: "a" 1: "b" 2: "c" 3: "d"

where the number is the index into the CaptureCollection. So basically every time the group is used again, a new capture is pushed onto the stack.

It gets more interesting if we are using named capturing groups. Because .NET allows repeated use of the same name we could write a regex like

(?<word>\w+)\W+(?<word>\w+)

to capture two words into the same group. Again, every time a group with a certain name is encountered, a capture is pushed onto its stack. So applying this regex to the input "foo bar" and inspecting

m.Groups["word"].Captures

we find two captures

0: "foo" 1: "bar"

This allows us to even push things onto a single stack from different parts of the expression. But still, this is just .NET's feature of being able to track multiple captures which are listed in this CaptureCollection. But I said, this collection is a stack. So can we pop things from it?

Enter: Balancing Groups

It turns out we can. If we use a group like (?<-word>...), then the last capture is popped from the stack word if the subexpression ... matches. So if we change our previous expression to

(?<word>\w+)\W+(?<-word>\w+)

Then the second group will pop the first group's capture, and we will receive an empty CaptureCollection in the end. Of course, this example is pretty useless.

But there's one more detail to the minus-syntax: if the stack is already empty, the group fails (regardless of its subpattern). We can leverage this behavior to count nesting levels - and this is where the name balancing group comes from (and where it gets interesting). Say we want to match strings that are correctly parenthesized. We push each opening parenthesis on the stack, and pop one capture for each closing parenthesis. If we encounter one closing parenthesis too many, it will try to pop an empty stack and cause the pattern to fail:

^(?:[^()]|(?<Open>[(])|(?<-Open>[)]))*$

So we have three alternatives in a repetition. The first alternative consumes everything that is not a parenthesis. The second alternative matches (s while pushing them onto the stack. The third alternative matches )s while popping elements from the stack (if possible!).

_{Note: Just to clarify, we're only checking that there are no unmatched parentheses! This means that string containing no parentheses at all will match, because they are still syntactically valid (in some syntax where you need your parentheses to match). If you want to ensure at least one set of parentheses, simply add a lookahead (?=.*[(]) right after the ^.}

This pattern is not perfect (or entirely correct) though.

Finale: Conditional Patterns

There is one more catch: this does not ensure that the stack is empty at the end of the string (hence (foo(bar) would be valid). .NET (and many other flavors) have one more construct that helps us out here: conditional patterns. The general syntax is

(?(condition)truePattern|falsePattern)

where the falsePattern is optional - if it is omitted the false-case will always match. The condition can either be a pattern, or the name of a capturing group. I'll focus on the latter case here. If it's the name of a capturing group, then truePattern is used if and only if the capture stack for that particular group is not empty. That is, a conditional pattern like (?(name)yes|no) reads "if name has matched and captured something (that is still on the stack), use pattern yes otherwise use pattern no".

So at the end of our above pattern we could add something like (?(Open)failPattern) which causes the entire pattern to fail, if the Open-stack is not empty. The simplest thing to make the pattern unconditionally fail is (?!) (an empty negative lookahead). So we have our final pattern:

^(?:[^()]|(?<Open>[(])|(?<-Open>[)]))*(?(Open)(?!))$

Note that this conditional syntax has nothing per se to do with balancing groups but it's necessary to harness their full power.

From here, the sky is the limit. Many very sophisticated uses are possible and there are some gotchas when used in combination with other .NET-Regex features like variable-length lookbehinds (which I had to learn the hard way myself). The main question however is always: is your code still maintainable when using these features? You need to document it really well, and be sure that everyone who works on it is also aware of these features. Otherwise you might be better off, just walking the string manually character-by-character and counting nesting levels in an integer.

Addendum: What's with the `(?<A-B>...)` syntax?

Credits for this part go to Kobi (see his answer below for more details).

Now with all of the above, we can validate that a string is correctly parenthesized. But it would be a lot more useful, if we could actually get (nested) captures for all those parentheses' contents. Of course, we could remember opening and closing parentheses in a separate capture stack that is not emptied, and then do some substring extraction based on their positions in a separate step.

But .NET provides one more convenience feature here: if we use (?<A-B>subPattern), not only is a capture popped from stack B, but also everything between that popped capture of B and this current group is pushed onto stack A. So if we use a group like this for the closing parentheses, while popping nesting levels from our stack, we can also push the pair's content onto another stack:

^(?:[^()]|(?<Open>[(])|(?<Content-Open>[)]))*(?(Open)(?!))$

_{Kobi provided this Live-Demo in his answer}

So taking all of these things together we can:

Remember arbitrarily many captures
Validate nested structures
Capture each nesting level

All in a single regular expression. If that's not exciting... ;)

Some resources that I found helpful when I first learned about them:

http://blog.stevenlevithan.com/archives/balancing-groups
MSDN on balancing groups
MSDN on conditional patterns
http://kobikobi.wordpress.com/tag/balancing-group/ (slightly academic, but has some interesting applications)

answered Sep 30 '22 22:09

Martin Ender

Just a small addition to M. Buettner's excellent answer:

What's the deal with the `(?<A-B>)` syntax?

(?<A-B>x) is subtly different from (?<-A>(?x)). They result in the same control flow^*, but they capture differently.
For example, let's look at a pattern for balanced braces:

(?:[^{}]|(?<B>{)|(?<-B>}))+(?(B)(?!))

At the end of the match we do have a balanced string, but that is all we have - we don't know where the braces are because the B stack is empty. The hard work the engine did for us is gone.
_{(example on Regex Storm)}

(?<A-B>x) is the solution for that problem. How? It doesn't capture x into $A: it captures the content between the previous capture of B and the current position.

Let's use it in our pattern:

(?:[^{}]|(?<Open>{)|(?<Content-Open>}))+(?(Open)(?!))

This would capture into $Content the strings between the braces (and their positions), for each pair along the way.
For the string {1 2 {3} {4 5 {6}} 7} there'd be four captures: 3, 6 ,4 5 {6} , and 1 2 {3} {4 5 {6}} 7 - much better than nothing or } } } }.
_{(example - click the table tab and look at ${Content}, captures)}

In fact, it can be used without balancing at all: (?<A>).(.(?<Content-A>).) captures the first two characters, even though they are separated by groups.
(a lookahead is more commonly used here but it doesn't always scale: it may duplicate your logic.)

(?<A-B>) is a strong feature - it gives you exact control over your captures. Keep that in mind when you're trying to get more out of your pattern.

answered Sep 30 '22 22:09

Kobi

Related questions
                            
                                Declaring a variable inside or outside an foreach loop: which is faster/better?
                            
                                When is a static constructor called in C#?
                            
                                Run code once before and after ALL tests in xUnit.net
                            
                                C# 6 how to format double using interpolated string?
                            
                                How to connect to MySQL Database?
                            
                                Visual Studio hot keys change occasionally, specifically F6 vs Ctrl-Shift-B for building. WHY?
                            
                                downcast and upcast
                            
                                Proper way to implement a never ending task. (Timers vs Task)
                            
                                string.Join on a List<int> or other type
                            
                                Structs versus classes
                            
                                What is the difference between lambdas and delegates in the .NET Framework?
                            
                                What is the real overhead of try/catch in C#?
                            
                                How do I find out what directory my console app is running in?
                            
                                How can I reset an EF7 InMemory provider between unit tests?
                            
                                Where is the constant for "HttpRequest.RequestType" and "WebRequest.Method" values in .NET?
                            
                                Add data annotations to a class generated by entity framework
                            
                                How to put a UserControl into Visual Studio toolBox
                            
                                How do I alias a class name in C#, without having to add a line of code to every file that uses the class?
                            
                                Exception when AddWithValue parameter is NULL
                            
                                Replace the last occurrence of a word in a string - C#

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

What are regular expression Balancing Groups?

Tags:

c#

.net

regex

balancing-groups

It'sNotALie.

People also ask

2 Answers

Aside: Repeated Groups

Enter: Balancing Groups

Finale: Conditional Patterns

Addendum: What's with the `(?<A-B>...)` syntax?

Martin Ender

What's the deal with the `(?<A-B>)` syntax?

Kobi

Recent Activity

Donate For Us

What are regular expression Balancing Groups?

Tags:

c#

.net

regex

balancing-groups

It'sNotALie.

People also ask

2 Answers

Aside: Repeated Groups

Enter: Balancing Groups

Finale: Conditional Patterns

Addendum: What's with the (?<A-B>...) syntax?

Martin Ender

What's the deal with the (?<A-B>) syntax?

Kobi

Related questions

Recent Activity

Donate For Us

Addendum: What's with the `(?<A-B>...)` syntax?

What's the deal with the `(?<A-B>)` syntax?