Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to make balancing group capturing?

Let's say I have this text input.

 tes{}tR{R{abc}aD{mnoR{xyz}}}

I want to extract the ff output:

 R{abc}
 R{xyz}
 D{mnoR{xyz}}
 R{R{abc}aD{mnoR{xyz}}}

Currently, I can only extract what's inside the {}groups using balanced group approach as found in msdn. Here's the pattern:

 ^[^{}]*(((?'Open'{)[^{}]*)+((?'Target-Open'})[^{}]*)+)*(?(Open)(?!))$

Does anyone know how to include the R{} and D{} in the output?

like image 550
sessionista Avatar asked Oct 21 '22 00:10

sessionista


1 Answers

I think that a different approach is required here. Once you match the first larger group R{R{abc}aD{mnoR{xyz}}} (see my comment about the possible typo), you won't be able to get the subgroups inside as the regex doesn't allow you to capture the individual R{ ... } groups.

So, there had to be some way to capture and not consume and the obvious way to do that was to use a positive lookahead. From there, you can put the expression you used, albeit with some changes to adapt to the new change in focus, and I came up with:

(?=([A-Z](?:(?:(?'O'{)[^{}]*)+(?:(?'-O'})[^{}]*?)+)+(?(O)(?!))))

[I also renamed the 'Open' to 'O' and removed the named capture for the close brace to make it shorter and avoid noises in the matches]

On regexhero.net (the only free .NET regex tester I know so far), I got the following capture groups:

1: R{R{abc}aD{mnoR{xyz}}}
1: R{abc}
1: D{mnoR{xyz}}
1: R{xyz}

Breakdown of regex:

(?=                         # Opening positive lookahead
    ([A-Z]                  # Opening capture group and any uppercase letter (to match R & D)
        (?:                 # First non-capture group opening
            (?:             # Second non-capture group opening
                (?'O'{)     # Get the named opening brace
                [^{}]*      # Any non-brace
            )+              # Close of second non-capture group and repeat over as many times as necessary
            (?:             # Third non-capture group opening
                (?'-O'})    # Removal of named opening brace when encountered
                [^{}]*?     # Any other non-brace characters in case there are more nested braces
            )+              # Close of third non-capture group and repeat over as many times as necessary
        )+                  # Close of first non-capture group and repeat as many times as necessary for multiple side by side nested braces
        (?(O)(?!))          # Condition to prevent unbalanced braces
    )                       # Close capture group
)                           # Close positive lookahead

The following will not work in C#

I actually wanted to try out how it should be working out on the PCRE engine, since there was the option to have recursive regex and I think it was easier since I'm more familiar with it and which yielded a shorter regex :)

(?=([A-Z]{(?:[^{}]|(?1))+}))

regex101 demo

(?=                    # Opening positive lookahead
    ([A-Z]             # Opening capture group and any uppercase letter (to match R & D)
        {              # Opening brace
            (?:        # Opening non-capture group
                [^{}]  # Matches non braces
            |          # OR
                (?1)   # Recurse first capture group
            )+         # Close non-capture group and repeat as many times as necessary
        }              # Closing brace
    )                  # Close of capture group
)                      # Close of positive lookahead
like image 165
Jerry Avatar answered Nov 01 '22 11:11

Jerry