
Regex: capturing groups within capture groups



Intro

(you can skip to What if... if you get bored with intros)

This question is not specific to VBScript (I just happened to use it in this case): I want to find a solution for regular expressions in general (editors included).

This started when I wanted to create an adaptation of Example 4 where 3 capture groups are used to split data across 3 cells in MS Excel. I needed to capture one entire pattern and then, within it, capture 3 other patterns. However, in the same expression, I also needed to capture another kind of pattern and again capture 3 other patterns within it (yeah I know... but before pointing the nutjob finger, please finish reading).

I first thought of named capturing groups, but then I realized that I should not «mix named and numbered capturing groups», since it «is not recommended because flavors are inconsistent in how the groups are numbered».

Then I looked into VBScript SubMatches and «non-capturing» groups and I got a working solution for a specific case:

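' Assumes regEx (a RegExp object) and Myrange are already set up (not shown).
strPattern = "(?:^([0-9]+);([0-9]+);([0-9]+)$|^.*:([0-9]+)\s.*:([0-9]+).*:([a-zA-Z0-9]+)$)"

' The pattern never changes, so configure the regex once, outside the loop.
With regEx
    .Global = True
    .MultiLine = True
    .IgnoreCase = False
    .Pattern = strPattern
End With

For Each C In Myrange
    strInput = C.Value
    Set rgxMatches = regEx.Execute(strInput)

    ' Execute returns an empty collection when nothing matches,
    ' so the "not matched" case has to be handled outside the For Each.
    If rgxMatches.Count = 0 Then
        C.Offset(0, 1) = "(Not matched)"
    End If

    For Each mtx In rgxMatches
        If mtx.SubMatches(0) <> "" Then
            ' 1st parent pattern matched: capture groups 1 to 3
            C.Offset(0, 1) = mtx.SubMatches(0)
            C.Offset(0, 2) = mtx.SubMatches(1)
            C.Offset(0, 3) = mtx.SubMatches(2)
        ElseIf mtx.SubMatches(3) <> "" Then
            ' 2nd parent pattern matched: capture groups 4 to 6
            C.Offset(0, 1) = mtx.SubMatches(3)
            C.Offset(0, 2) = mtx.SubMatches(4)
            C.Offset(0, 3) = mtx.SubMatches(5)
        End If
    Next
Next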

Here's a demo of the regex in Rubular. For these inputs:

124;12;3
my id1:213 my id2:232 my word:ins4yanrgx
:8587459 :18254182540215 :dcpt
0;1;2

It fills the first 2 cells with numbers and the 3rd with a number or a word. Basically, I used a non-capturing group containing 2 "parent" patterns ("parents" = broad patterns inside which I want to detect other sub-patterns). If the 1st parent pattern has a matching sub-pattern (1st capture group), I place its value and the remaining captured groups of this pattern in the 3 cells. If not, I check whether the 4th capture group (belonging to the 2nd parent pattern) was matched and place it and the remaining sub-patterns in the same 3 cells.
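The same selection logic can be sketched outside VBScript. This is a minimal Python sketch, not part of the original macro: the helper name split_cells is hypothetical, and Python's re module reports a group that did not take part in the match as None rather than as an empty string.

```python
import re

# Same pattern as in the VBScript code above, split over two lines for readability.
PATTERN = re.compile(
    r"(?:^([0-9]+);([0-9]+);([0-9]+)$"
    r"|^.*:([0-9]+)\s.*:([0-9]+).*:([a-zA-Z0-9]+)$)"
)


def split_cells(text):
    """Return the 3 captured values, or None if neither parent pattern matches.

    split_cells is a hypothetical helper name used only for this sketch.
    """
    match = PATTERN.search(text)
    if match is None:
        return None
    # Only one alternative can match, so exactly 3 of the 6 groups are not None.
    return tuple(g for g in match.groups() if g is not None)


for line in [
    "124;12;3",
    "my id1:213 my id2:232 my word:ins4yanrgx",
    ":8587459 :18254182540215 :dcpt",
    "0;1;2",
]:
    print(split_cells(line))
```

Dropping the groups that did not participate removes the need to know whether the values sit in groups 1 to 3 or 4 to 6, but it only works because every group in the matching alternative is mandatory.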

What if...

Instead of having something like this:

(?:^(\d+);(\d+);(\d+)$|^.*:(\d+)\s.*:(\d+).*:(\w+)$|what(ever))

Something like this could be possible:

(#:^(\d+);(\d+);(\d+)$)|(#:^.*:(\d+)\s.*:(\d+).*:(\w+)$)|(#:what(ever))

Where (#:, instead of creating a non-capturing group, would create a "parent" numbered capture group. In this way I could do something similar to Example 4:

C.Offset(0, 1) = regEx.Replace(strInput, "#$1")
C.Offset(0, 2) = regEx.Replace(strInput, "#$2")
C.Offset(0, 3) = regEx.Replace(strInput, "#$3")

It would search the parent patterns in order until it finds a match in a child pattern (the first match would be returned and, ideally, the remaining parents wouldn't be searched).
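For comparison, this behaviour can be emulated today without any new syntax by keeping each "parent" as its own regex and trying them in order. The sketch below is in Python and rests on assumptions: PARENTS and first_parent_match are hypothetical names, and the (#: syntax itself is not implemented, only approximated.

```python
import re

# One compiled regex per "parent"; child group numbers restart at 1 in each.
PARENTS = [
    re.compile(r"^(\d+);(\d+);(\d+)$"),
    re.compile(r"^.*:(\d+)\s.*:(\d+).*:(\w+)$"),
    re.compile(r"what(ever)"),
]


def first_parent_match(text):
    """Return (parent_number, match) for the first parent that matches, else None."""
    for number, parent in enumerate(PARENTS, start=1):
        match = parent.search(text)
        if match:
            return number, match  # the remaining parents are never tried
    return None


result = first_parent_match("my id1:213 my id2:232 my word:ins4yanrgx")
if result:
    number, match = result
    # Roughly what "#$1", "#$2" and "#$3" would give for this input.
    print(number, match.group(1), match.group(2), match.group(3))
```

With this layout, an address such as #2$3 corresponds to checking that the returned parent number is 2 and then reading match.group(3). Note that the third parent has only one child group, so asking it for group 2 or 3 would raise an IndexError.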

Is there something like this already? Or am I missing something in regex that already allows this?

Other possible variations:

  • refer to the parent and child pattern directly, e.g.: #2$3 (this would be equivalent to $6 in my example);
  • create as many capturing groups as necessary within others (I guess it would be more complex, but also the most interesting part), e.g.: with a regex (same syntax) like (#:^_(?:(#:(\d+):\w+-(\d))|(#:\w+:(\d+)-(\d+)))_$)|(#:^\w+:\s+(#:(\w+);\d-(\d+))$) and fetching ##$1 from inputs like:

    _123:smt-4_ it would match: 123
    _ott:432-10_ it would match: 432
    yant: special;3-45235 it would match: special
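The three expected results above can be checked with existing syntax by writing every hypothetical (#: as a plain non-capturing (?: and taking the first capturing group that actually took part in the match. This is only a rough Python approximation of what ##$1 is meant to return; NESTED and first_child are hypothetical names.

```python
import re

# The nested example with each hypothetical "(#:" written as "(?:".
NESTED = re.compile(
    r"(?:^_(?:(?:(\d+):\w+-(\d))|(?:\w+:(\d+)-(\d+)))_$)"
    r"|(?:^\w+:\s+(?:(\w+);\d-(\d+))$)"
)


def first_child(text):
    """Return the first capturing group that took part in the match, or None."""
    match = NESTED.search(text)
    if match is None:
        return None
    return next((g for g in match.groups() if g is not None), None)


for line in ["_123:smt-4_", "_ott:432-10_", "yant: special;3-45235"]:
    print(first_child(line))
```

The limitation is the one the question describes: the groups are still numbered 1 to 6 across the whole expression, so the caller has to search for the participating group instead of addressing it by a fixed number.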

Please tell me if you notice any mistakes or flaws in this logic; I will edit ASAP.