Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Capturing Quantifiers and Quantifier Arithmetic

At the outset, let me explain that this question is neither about how to capture groups, nor about how to use quantifiers, two features of regex I am perfectly familiar with. It is more of an advanced question for regex lovers who may be familiar with unusual syntax in exotic engines.

Capturing Quantifiers

Does anyone know if a regex flavor allows you to capture quantifiers? By this, I mean that the number of characters matched by quantifiers such as + and * would be counted, and that this number could be used again in another quantifier.

For instance, suppose you wanted to make sure you have the same number of Ls and Rs in this kind of string: LLLRRRRR

You could imagine a syntax such as

L(+)R{\q1}

where the + quantifier for the L is captured, and where the captured number is referred to in the quantifier for the R as {\q1}

This would be useful to balance the number of {@,=,-,/} in strings such as @@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"

Relation to Recursion

In some cases quantifier capture would elegantly replace recursion, for instance a piece of text framed by the same number of Ls and Rs, a in

L(+) some_content R{\q1} 

The idea is presented in some details on the following page: Captured Quantifiers

It also discusses a natural extension to captured quantifers: quantifier arithmetic, for occasions when you want to match (3*x + 1) the number of characters matched earlier.

I am trying to find out if anything like this exists.

Thanks in advance for your insights!!!

Update

Casimir gave a fantastic answer that shows two methods to validate that various parts of a pattern have the same length. However, I wouldn't want to rely on either of those for everyday work. These are really tricks that demonstrate great showmanship. In my mind, these beautiful but complex methods confirm the premise of the question: a regex feature to capture the number of characters that quantifers (such as + or *) are able to match would make such balancing patterns very simple and extend the syntax in a pleasingly expressive way.

Update 2 (much later)

I found out that .NET has a feature that comes close to what I was asking about. Added an answer to demonstrate the feature.

like image 669
zx81 Avatar asked Apr 10 '14 23:04

zx81


1 Answers

I don't know a regex engine that can capture a quantifier. However, it is possible with PCRE or Perl to use some tricks to check if you have the same number of characters. With your example:

@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"

you can check if @ = - / are balanced with this pattern that uses the famous Qtax trick, (are you ready?): the "possessive-optional self-referencing group"
~(?<!@)((?:@(?=[^=]*(\2?+=)[^-]*(\3?+-)[^/]*(\4?+/)))+)(?!@)(?=[^=]*\2(?!=)[^-]*\3(?!-)[^/]*\4(?!/))~

pattern details:

~                          # pattern delimiter
(?<!@)                     # negative lookbehind used as an @ boundary
(                          # first capturing group for the @
    (?:
        @                  # one @
        (?=                # checks that each @ is followed by the same number
                           # of = - /  
            [^=]*          # all that is not an =
            (\2?+=)        # The possessive optional self-referencing group:
                           # capture group 2: backreference to itself + one = 
            [^-]*(\3?+-)   # the same for -
            [^/]*(\4?+/)   # the same for /
        )                  # close the lookahead
    )+                     # close the non-capturing group and repeat
)                          # close the first capturing group
(?!@)                      # negative lookahead used as an @ boundary too.

# this checks the boundaries for all groups
(?=[^=]*\2(?!=)[^-]*\3(?!-)[^/]*\4(?!/))
~

The main idea

The non-capturing group contains only one @. Each time this group is repeated a new character is added in capture groups 2, 3 and 4.

the possessive-optional self-referencing group

How does it work?

( (?: @ (?= [^=]* (\2?+ = ) .....) )+ )

At the first occurence of the @ character the capture group 2 is not yet defined, so you can not write something like that (\2 =) that will make the pattern fail. To avoid the problem, the way is to make the backreference optional: \2?

The second aspect of this group is that the number of character = matched is incremented at each repetition of the non capturing group, since an = is added each time. To ensure that this number always increases (or the pattern fails), the possessive quantifier forces the backreference to be matched first before adding a new = character.

Note that this group can be seen like that: if group 2 exists then match it with the next =

( (?(2)\2) = )

The recursive way

~(?<!@)(?=(@(?>[^@=]+|(?-1))*=)(?!=))(?=(@(?>[^@-]+|(?-1))*-)(?!-))(?=(@(?>[^@/]+|(?-1))*/)(?!/))~

You need to use overlapped matches, since you will use the @ part several times, it is the reason why all the pattern is inside lookarounds.

pattern details:

(?<!@)                # left @ boundary
(?=                   # open a lookahead (to allow overlapped matches)
    (                 # open a capturing group
        @
        (?>           # open an atomic group
            [^@=]+    # all that is not an @ or an =, one or more times
          |           # OR
            (?-1)     # recursion: the last defined capturing group (the current here)
        )*            # repeat zero or more the atomic group
        =             #
    )                 # close the capture group
    (?!=)             # checks the = boundary
)                     # close the lookahead
(?=(@(?>[^@-]+|(?-1))*-)(?!-))  # the same for -
(?=(@(?>[^@/]+|(?-1))*/)(?!/))  # the same for /

The main difference with the precedent pattern is that this one doesn't care about the order of = - and / groups. (However you can easily make some changes to the first pattern to deal with that, with character classes and negative lookaheads.)

Note: For the example string, to be more specific, you can replace the negative lookbehind with an anchor (^ or \A). And if you want to obtain the whole string as match result you must add .* at the end (otherwise the match result will be empty as playful notices it.)

like image 153
Casimir et Hippolyte Avatar answered Oct 09 '22 16:10

Casimir et Hippolyte