Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to erase a capture group that has already matched, making it non-participating?

In PCRE2 or any other regex engine supporting forward backreferences, is it possible to change a capture group that matched in a previous iteration of a loop into a non-participating capture group (also known as an unset capture group or non-captured group), causing conditionals that test that group to match with their "false" clause rather than their "true" clause?

For example, take the following PCRE regex:

^(?:(z)?(?(1)aa|a)){2}

When fed the string zaazaa, it matches the whole string, as desired. But when fed zaaaa, I would like it to match zaaa; instead, it matches zaaaa, the whole string. (This is just for illustration. Of course this example could be handled by ^(?:zaa|a){2} but that is beside the point. Practical usage of capture group erasure would tend to be in loops that most often do far more than 2 iterations.)

An alternative way of doing this, which also doesn't work as desired:

^(?:(?:z()|())(?:\1aa|\2a)){2}

Note that both of these work as desired when the loop is "unrolled", because they no longer have to erase a capture that has already been made:

^(?:(z)?(?(1)aa|a))(?:(z)?(?(2)aa|a))
^(?:(?:z()|())(?:\1aa|\2a))(?:(?:z()|())(?:\3aa|\4a))

So instead of being able to use the simplest form of conditional, a more complicated one must be used, which only works in this example because the "true" match of z is non-empty:

^(?:(z?)(?(?!.*$\1)aa|a)){2}

Or just using an emulated conditional:

^(?:(z?)(?:(?!.*$\1)aa|(?=.*$\1)a)){2}

I have scoured all the documentation I can find, and there seems not to even be any mention or explicit description of this behavior (that captures made within a loop persist through iterations of that loop even when they fail to be re-captured).

It's different than what I intuitively expected. The way I would implement it is that evaluating a capture group with 0 repetitions would erase/unset it (so this could happen to any capture group with a *, ?, or {0,N} quantifier), but skipping it due to being in a parallel alternative within the same group in which it gained a capture during a previous iteration would not erase it. Thus, this regex would still match words iff they contain at least one of every vowel:

\b(?:a()|e()|i()|o()|u()|\w)++\1\2\3\4\5\b

But skipping a capture group due to it being inside an unevaluated alternative of a group that is evaluated with nonzero repetitions which is nested within the group in which the capture group took on a value during a previous iteration would erase/unset it, so this regex would be able to either capture or erase group \1 on every iteration of the loop:

^(?:(?=a|(b)).(?(1)_))*$

and would match strings such as aaab_ab_b_aaaab_ab_aab_b_b_aaa. However, the way forward references are actually implemented in existing engines, it matches aaaaab_a_b_a_a_b_b_a_b_b_b_.

I would like to know the answer to this question not merely because it would be useful in constructing regexes, but because I have written my own regex engine, currently ECMAScript-compatible with some optional extensions (including molecular lookahead (?*), i.e. non-atomic lookahead, which as far as I know, no other engine has), and I would like to continue adding features from other engines, including forward/nested backreferences. Not only do I want my implementation of forward backreferences to be compatible with existing implementations, but if there isn't a way of erasing capture groups in other engines, I will probably create a way of doing it in my engine that doesn't conflict with other existing regex features.

To be clear: An answer stating that this is not possible in any mainstream engines will be acceptable, as long as it is backed up by adequate research and/or citing of sources. An answer stating that it is possible would be much easier to state, since it would require only one example.

Some information on what a non-participating capture group is:
http://blog.stevenlevithan.com/archives/npcg-javascript - this is the article that originally introduced me to the idea.
https://www.regular-expressions.info/backref2.html - the first section on this page gives a brief explanation.
In ECMAScript/Javascript regexes, backreferences to NPCGs always match (making a zero-length match). In pretty much every other regex flavor, they fail to match anything.

like image 678
Deadcode Avatar asked Jan 04 '19 23:01

Deadcode


People also ask

What do you put at the start of a group to make it non capturing?

Sometimes you want to use parentheses to group parts of an expression together, but you don't want the group to capture anything from the substring it matches. To do this use (?: and ) to enclose the group.

What is the point of non capturing group?

A non-capturing group lets us use the grouping inside a regular expression without changing the numbers assigned to the back references (explained in the next section). This can be very useful in building large and complex regular expressions.

How do I reference a capture group in regex?

If your regular expression has named capturing groups, then you should use named backreferences to them in the replacement text. The regex (?' name'group) has one group called “name”. You can reference this group with ${name} in the JGsoft applications, Delphi, .

What are regex capture groups?

Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses. For example, the regular expression (dog) creates a single group containing the letters "d" "o" and "g" .


2 Answers

I found this documented in PCRE's man page, under "DIFFERENCES BETWEEN PCRE2 AND PERL":

   12.  There are some differences that are concerned with the settings of
   captured strings when part of  a  pattern  is  repeated.  For  example,
   matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
   unset, but in PCRE2 it is set to "b".

I'm struggling to think of a practical problem that cannot be better solved with an alternative solution, but in the interests of keeping it simple, here goes:

Suppose you have a simple task well-suited to being solved by using forward references; for example, check the input string is a palindrome. This cannot be solved generally with recursion (due to the atomic nature of subroutine calls), and so we bang out the following:

/^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$/

Easy enough. Now suppose we are asked to verify that every line in the input is a palindrome. Let's try to solve this by placing the expression in a repeated group:

\A(?:^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$(?:\n|\z))+\z

Clearly that doesn't work, since the value of \2 persists from the first line to the next. This is similar to the problem you're facing, and so here are a number of ways to overcome it:

1. Enclose the entire subexpression in (?!(?! )):

\A(?:(?!(?!^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$)).+(?:\n|\z))+\z

Very easy, just shove 'em in there and you're essentially good to go. Not a great solution if you want any particular captured values to persist.

2. Branch reset group to reset the value of capture groups:

\A(?|^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$|\n()()|\z)+\z

With this technique, you can reset the value of capture groups from the first (\1 in this case) up to a certain one (\2 here). If you need to keep \1's value but wipe \2, this technique will not work.

3. Introduce a group that captures the remainder of the string from a certain position to help you later identify where you are:

\A(?:^(?:(.)(?=.*(\1(?(2)(?=\2\3\z)\2))([\s\S]*)))*+.?\2$(?:\n|\z))+\z 

The whole rest of the collection of lines is saved in \3, allowing you to reliably check whether you have progressed to the next line (when (?=\2\3\z) is no longer true).

This is one of my favourite techniques because it can be used to solve tasks that seem impossible, such as the ol' matching nested brackets using forward references. With it, you can maintain any other capture information you need. The only downside is that it's horribly inefficient, especially for long subjects.

4. This doesn't really answer the question, but it solves the problem:

\A(?![\s\S]*^(?!(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$))

This is the alternative solution I was talking about. Basically, "re-write the pattern" :) Sometimes it's possible, sometimes it isn't.

like image 108
jaytea Avatar answered Oct 01 '22 12:10

jaytea


With PCRE (and all as I'm aware) it's not possible to unset a capturing group but using subroutine calls since their nature doesn't remember values from the previous recursion, you are able to accomplish the same task:

(?(DEFINE)((z)?(?(2)aa|a)))^(?1){2}

See live demo here

If you are going to implement a behavior into your own regex flavor to unset a capturing group, I'd strongly suggest do not let it happen automatically. Just provide some flags.

like image 20
revo Avatar answered Oct 01 '22 14:10

revo