
Capturing <thisPartOnly> and (thisPartOnly) with the same group

Let's say we have the following input:

<amy>
(bob)
<carol)
(dean>

We also have the following regex:

<(\w+)>|\((\w+)\)

Now we get two matches (as seen on rubular.com):

  • <amy> is a match, \1 captures amy, \2 fails
  • (bob) is a match, \2 captures bob, \1 fails
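In java.util.regex terms, the two-group behavior described above can be sketched roughly as follows. The class and method names (`AlternationDemo`, `captures`) are made up for illustration; only the regex comes from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself is taken from the question.
public class AlternationDemo {
    // One capturing group per bracket style.
    static final Pattern P = Pattern.compile("<(\\w+)>|\\((\\w+)\\)");

    // Collects the captured name from every match in the input.
    static List<String> captures(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            // Only one group participates in a given match; the other is null,
            // so the caller has to probe both.
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // <carol) and (dean> are skipped because their brackets do not match.
        System.out.println(captures("<amy>\n(bob)\n<carol)\n(dean>")); // prints [amy, bob]
    }
}
```

The `group(1) != null ? ... : ...` probe is exactly the per-alternate bookkeeping that the rest of the question is trying to get rid of.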

This regex does most of what we want:

  • It matches the open and close brackets properly (i.e. no mixing)
  • It captures the part we're interested in

However, it does have a few drawbacks:

  • The capturing pattern (i.e. the "main" part) is repeated
    • It's only \w+ in this case, but in general it can be quite complex:
      • If it involves backreferences, then they must be renumbered for each alternate!
      • Repetition makes maintenance a nightmare! (what if it changes?)
  • The groups are essentially duplicated
    • Depending on which alternate matches, we must query different groups
      • It's only \1 or \2 in this case, but in general the "main" part can have capturing groups of its own!
    • Not only is this inconvenient, but there may be situations where this is not feasible (e.g. when we're using a custom regex framework that is limited to querying only one group)
  • The situation quickly worsens if we also want to match {...}, [...], etc.

So the question is obvious: how can we do this without repeating the "main" pattern?

Note: I'm mostly interested in the java.util.regex flavor, but answers for other flavors are welcome.


Appendix

There's nothing new in this section; it only illustrates the problem mentioned above with an example.

Let's take the above example to the next step: we now want to match these:

<amy=amy>
(bob=bob)
[carol=carol]

But not these:

<amy=amy)   # non-matching bracket
<amy=bob>   # left hand side not equal to right hand side

Using the alternation technique, the following regex works (as seen on rubular.com):

<((\w+)=\2)>|\(((\w+)=\4)\)|\[((\w+)=\6)\]

As explained above:

  • The main pattern can't simply be repeated; backreferences must be renumbered
  • Repetition also means a maintenance nightmare if it ever changes
  • Depending on which alternate matches, we must query either \1 \2, \3 \4, or \5 \6
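To make the last point concrete, here is a small Java sketch of what a caller has to do with the three-alternate regex above. The class and method names (`RenumberedGroupsDemo`, `name`) are hypothetical; the pattern is the one from this appendix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the three-alternate regex from the appendix.
public class RenumberedGroupsDemo {
    static final Pattern P = Pattern.compile(
        "<((\\w+)=\\2)>|\\(((\\w+)=\\4)\\)|\\[((\\w+)=\\6)\\]");

    // Returns the name on the left-hand side, or null if the input does not match.
    static String name(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // The name lives in group 2, 4, or 6 depending on which alternate matched.
        for (int g = 2; g <= 6; g += 2) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] inputs = {"<amy=amy>", "(bob=bob)", "[carol=carol]", "<amy=amy)", "<amy=bob>"};
        for (String s : inputs) {
            System.out.println(s + " -> " + name(s));
        }
    }
}
```

Every new bracket style adds another alternate, another pair of group numbers, and another iteration of that loop.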
polygenelubricants asked Dec 23 '22 01:12


1 Answer

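You can use a lookahead to "lock in" the group numbers before doing the real match: the lookahead captures the main pattern once, as group 1, and the alternation that follows only has to check that a matching pair of brackets surrounds a backreference to \1.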

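import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
  public static void main(String[] args) {
    String s = "<amy=amy>(bob=bob)[carol=carol]";
    // Group 1 is the whole main pattern and group 2 the name; both are captured once, inside the lookahead.
    Pattern p = Pattern.compile(
      "(?=[<(\\[]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\])");
    Matcher m = p.matcher(s);

    while (m.find())
    {
      System.out.printf("found %s in %s%n", m.group(2), m.group());
    }
  }
}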

output:

found amy in <amy=amy>
found bob in (bob=bob)
found carol in [carol=carol]
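As a quick sanity check, the same pattern can be tried against the inputs the question wants rejected. This is only a sketch: the class and method names (`LookaheadRejects`, `name`) are invented for illustration, and the pattern is the one shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the lookahead regex from this answer.
public class LookaheadRejects {
    static final Pattern P = Pattern.compile(
        "(?=[<(\\[]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\])");

    // Returns the name, or null if the whole input is not a valid bracketed pair.
    static String name(String input) {
        Matcher m = P.matcher(input);
        // The name is always group 2, whichever bracket alternate matched.
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String[] inputs = {"<amy=amy>", "(bob=bob)", "[carol=carol]", "<amy=amy)", "<amy=bob>"};
        for (String s : inputs) {
            System.out.println(s + " -> " + name(s));
        }
    }
}
```

Mixed brackets fail because no alternate pairs `<` with `)`, and `<amy=bob>` fails inside the lookahead because `\2` cannot match `bob`.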

It's still ugly as hell, but you don't have to recalculate all the group numbers every time you make a change. For example, to add support for curly brackets, it's just:

"(?=[<(\\[{]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\]|\\{\\1\\})"
Alan Moore answered Dec 28 '22 11:12