Let's say we have the following input:
<amy>
(bob)
<carol)
(dean>
We also have the following regex:
<(\w+)>|\((\w+)\)
Now we get two matches (as seen on rubular.com):
<amy> is a match: \1 captures amy, \2 fails
(bob) is a match: \2 captures bob, \1 fails
This regex does most of what we want: it matches only properly paired brackets (no mixing), and it captures the part we're interested in.
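However, it does have a few drawbacks:
- The "main" pattern is repeated in every alternate. It's only \w+ in this case, but generally speaking this can be quite complex.
- Depending on which alternate matches, we must query a different group. It's only \1 or \2 in this case, but generally the "main" part can have capturing groups of its own!
- Supporting more bracket types, such as {...}, [...], etc., means repeating all of this again.
So the question is obvious: how can we do this without repeating the "main" pattern?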
Note: for the most part I'm interested in the java.util.regex flavor, but other flavors are welcome.
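To make the group-selection drawback concrete, here is a minimal java.util.regex sketch that runs the alternation regex over the four sample inputs. The class and method names (AlternationGroups, capturedNames) are made up for illustration; the point is only that the caller has to check which group actually participated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationGroups {
    // The regex from the question: <(\w+)>|\((\w+)\)
    static final Pattern ALTERNATION = Pattern.compile("<(\\w+)>|\\((\\w+)\\)");

    // Collects the captured name of every match, whichever group holds it.
    static List<String> capturedNames(String input) {
        List<String> names = new ArrayList<>();
        Matcher m = ALTERNATION.matcher(input);
        while (m.find()) {
            // Only one alternate matched, so exactly one group is non-null.
            String name = (m.group(1) != null) ? m.group(1) : m.group(2);
            names.add(name);
        }
        return names;
    }

    public static void main(String[] args) {
        String input = "<amy>\n(bob)\n<carol)\n(dean>";
        System.out.println(capturedNames(input)); // prints [amy, bob]
    }
}
```

With only two alternates the null check is tolerable, but it grows with every bracket type and every extra group inside the "main" part.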
There's nothing new in this section; it only illustrates the problem mentioned above with an example.
Let's take the above example to the next step: we now want to match these:
<amy=amy>
(bob=bob)
[carol=carol]
But not these:
<amy=amy) # non-matching bracket
<amy=bob> # left hand side not equal to right hand side
Using the same alternation technique, the following works (as seen on rubular.com):
<((\w+)=\2)>|\(((\w+)=\4)\)|\[((\w+)=\6)\]
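As explained above, the "main" pattern is repeated in every alternate, its backreference has to be renumbered each time (\2, \4, \6), and depending on which alternate matches we must query \1 \2, \3 \4, or \5 \6.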
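As a rough illustration of that bookkeeping in java.util.regex, the sketch below applies the three-way alternation to the sample inputs and scans the inner groups 2, 4 and 6 for the one that participated. The names RenumberedAlternation and nameOf are hypothetical helpers, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RenumberedAlternation {
    // <((\w+)=\2)>|\(((\w+)=\4)\)|\[((\w+)=\6)\]
    static final Pattern P = Pattern.compile(
        "<((\\w+)=\\2)>|\\(((\\w+)=\\4)\\)|\\[((\\w+)=\\6)\\]");

    // Returns the captured name, or null when the whole input does not match.
    static String nameOf(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // The inner name is group 2, 4 or 6 depending on the alternate.
        for (int g = 2; g <= 6; g += 2) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<amy=amy>", "(bob=bob)", "[carol=carol]", "<amy=amy)", "<amy=bob>"
        };
        for (String s : samples) {
            // The last two samples are rejected, so nameOf returns null.
            System.out.println(s + " -> " + nameOf(s));
        }
    }
}
```

Note that the loop bounds are tied to the group numbering, so adding a fourth bracket type means touching both the regex and this lookup.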
You can use a lookahead to "lock in" the group number before doing the real match.
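import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "<amy=amy>(bob=bob)[carol=carol]";
// The lookahead captures the "main" part once as group 1 (the name is group 2);
// the alternation then only has to check the brackets around \1.
Pattern p = Pattern.compile(
    "(?=[<(\\[]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\])");
Matcher m = p.matcher(s);
while (m.find())
{
    System.out.printf("found %s in %s%n", m.group(2), m.group());
}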
output:
found amy in <amy=amy>
found bob in (bob=bob)
found carol in [carol=carol]
It's still ugly as hell, but you don't have to recalculate all the group numbers every time you make a change. For example, to add support for curly brackets, it's just:
"(?=[<(\\[{]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\]|\\{\\1\\})"
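As a quick check of the curly-bracket variant, here is a small self-contained sketch that runs it over matching and non-matching inputs. The class and method names (LookaheadBrackets, names) and the {dave=dave} sample are assumptions added for illustration; the regex itself is the one given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadBrackets {
    // The lookahead fixes group 1 (whole "main" part) and group 2 (the name);
    // the alternation afterwards only pairs up the brackets around \1.
    static final Pattern P = Pattern.compile(
        "(?=[<(\\[{]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\]|\\{\\1\\})");

    // Returns the name of every match; it is always group 2.
    static List<String> names(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(names("<amy=amy>(bob=bob)[carol=carol]{dave=dave}"));
        // prints [amy, bob, carol, dave]
        System.out.println(names("<amy=amy)<amy=bob>"));
        // prints []
    }
}
```

The name is read from group 2 no matter which bracket type matched, which is the whole benefit over the plain alternation.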