After hours of searching I decided to ask this question. Why doesn't the regular expression ^(dog).+?(cat)? work the way I think it should, i.e. capture the first dog and the cat, if there is one? What am I missing here? Sample inputs:
dog, cat
dog, dog, cat
dog, dog, dog
You can make several tokens optional by grouping them together using parentheses, and placing the question mark after the closing parenthesis.
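As a small illustration of an optional group, here is a minimal sketch. The month pattern and the class and method names are made up for this example and do not come from the question.

```java
import java.util.regex.Pattern;

public class OptionalGroupDemo {
    // "ruary" is grouped, so the trailing ? makes all five letters optional together.
    private static final Pattern MONTH = Pattern.compile("Feb(ruary)?");

    // True only if the whole input is "Feb" or "February".
    public static boolean isFebruary(String s) {
        return MONTH.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isFebruary("Feb"));      // true
        System.out.println(isFebruary("February")); // true
        System.out.println(isFebruary("Febru"));    // false
    }
}
```

Without the parentheses, Február? style quantifying would apply only to the last character before the question mark, which is why the grouping matters.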
Non-capturing groups are important constructs within Java Regular Expressions. They create a sub-pattern that functions as a single unit but does not save the matched character sequence. In this tutorial, we'll explore how to use non-capturing groups in Java Regular Expressions.
Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses. For example, the regular expression (dog) creates a single group containing the letters "d", "o", and "g".
The '?' after the '(administratively )' capture group tells the regex engine that the preceding group or character is optional.
So why use a non-capturing group at all? The reason is to save memory, as the regex engine doesn't need to store the group in its buffer. Use a non-capturing group when you need grouping but do not want the group saved among the groups of the match.
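The difference is easy to observe in Java through groupCount(). The following is a minimal sketch; the class and helper names are hypothetical and the dog/cat patterns are chosen only to match the question's data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonCapturingDemo {
    // Number of capturing groups the pattern defines (non-capturing groups are not counted).
    public static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Text captured by group 1 on the first match, or null if there is no match.
    public static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(dog), (cat)"));               // 2
        System.out.println(groupCount("(?:dog), (cat)"));             // 1
        System.out.println(firstGroup("(?:dog), (cat)", "dog, cat")); // cat
    }
}
```

Because (?:dog) is not numbered, the cat group becomes group 1 instead of group 2.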
The regexp ^(dog)(.+(cat))? would require you to read capture group no. 3 instead of 2 to get the optional cat, but works just as well without the char-by-char trickery.
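A minimal sketch of that variant, reading group 3, could look like this. The class and method names are hypothetical; the third input is an extra test string that is not part of the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedOptionalDemo {
    // Group 1 = dog, group 2 = everything up to and including cat, group 3 = cat.
    private static final Pattern P = Pattern.compile("^(dog)(.+(cat))?");

    // Returns "cat" if the optional part matched, otherwise null.
    public static String cat(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        System.out.println(cat("dog, cat"));       // cat
        System.out.println(cat("dog, dog, dog"));  // null
        System.out.println(cat("dog, cat, bird")); // cat
    }
}
```

Since .+ is greedy, the engine backtracks from the end of the string, so with several cats in the input group 3 holds the last one.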
With groups, we divide the whole pattern into several parts and write a sub-pattern for each part; each such part is a group. By convention, a group is written within parentheses. Let's see an example to understand groups.
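As one possible example, a date can be split into three groups. This is only a sketch: the date format and all names are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupPartsDemo {
    // Three groups, one per part of the date: year, month, day.
    private static final Pattern DATE = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");

    // Returns {year, month, day}, or null if the input is not in yyyy-mm-dd form.
    public static String[] parts(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        System.out.println(String.join("/", parts("2024-03-15"))); // 2024/03/15
    }
}
```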
The reason that you do not get the optional cat after a reluctant .+? is that the group is both optional and non-anchored: the engine is not forced to make that match, because it can legally treat the cat as the "tail" of the .+? sequence.
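To see what the engine actually does with the pattern from the question, the following minimal sketch prints the whole match and group 2. The class and method names are hypothetical. With find(), .+? consumes a single character, (cat)? then matches the empty string, and the match ends there.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantDemo {
    // The pattern from the question, without an end anchor.
    private static final Pattern P = Pattern.compile("^(dog).+?(cat)?");

    // The text consumed by the first match, or null if nothing matches.
    public static String wholeMatch(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group() : null;
    }

    // The content of the optional cat group, or null if it did not participate.
    public static String cat(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("dog, cat")); // dog,
        System.out.println(cat("dog, cat"));        // null
    }
}
```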
If you anchor the cat at the end of the string, i.e. use ^(dog).+?(cat)?$, you would get a match, though:
Pattern p = Pattern.compile("^(dog).+?(cat)?$");
for (String s : new String[] {"dog, cat", "dog, dog, cat", "dog, dog, dog"}) {
    Matcher m = p.matcher(s);
    if (m.find()) {
        System.out.println(m.group(1) + " " + m.group(2));
    }
}
This prints (demo 1):
dog cat
dog cat
dog null
Do you happen to know how to deal with it in case there's something after cat?
You can deal with it by constructing a trickier expression that matches anything except cat, like this:
^(dog)(?:[^c]|c[^a]|ca[^t])+(cat)?
Now the cat could happen anywhere in the string without an anchor (demo 2).
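A minimal sketch that exercises this expression on inputs with text after the cat might look as follows. The class name, method name, and the extra inputs ("dog, cat, bird" and "dog, ccat") are assumptions added for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExcludeCatDemo {
    // The middle group consumes characters one step at a time while they do not start "cat".
    private static final Pattern P =
        Pattern.compile("^(dog)(?:[^c]|c[^a]|ca[^t])+(cat)?");

    // Returns "cat" if the optional group matched, otherwise null.
    public static String cat(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(cat("dog, cat, bird")); // cat
        System.out.println(cat("dog, dog, dog"));  // null
        System.out.println(cat("dog, ccat"));      // null
    }
}
```

The last input shows a limitation of the char-by-char approach: the alternative c[^a] consumes "cc" as a pair, so a cat that directly follows another c is skipped.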
Without any particular order, other options to match such patterns are:
With non-capturing groups:
^(?:dog(?:, |$))+(?:cat)?$
Or with capturing groups:
^(dog(?:, |$))+(cat)?$
With lookarounds:
(?<=^|, )dog|cat(?=$|,)
With word boundaries:
(?<=^|, )\b(?:dog|cat)\b(?=$|,)
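The patterns listed above serve two different purposes: the anchored ones validate a whole line, while the lookaround ones find individual tokens. The sketch below tries one of each; the class and method names are hypothetical and "dog, bird" is an extra input not taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ListPatternsDemo {
    // Anchored variant with non-capturing groups: accepts or rejects the whole string.
    private static final Pattern LINE = Pattern.compile("^(?:dog(?:, |$))+(?:cat)?$");

    // Lookaround variant: finds each dog or cat token without consuming the separators.
    private static final Pattern TOKEN = Pattern.compile("(?<=^|, )dog|cat(?=$|,)");

    public static boolean isValid(String input) {
        return LINE.matcher(input).find();
    }

    // Collects every token the lookaround pattern finds, in order.
    public static List<String> tokens(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(isValid("dog, dog, cat")); // true
        System.out.println(isValid("dog, bird"));     // false
        System.out.println(tokens("dog, dog, cat"));  // [dog, dog, cat]
    }
}
```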
If the string could contain only one cat and no dog, then ^(?:dog(?:, |$))*(?:cat)?$ would be an option too.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpression {
    public static void main(String[] args) {
        final String regex = "^(?:dog(?:, |$))*(?:cat)?$";
        final String string = "cat\n"
            + "dog, cat\n"
            + "dog, dog, cat\n"
            + "dog, dog, dog\n"
            + "dog, dog, dog, cat\n"
            + "dog, dog, dog, dog, cat\n"
            + "dog, dog, dog, dog, dog\n"
            + "dog, dog, dog, dog, dog, cat\n"
            + "dog, dog, dog, dog, dog, dog, dog, cat\n"
            + "dog, dog, dog, dog, dog, dog, dog, dog, dog\n";

        // MULTILINE lets ^ and $ match at the start and end of every line.
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);

        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            // The pattern has only non-capturing groups, so groupCount() is 0
            // and this loop prints nothing.
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}
Full match: cat
Full match: dog, cat
Full match: dog, dog, cat
Full match: dog, dog, dog
Full match: dog, dog, dog, cat
Full match: dog, dog, dog, dog, cat
Full match: dog, dog, dog, dog, dog
Full match: dog, dog, dog, dog, dog, cat
Full match: dog, dog, dog, dog, dog, dog, dog, cat
Full match: dog, dog, dog, dog, dog, dog, dog, dog, dog
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.
jex.im visualizes regular expressions.