Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java Matcher groups: Understanding The difference between "(?:X|Y)" and "(?:X)|(?:Y)"

Can anyone explain:

  1. Why the two patterns used below give different results? (answered below)
  2. Why the 2nd example gives a group count of 1 but says the start and end of group 1 is -1?
 public void testGroups() throws Exception
 {
  String TEST_STRING = "After Yes is group 1 End";
  {
   Pattern p;
   Matcher m;
   String pattern="(?:Yes|No)(.*)End";
   p=Pattern.compile(pattern);
   m=p.matcher(TEST_STRING);
   boolean f=m.find();
   int count=m.groupCount();
   int start=m.start(1);
   int end=m.end(1);

   System.out.println("Pattern=" + pattern + "\t Found=" + f + " Group count=" + count + 
     " Start of group 1=" + start + " End of group 1=" + end );
  }

  {
   Pattern p;
   Matcher m;

   String pattern="(?:Yes)|(?:No)(.*)End";
   p=Pattern.compile(pattern);
   m=p.matcher(TEST_STRING);
   boolean f=m.find();
   int count=m.groupCount();
   int start=m.start(1);
   int end=m.end(1);

   System.out.println("Pattern=" + pattern + "\t Found=" + f + " Group count=" + count + 
     " Start of group 1=" + start + " End of group 1=" + end );
  }

 }

Which gives the following output:

Pattern=(?:Yes|No)(.*)End  Found=true Group count=1 Start of group 1=9 End of group 1=21
Pattern=(?:Yes)|(?:No)(.*)End  Found=true Group count=1 Start of group 1=-1 End of group 1=-1
like image 512
user358795 Avatar asked Jun 04 '10 19:06

user358795


People also ask

What does matcher group do in Java?

Java Matcher group() Method The group method returns the matched input sequence captured by the previous match in the form of the string. This method returns the empty string when the pattern successfully matches the empty string in the input.

How does pattern and matcher work in Java?

Pattern matcher() method in Java with examples The matcher() method of this class accepts an object of the CharSequence class representing the input string and, returns a Matcher object which matches the given string to the regular expression represented by the current (Pattern) object.

What is the difference between matches and find in Java?

Difference between matches() and find() in Java Regex The matches() method returns true If the regular expression matches the whole text. If not, the matches() method returns false. Whereas find() search for the occurrence of the regular expression passes to Pattern.

Does Matcher class interprets the pattern in a string?

Matcher Class − A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like the Pattern class, Matcher defines no public constructors. You obtain a Matcher object by invoking the matcher() method on a Pattern object.


4 Answers

  1. The difference is that in the second pattern "(?:Yes)|(?:No)(.*)End", the concatenation ("X followed by Y" in "XY") has higher precedence than the choice ("Either X or Y" in "X|Y"), like multiplication has higher precedence than addition, so the pattern is equivalent to

    "(?:Yes)|(?:(?:No)(.*)End)"
    

    What you wanted to get is the following pattern:

    "(?:(?:Yes)|(?:No))(.*)End"
    

    This yields the same output as your first pattern.

    In your test, the second pattern has the group 1 at the (empty) range [-1, -1[ because that group did not match (the start -1 is included, the end -1 is excluded, making the half-open interval empty).

  2. A capturing group is a group that may capture input. If it captures, one also says it matches some substring of the input. If the regex contains choices, then not every capturing group may actually capture input, so there may be groups that do not match even if the regex matches.

  3. The group count, as returned by Matcher.groupCount(), is gained purely by counting the grouping brackets of capturing groups, irrespective of whether any of them could match on any given input. Your pattern has exactly one capturing group: (.*). This is group 1. The documentation states:

    (?:X)    X, as a non-capturing group
    

    and explains:

    Groups beginning with (? are either pure, non-capturing groups that do not capture text and do not count towards the group total, or named-capturing group.

    Whether any specific group matches on a given input, is irrelevant for that definition. E.g., in the pattern (Yes)|(No), there are two groups ((Yes) is group 1, (No) is group 2), but only one of them can match for any given input.

  4. The call to Matcher.find() returns true if the regex was matched on some substring. You can determine which groups matched by looking at their start: If it is -1, then the group did not match. In that case, the end is at -1, too. There is no built-in method that tells you how many capturing groups actually matched after a call to find() or match(). You'd have to count these yourself by looking at each group's start.

  5. When it comes to backreferences, also note what the regex tutorial has to say:

    There is a difference between a backreference to a capturing group that matched nothing, and one to a capturing group that did not participate in the match at all.

like image 102
Christian Semrau Avatar answered Sep 27 '22 15:09

Christian Semrau


To summarise,

1) The two patterns give different results because of the precedence rules of the operators.

  • (?:Yes|No)(.*)End matches (Yes or No) followed by .*End
  • (?:Yes)|(?:No)(.*)End matches (Yes) or (No followed by .*End)

2) The second pattern gives a group count of 1 but a start and end of -1 because of the (not necessarily intuitive) meanings of the results returned by the Matcher method calls.

  • Matcher.find() returns true if a match was found. In your case the match was on the (?:Yes) part of the pattern.
  • Matcher.groupCount() returns the number of capturing groups in the pattern regardless of whether the capturing groups actually participated in the match. In your case only the non capturing (?:Yes) part of the pattern participated in the match, but the capturing (.*) group was still part of the pattern so the group count is 1.
  • Matcher.start(n) and Matcher.end(n) return the start and end index of the subsequence matched by the n th capturing group. In your case, although an overall match was found, the (.*) capturing group did not participate in the match and so did not capture a subsequence, hence the -1 results.

3) (Question asked in comment.) In order to determine how many capturing groups actually captured a subsequence, iterate Matcher.start(n) from 0 to Matcher.groupCount() counting the number of non -1 results. (Note that Matcher.start(0) is the capturing group representing the whole pattern, which you may want to exclude for your purposes.)

like image 40
Hwee Avatar answered Sep 28 '22 15:09

Hwee


Due to the precedence of the "|" operator in the pattern, the second pattern is equivalent to:

(?:Yes)|((?:No)(.*)End)

What you want is

(?:(?:Yes)|(?:No))(.*)End
like image 34
jimr Avatar answered Sep 28 '22 15:09

jimr


When using regular expression is it important to remember there there is an implicit AND operator at work. This can be seen from the JavaDoc for java.util.regex.Pattern covering the logical operators:

Logical operators
XY X followed by Y
X|Y Either X or Y
(X) X, as a capturing group

This AND takes precedence over the OR in the second Pattern. The second Pattern is equivalent to
(?:Yes)|(?:(?:No)(.*)End).
In order for it to be equivalent to the first Pattern it must be changed to
(?:(?:Yes)|(?:No))(.*)End

like image 35
Jacob Tomaw Avatar answered Sep 27 '22 15:09

Jacob Tomaw