
Zero-length matches in Java Regex

Tags: java, regex

My code:

Pattern pattern = Pattern.compile("a?");
Matcher matcher = pattern.matcher("ababa");
while(matcher.find()){
   System.out.println(matcher.start()+"["+matcher.group()+"]"+matcher.end());
}

Output:

0[a]1
1[]1
2[a]3
3[]3
4[a]5
5[]5

What I know:

  • "a?" stands for zero or one occurrence of the character 'a'.

The Java API says:

  • matcher.start() returns the start index of the previous match.
  • matcher.end() returns the offset after the last character matched.
  • matcher.group() returns the input subsequence matched by the previous match. For a matcher m with input sequence s, the expressions m.group() and s.substring(m.start(), m.end()) are equivalent. Some patterns, for example a*, match the empty string; group() returns the empty string when the pattern successfully matches the empty string in the input.

What I want to know:

  1. In which situations does the regex engine encounter zero occurrences of a given character (here, the character 'a')?
  2. In those situations, what values do start(), end() and group() actually return? I have quoted what the Java API says, but I am a little unclear about what happens in a practical case like the one above.
namalfernandolk asked Mar 28 '12 11:03


2 Answers

The ? is a greedy quantifier, so it first tries to match one occurrence before trying zero occurrences. In your string:

  1. It starts at the first character, 'a', and tries to match one occurrence. The 'a' matches, so it returns the first result you see.
  2. It then moves forward and finds a 'b'. The 'b' does not match one occurrence of 'a', so the engine backtracks and tries zero occurrences. The empty string matches, which gives you your second result.
  3. Since no further match is possible at the 'b', it moves past it and starts again at your second 'a'.
  4. And so on; the same two steps repeat. The last result, 5[]5, is an empty match at the end of the input, where no character is left to try.

It is a bit more complicated than that, but that is the main idea: when one occurrence cannot match, the engine then tries zero occurrences.
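One way to see this order of attempts is to run the greedy a? next to its reluctant counterpart a??, which tries zero occurrences first. The following is only a sketch: the class name GreedyVsReluctant and the helper matches are invented for this illustration, and nothing beyond the standard java.util.regex classes is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {

    // Hypothetical helper: collects every match as "start[text]end",
    // the same format the question prints.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.start() + "[" + matcher.group() + "]" + matcher.end());
        }
        return found;
    }

    public static void main(String[] args) {
        // Greedy: one 'a' is tried first, the empty match is the fallback.
        System.out.println(matches("a?", "ababa"));
        // Reluctant: the empty match is tried first, so no 'a' is ever consumed.
        System.out.println(matches("a??", "ababa"));
    }
}
```

The greedy form should reproduce the six results from the question. The reluctant form should give six empty matches, 0[]0 through 5[]5, because the zero-occurrence attempt succeeds at every position before the one-occurrence attempt is ever tried.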

As for start(), end() and group(): they return where the match starts, where it ends, and what was matched. So for the first zero-occurrence match in your string, you get 1, 1 and the empty string. I am not sure this fully answers your question.
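These values can be checked directly. The sketch below walks over every match, confirms that group() agrees with substring(start(), end()) as the API documentation states, and counts the matches where start() equals end(). The class name ZeroLengthValues and the helper countZeroLength are assumptions made for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthValues {

    // Hypothetical helper: returns how many matches are zero-length,
    // i.e. have start() == end() and an empty group().
    static int countZeroLength(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            // Documented equivalence: group() is the substring between start() and end().
            if (!m.group().equals(input.substring(m.start(), m.end()))) {
                throw new IllegalStateException("group() disagrees with substring()");
            }
            if (m.start() == m.end()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // For the question's input, the empty matches are at indices 1, 3 and 5.
        System.out.println(countZeroLength("a?", "ababa")); // prints 3
    }
}
```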

Guillaume Polet answered Oct 18 '22 06:10


Working through a few examples should make the behavior of matcher.find() clear:

The regex engine starts at a position in the string (here, ababa) and checks whether the pattern can match there. If it matches one or more characters, then (as the API says):

matcher.start() returns the starting index, and matcher.end() returns the offset after the last character matched.

If the pattern can match only the empty string at that position, the match still succeeds, and start() and end() return the same index, which reflects a matched length of zero.
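A zero-length match is still a successful match, which is different from find() failing. The sketch below contrasts the two cases: with the pattern a (no quantifier) on an input without any 'a', find() returns false and start() throws IllegalStateException, whereas a? on the same input succeeds with an empty match. The class name NoMatchVsEmptyMatch and the helper firstFind are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NoMatchVsEmptyMatch {

    // Hypothetical helper: describes the result of the first find() call.
    static String firstFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            try {
                m.start(); // no match is available, so this throws
                return "no match, but start() returned";
            } catch (IllegalStateException e) {
                return "no match";
            }
        }
        if (m.start() == m.end()) {
            return "zero-length match at " + m.start();
        }
        return "match from " + m.start() + " to " + m.end();
    }

    public static void main(String[] args) {
        System.out.println(firstFind("a", "bbb"));  // prints: no match
        System.out.println(firstFind("a?", "bbb")); // prints: zero-length match at 0
        System.out.println(firstFind("a?", "abb")); // prints: match from 0 to 1
    }
}
```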

Consider the following examples:

    // Matches either "a" or the empty string ""
    Pattern pattern = Pattern.compile("a?");
    Matcher matcher = pattern.matcher("abaabbbb");
    while (matcher.find()) {
        System.out.println(matcher.start() + "[" + matcher.group() + "]" + matcher.end());
    }

Output:

    0[a]1
    1[]1
    2[a]3
    3[a]4
    4[]4
    5[]5
    6[]6
    7[]7
    8[]8


    // Matches either "aa" or "a"; at least one 'a' is required, so no empty matches
    Pattern pattern = Pattern.compile("aa?");
    Matcher matcher = pattern.matcher("abaabbbb");
    while (matcher.find()) {
        System.out.println(matcher.start() + "[" + matcher.group() + "]" + matcher.end());
    }

Output:

    0[a]1
    2[aa]4
Rohit Bansal answered Oct 18 '22 07:10