Strange issue with `(.*)*`, `(.*)+`, `(.+)*` in Java regex

Tags: java, regex

To reproduce the problem described in a recent question - Why does (.*)* make two matches and select nothing in group $1? - I tried various combinations of * and +, inside and outside the parentheses, and the results were not what I expected.

I expected the same output as the one explained in the accepted answer to that question, and in a duplicate question tagged under Perl - Why doesn't the .* consume the entire string in this Perl regex? - but my code does not behave the same way.

To keep it simple, here's the code I tried:

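import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "input";
        String[] patterns = { "(.*)*", "(.*)+", "(.+)*", "(.+)+" };

        for (String pattern : patterns) {
            Matcher matcher = Pattern.compile(pattern).matcher(str);

            // Print group 1 and the start offset of every match
            while (matcher.find()) {
                System.out.print("'" + matcher.group(1) + "' : '" + matcher.start() + "'" + "\t");
            }

            System.out.println();
        }
    }
}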

And this is the output I got for the 4 combinations:

'' : '0'    '' : '5'            // For `(.*)*`
'' : '0'    '' : '5'            // For `(.*)+`  
'input' : '0'   'null' : '5'    // For `(.+)*`
'input' : '0'                   // For `(.+)+`

What I can't understand is why, in the 1st and 2nd outputs, I am not getting the entire string as the first result of matcher.find(). Ideally, in the 1st case, .* should first capture the entire string, and then also capture the empty string at the end. The 2nd match gives the expected result, but the 1st match does not.

Also, in the 2nd case, I should not even get the 2nd match, because I have a + quantifier outside the parentheses.

My expected output is:

'input' : '0'   '' : '5'  // For 1st
'input' : '0'    // For 2nd

Also, in the 3rd output, why did I get null as the 2nd match instead of the empty string? Shouldn't the 2nd match be the same for the first 3 combinations?

The 4th output is as expected, so no doubt there.

Rohit Jain asked Jan 24 '13 11:01


1 Answer

You're seeing the effect of the same phenomenon you see in the question you linked to:

For (.*)*:

  • The first matcher.start() is 0 because that's where the match ("input") starts.
  • The first matcher.group(1) is "" because the second repetition of (.*) has overwritten the captured "input" with the empty string (but matcher.group(0) does contain "input").
  • The second matcher.start() is 5 because that's where the regex engine is after the first successful match.
  • The second matcher.group(1) (as well as matcher.group(0)) is "" because that's all there was to match at the end of the string.
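
The four points above can be checked by printing group(0) next to group(1) for each match. This is only a minimal sketch: the class name GroupOverwriteDemo and the helper describe are made up for illustration, and the offsets in the comments are what java.util.regex is expected to report for the input "input".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupOverwriteDemo {
    // Hypothetical helper: one line per match, showing the whole match (group 0),
    // the last capture of group 1, and where each of them starts.
    static List<String> describe(String regex, String input) {
        List<String> lines = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            lines.add("group(0)='" + m.group(0) + "' at " + m.start()
                    + ", group(1)='" + m.group(1) + "' at " + m.start(1));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Expected lines:
        //   group(0)='input' at 0, group(1)='' at 5
        //   group(0)='' at 5, group(1)='' at 5
        for (String line : describe("(.*)*", "input")) {
            System.out.println(line);
        }
    }
}
```

In the first line, group(1) starts at 5 even though the match itself starts at 0: the group holds only its last repetition, which is the empty string at the end of the input.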

For (.*)+ it's the same. After all, the empty string can be repeated as many times as you want and still be the empty string, so at position 5 the + is satisfied by a single repetition of (.*) that matches the empty string.

For (.+)* you get null because while the second match succeeds (zero repetitions of (.+) match the empty string), the capturing parentheses haven't been able to capture anything, so the group's contents are null (as in undefined, instead of the empty string).
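
To tell a group that never participated in the match apart from one that captured the empty string, it may help to test group(1) for null before using it. A rough sketch follows; the class name UnsetGroupDemo and the helpers groupState and states are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnsetGroupDemo {
    // Hypothetical helper: classifies group 1 of the current match.
    // A group that did not participate returns null from group(1).
    static String groupState(Matcher m) {
        String g = m.group(1);
        if (g == null) {
            return "unset";
        }
        return g.isEmpty() ? "empty" : "'" + g + "'";
    }

    // Collects the state of group 1 for every match of regex in input.
    static List<String> states(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(groupState(m));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(states("(.+)*", "input")); // ['input', unset]
        System.out.println(states("(.*)*", "input")); // [empty, empty]
    }
}
```

With (.+)* the second match uses zero repetitions, so the group is unset; with (.*)* the group always runs at least once and ends up holding the empty string.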

Tim Pietzcker answered Sep 19 '22 15:09