Java replaceAll with backreferences [duplicate]

Possible Duplicate:
String.replaceAll() anomaly with greedy quantifiers in regex

I was writing code that uses Matcher#replaceAll and found the following result highly confusing:

Pattern.compile("(.*)").matcher("sample").replaceAll("$1abc");

I would expect the output to be sampleabc, but Java returns sampleabcabc.

Does anybody have any ideas why?

Sure, when I anchor the pattern (^(.*)$) the issue goes away. Still, I don't see why replaceAll would do a double replacement like that.

And to add insult to injury, the following code:

Pattern.compile("(.*)").matcher("sample").replaceFirst("$1abc");

works as expected, returning just sampleabc.

Wejn asked Jan 24 '13 23:01


2 Answers

It looks like it's matching the empty string at the end of the input, for some reason. (I can see why it would match; I'm intrigued that it matches once and only once.)

If you change replaceAll("$1abc") to replaceAll("'$1'abc") the result is 'sample'abc''abc.

Note that if you change (.*) to (.+) then it works correctly, because it has to match at least one character.

The diagnosis is confirmed by this code:

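import java.util.regex.Matcher;
import java.util.regex.Pattern;

// List every match that find() reports for (.*) against "sample".
Matcher matcher = Pattern.compile("(.*)").matcher("sample");

while (matcher.find()) {
    // %n is the platform line separator
    System.out.printf("%d to %d%n", matcher.start(), matcher.end());
}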

... which outputs:

0 to 6
6 to 6
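
Those two matches also account for the quoted result above. As a small sketch (the class name StarVersusPlus and the helper quoted are made up for illustration), the following runs the quoted replacement with both (.*) and (.+):

```java
import java.util.regex.Pattern;

public class StarVersusPlus {
    // Hypothetical helper: runs replaceAll on "sample" and wraps group 1
    // in quotes so that an empty match shows up as ''.
    static String quoted(String regex) {
        return Pattern.compile(regex).matcher("sample").replaceAll("'$1'abc");
    }

    public static void main(String[] args) {
        System.out.println(quoted("(.*)")); // 'sample'abc''abc
        System.out.println(quoted("(.+)")); // 'sample'abc
    }
}
```

The empty pair of quotes in the first line is the second, zero-length match at index 6.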
Jon Skeet answered Oct 18 '22 17:10


There are two things going on here that explain why this happens:

  • (.*) will successfully match empty strings.
  • After a match succeeds, the next match is attempted starting at the position where the previous match ended.

So, after the entire string "sample" is matched, another match is attempted just after the e. Even though there are no characters left, the match succeeds and a second replacement occurs.

Additional replacements do not occur because the regex engine always moves forward: after an empty match, the next attempt starts one position further on. The position just after the last character is a valid starting index, so the empty string matches there once, but after that there are no more valid starting positions for the regex engine to attempt a match from.
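
To see both replacements happen one at a time, here is a rough sketch of a replaceAll written by hand with find(), appendReplacement and appendTail. It is an approximation for illustration, not the JDK's actual source, and the names ManualReplaceAll and replaceAllByHand are made up:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ManualReplaceAll {
    // Hypothetical helper: one replacement is appended per successful find().
    static String replaceAllByHand(String regex, String input, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Copies the text between matches, then the expanded replacement.
            m.appendReplacement(out, replacement);
        }
        // Copies whatever follows the last match (nothing, in this case).
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // find() succeeds twice: once for "sample", once for the empty
        // string at index 6, so "abc" is appended twice.
        System.out.println(replaceAllByHand("(.*)", "sample", "$1abc")); // sampleabcabc
    }
}
```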

As an alternative to adding a beginning-of-string anchor to your regex, you can make the regex require one or more characters by changing (.*) to (.+).
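
The two fixes are not quite interchangeable. As a sketch (the class name Fixes and the helper fix are hypothetical), this compares them on "sample" and on an empty input, where they differ:

```java
import java.util.regex.Pattern;

public class Fixes {
    // Hypothetical helper: applies replaceAll with the "$1abc" replacement.
    static String fix(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("$1abc");
    }

    public static void main(String[] args) {
        // Both fixes give a single replacement on non-empty input.
        System.out.println(fix("^(.*)", "sample")); // sampleabc
        System.out.println(fix("(.+)", "sample"));  // sampleabc

        // On empty input the anchored pattern still matches once,
        // while (.+) finds nothing and leaves the input untouched.
        System.out.println(fix("^(.*)", "")); // abc
        System.out.println(fix("(.+)", ""));  // prints an empty line
    }
}
```

Which one is preferable depends on whether an empty input should still get the suffix.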

Andrew Clark answered Oct 18 '22 18:10