
String.replaceAll(regex) makes the same replacement twice

Tags:

java

regex

Can anyone tell me why

System.out.println("test".replaceAll(".*", "a")); 

Results in

aa 

Note that the following has the same result:

System.out.println("test".replaceAll(".*$", "a")); 

I have tested this on Java 6 and 7, and both behave the same way. Am I missing something, or is this a bug in the Java regex engine?

asked Dec 22 '11 13:12 by nablex




1 Answer

This is not an anomaly: .* can match anything.

You ask to replace all occurrences:

  • the first occurrence does match the whole string, the regex engine therefore starts from the end of input for the next match;
  • but .* also matches an empty string! It therefore matches an empty string at the end of the input, and replaces it with a.
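The two matches described above can be made visible by driving the matcher by hand. This is only a minimal sketch: the class name MatchTrace and the helper spans are made up for illustration, and the "start-end" output format is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchTrace {
    // Hypothetical helper: collects one "start-end" entry per match reported by find().
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // Two matches: the whole string, then the empty string at the end of input.
        System.out.println(spans(".*", "test")); // [0-4, 4-4]
        // One match only: .+ cannot match the empty string.
        System.out.println(spans(".+", "test")); // [0-4]
    }
}
```

The second entry, 4-4, is the zero-length match at the end of the input that produces the extra a.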

Using .+ instead will not exhibit this problem since this regex cannot match an empty string (it requires at least one character to match).

Or, use .replaceFirst() to only replace the first occurrence:

"test".replaceFirst(".*", "a")
       ^^^^^^^^^^^^
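For comparison, here is a small sketch that runs the original call next to the two workarounds. The class name ReplaceFixes and its three helper methods are hypothetical names introduced only for this example.

```java
class ReplaceFixes {
    // The call from the question: replaces the full match and the trailing empty match.
    static String withStar(String s) {
        return s.replaceAll(".*", "a");
    }

    // Workaround 1: .+ needs at least one character, so there is no empty match.
    static String withPlus(String s) {
        return s.replaceAll(".+", "a");
    }

    // Workaround 2: replaceFirst stops after the first match.
    static String firstOnly(String s) {
        return s.replaceFirst(".*", "a");
    }

    public static void main(String[] args) {
        System.out.println(withStar("test"));  // aa
        System.out.println(withPlus("test"));  // a
        System.out.println(firstOnly("test")); // a
    }
}
```

The two workarounds are not interchangeable on an empty input: withPlus("") returns the empty string unchanged, while firstOnly("") returns a, because .* still matches the empty string once.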

Now, why .* behaves like it does and does not match more than twice (it theoretically could) is an interesting thing to consider. See below:

# Before first run
regex: |.*
input: |whatever
# After first run
regex: .*|
input: whatever|
# Before second run
regex: |.*
input: whatever|
# After second run: since .* can match an empty string, it is satisfied...
regex: .*|
input: whatever|
# However, this means the regex engine matched an empty input.
# Java's regex engine, in this situation, will shift
# one character further in the input.
# So, before third run, the situation is:
regex: |.*
input: whatever<|ExhaustionOfInput>
# Nothing can ever match here: out
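The runs in this trace can be reproduced with a hand-written replacement loop built on Matcher.find(), appendReplacement() and appendTail(). This is a sketch of the idea, not the actual source of String.replaceAll; the class name ManualReplaceAll and the method replaceAllTraced are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ManualReplaceAll {
    // Hypothetical helper: replaces every match and prints one line per successful run.
    static String replaceAllTraced(String regex, String input, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();
        int run = 0;
        while (m.find()) {
            run++;
            System.out.println("run " + run + ": matched \"" + m.group()
                    + "\" at " + m.start() + ".." + m.end());
            // quoteReplacement keeps $ and \ in the replacement literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        // The next find() returned false: the input is exhausted.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints:
        // run 1: matched "whatever" at 0..8
        // run 2: matched "" at 8..8
        // aa
        System.out.println(replaceAllTraced(".*", "whatever", "a"));
    }
}
```

There is no third run: after the empty match at position 8, the next search would have to start past the end of the input, so find() returns false and the loop ends.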

Note that, as @A.H. notes in the comments, not all regex engines behave this way. GNU sed for instance will consider that it has exhausted the input after the first match.

answered Oct 07 '22 02:10 by fge