
can't understand String.replaceAll non-greedy behavior [duplicate]

Tags: java, regex

Possible Duplicate:
Java regex anomaly?

Any idea why the following test fails (it returns "xx" instead of "x")?

@Test 
public void testReplaceAll(){
    assertEquals("x", "xyz".replaceAll(".*", "x"));
}

I don't want to use "^.*$" as a workaround; I want to understand this behavior. Any clues?

asked Dec 29 '11 19:12 by ekeren



1 Answer

Yes, it is exactly the same as described in this question!

.* will first match the whole input, but then also an empty string at the end of the input. Each of those two matches is replaced with "x", which is why you get "xx".

Let's symbolize the regex engine's current position with | and the input with <...> in your example.

  • input: <xyz>;
  • regex engine, before first run: <|xyz>;
  • regex engine, after first run: <xyz|> (matched text: "xyz");
  • regex engine, after second run: <xyz>| (matched text: "").
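The two runs listed above can be observed directly with java.util.regex.Matcher. The following is only a minimal sketch: the class name GreedyMatchTrace and the helper trace are made up for illustration, and each match is recorded as "start-end:text".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyMatchTrace {

    // Hypothetical helper: collects every match of `regex` in `input`
    // as "start-end:matchedText", in the order find() reports them.
    static List<String> trace(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // First run consumes "xyz" (0-3); second run matches "" at index 3.
        System.out.println(trace(".*", "xyz"));          // prints [0-3:xyz, 3-3:]
        // Both matches are replaced, hence two x's.
        System.out.println("xyz".replaceAll(".*", "x")); // prints xx
    }
}
```

The second entry, 3-3 with empty text, is the extra match that produces the second "x".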

Not all regex engines behave this way. Java does, however, and so does Perl. As a counterexample, sed will position its cursor after the end of the input in step 3 (after the first run), so it never sees the empty match.

Now, you also have to understand one crucial thing: when a regex engine encounters a zero-length match, it advances one character before searching again. Otherwise, consider what would happen if you attempted to replace '^' with 'a': '^' matches a position, so it is a zero-length match. If the engine didn't advance one character, "x" would be replaced with "ax", which would be replaced with "aax", and so on. So, after the second match, which is empty, Java's regex engine advances one "character", of which there aren't any: end of processing.
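To see the advance-after-empty-match rule in action, here is a small sketch using only String.replaceAll. The class and method names (ZeroLengthAdvance, anchorInsert, everyPosition, nonEmptyOnly) are hypothetical, and the ".+" variant is included purely as a contrast, not as something the question asked for.

```java
class ZeroLengthAdvance {

    // "^" is a zero-length match at index 0. The engine replaces it once,
    // then moves on, so the result is "ax" rather than an endless "aaa...x".
    static String anchorInsert() {
        return "x".replaceAll("^", "a");
    }

    // The empty pattern matches at every position (0, 1, 2 and 3),
    // advancing one character after each zero-length match.
    static String everyPosition() {
        return "xyz".replaceAll("", "-");
    }

    // ".+" needs at least one character, so it cannot match the empty
    // string left at the end of the input: only one replacement happens.
    static String nonEmptyOnly() {
        return "xyz".replaceAll(".+", "x");
    }

    public static void main(String[] args) {
        System.out.println(anchorInsert());  // prints ax
        System.out.println(everyPosition()); // prints -x-y-z-
        System.out.println(nonEmptyOnly());  // prints x
    }
}
```

The last case shows that the extra "x" comes only from the empty match at the end of the input: a pattern that cannot match the empty string yields a single replacement.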

answered Sep 29 '22 23:09 by fge
