
Why does String.replaceAll() in Java require 4 backslashes "\\\\" in the regex to actually replace "\"?

I recently noticed that String.replaceAll(regex, replacement) behaves very weirdly when it comes to the escape character "\" (backslash).

For example, consider a string holding a file path, String text = "E:\\dummypath", in which we want to replace the backslash (written "\\" in a Java string literal) with "/".

text.replace("\\","/") gives the output "E:/dummypath" whereas text.replaceAll("\\","/") raises the exception java.util.regex.PatternSyntaxException.

If we want to implement the same functionality with replaceAll(), we need to write it as text.replaceAll("\\\\","/").

One notable difference is that replaceAll() takes a regular expression as its first argument, whereas replace() takes a plain character sequence!

But text.replaceAll("\n","/") works exactly the same as its char-sequence equivalent text.replace("\n","/")
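
The observations so far can be reproduced with a small sketch like the one below. The class name ReplaceVsReplaceAll and its helper methods are made up for illustration; only String.replace(), String.replaceAll() and PatternSyntaxException come from the JDK.

```java
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, not part of any library.
final class ReplaceVsReplaceAll {
    // replace() treats its first argument as literal text:
    // the literal "\\" is a single backslash character.
    static String withReplace(String text) {
        return text.replace("\\", "/");
    }

    // replaceAll() compiles its first argument as a regex, so the
    // backslash has to be escaped again: "\\\\" holds two characters.
    static String withReplaceAll(String text) {
        return text.replaceAll("\\\\", "/");
    }

    // Returns true if a lone backslash is rejected as a regex.
    static boolean loneBackslashFails(String text) {
        try {
            text.replaceAll("\\", "/");
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String text = "E:\\dummypath"; // actual contents: E:\dummypath
        System.out.println(withReplace(text));        // E:/dummypath
        System.out.println(withReplaceAll(text));     // E:/dummypath
        System.out.println(loneBackslashFails(text)); // true
    }
}
```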

Digging deeper: even weirder behavior can be observed when we try some other inputs.

Let's assign text="Hello\nWorld\n"

Now text.replaceAll("\n","/"), text.replaceAll("\\n","/") and text.replaceAll("\\\n","/") all three give the same output: Hello/World/

I feel Java has really messed up regex handling here! No other language seems to show this kind of playful behavior in its regexes. Is there a specific reason why Java works like this?

Bharath asked Sep 18 '13 15:09



3 Answers

You need to escape twice: once for the Java string literal, and once for the regex.

The Java source code

"\\\\" 

produces a regex string of

\\ - two chars 

and because the regex needs an escape too, that regex matches

\ - one symbol 
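
The two layers can be checked directly, as in this minimal sketch. The class EscapeLayers and its method names are hypothetical; Pattern.matches() is the standard JDK call.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
final class EscapeLayers {
    // Number of characters the regex engine actually receives
    // when the source code says "\\\\".
    static int regexLength() {
        return "\\\\".length();
    }

    // True if that two-character regex matches an input consisting
    // of exactly one backslash.
    static boolean matchesSingleBackslash() {
        return Pattern.matches("\\\\", "\\");
    }

    // True if the same regex matches an input of two backslashes
    // (it should not: the regex stands for a single backslash).
    static boolean matchesDoubleBackslash() {
        return Pattern.matches("\\\\", "\\\\");
    }

    public static void main(String[] args) {
        System.out.println(regexLength());            // 2
        System.out.println(matchesSingleBackslash()); // true
        System.out.println(matchesDoubleBackslash()); // false
    }
}
```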
Peter Lawrey answered Sep 21 '22 16:09


@Peter Lawrey's answer describes the mechanics. The "problem" is that backslash is an escape character both in Java string literals and in the mini-language of regexes. So when you use a string literal to represent a regex, there are two sets of escaping to consider ... depending on what you want the regex to mean.

But why is it like that?

It is a historical thing. Java originally didn't have regexes at all. The syntax rules for Java String literals were borrowed from C / C++, which also didn't have built-in regex support. The awkwardness of double escaping didn't become apparent in Java until regex support was added in the form of the Pattern class ... in Java 1.4.

So how do other languages manage to avoid this?

They do it by providing direct or indirect syntactic support for regexes in the programming language itself. For instance, in Perl, Ruby, JavaScript and many other languages, there is a syntax for patterns / regexes (e.g. '/pattern/') where string literal escaping rules do not apply. C# and Python provide an alternative "raw" string literal syntax in which backslashes are not escapes. (But note that if you use the normal C# / Python string syntax, you have the Java problem of double escaping.)


Why do text.replaceAll("\n","/"), text.replaceAll("\\n","/"), and text.replaceAll("\\\n","/") all give the same output?

The first case is a newline character at the String level. The Java regex language treats all non-special characters as matching themselves.

The second case is a backslash followed by an "n" at the String level. The Java regex language interprets a backslash followed by an "n" as a newline.

The final case is a backslash followed by a newline character at the String level. The Java regex language doesn't recognize this as a specific (regex) escape sequence. However, in the regex language, a backslash followed by any non-alphabetic character means the latter character. So, a backslash followed by a newline character ... means the same thing as a newline.
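
The three cases can be compared side by side with a sketch like this one. The class NewlinePatterns and its methods are invented for the example; the printed length is the number of characters the regex engine sees after the string literal has been processed.

```java
// Hypothetical demo class, not part of any library.
final class NewlinePatterns {
    // The three regex arguments from the question, as Java literals.
    static String[] patterns() {
        return new String[] {
            "\n",   // 1 char:  newline
            "\\n",  // 2 chars: backslash, 'n'
            "\\\n"  // 2 chars: backslash, newline
        };
    }

    // Replaces every match of the regex in text with "/".
    static String apply(String regex, String text) {
        return text.replaceAll(regex, "/");
    }

    public static void main(String[] args) {
        String text = "Hello\nWorld\n";
        for (String regex : patterns()) {
            // Each line ends with Hello/World/
            System.out.println(regex.length() + " char(s) -> " + apply(regex, text));
        }
    }
}
```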

Stephen C answered Sep 19 '22 16:09


1) Let's say you want to replace a single \ using Java's replaceAll method:

   \
   ˪--- 1) the final backslash

2) Java's replaceAll method takes a regex as first argument. In a regex literal, \ has a special meaning, e.g. in \d which is a shortcut for [0-9] (any digit). The way to escape a metachar in a regex literal is to precede it with a \, which leads to:

 \ \
 | ˪--- 1) the final backslash
 |
 ˪----- 2) the backslash needed to escape 1) in a regex literal

3) In Java, there is no regex literal: you write a regex in a string literal (unlike JavaScript for example, where you can write /\d+/). But in a string literal, \ also has a special meaning, e.g. in \n (a new line) or \t (a tab). The way to escape a metachar in a string literal is to precede it with a \, which leads to:

\\\\
|||˪--- 1) the final backslash
||˪---- 3) the backslash needed to escape 1) in a string literal
|˪----- 2) the backslash needed to escape 1) in a regex literal
˪------ 3) the backslash needed to escape 2) in a string literal
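
As a side note to the layering above, the regex-level escape (step 2) can be delegated to the JDK's Pattern.quote(), leaving only the string-literal escape to write by hand. This is just one possible sketch; the class QuotedBackslash and its method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
final class QuotedBackslash {
    // All escaping written by hand: four backslashes in the source.
    static String viaFourBackslashes(String text) {
        return text.replaceAll("\\\\", "/");
    }

    // Pattern.quote() turns its argument into a regex that matches it
    // literally, so only the string-literal escape ("\\" is one
    // backslash) remains in the source.
    static String viaQuote(String text) {
        return text.replaceAll(Pattern.quote("\\"), "/");
    }

    public static void main(String[] args) {
        String text = "E:\\dummypath"; // actual contents: E:\dummypath
        System.out.println(viaFourBackslashes(text)); // E:/dummypath
        System.out.println(viaQuote(text));           // E:/dummypath
    }
}
```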
sp00m answered Sep 21 '22 16:09