
Text cleaning and replacement: delete \n from a text in Java

Tags: java, string

I'm cleaning incoming text in my Java code. The text contains a lot of "\n" sequences: not actual newlines, but the literal characters backslash and n. I've been using replaceAll() from the String class, but I haven't been able to delete the "\n". This doesn't seem to work:

String string;
string = string.replaceAll("\\n", "");

Neither does this:

String string;
string = string.replaceAll("\n", "");

I guess this last one is identified as an actual new line, so all the new lines from the text would be removed.

Also, what would be an effective way to remove different patterns of unwanted text from a String? I'm using regular expressions to detect them (HTML reserved characters and the like) together with replaceAll, but every time I call replaceAll the whole String is scanned, right?

UPDATE: Thanks for your great answers. I've extended this question here:
Text replacement efficiency
I'm asking specifically about efficiency :D

Fernando Briano asked Feb 12 '09 16:02


4 Answers

Hooknc is right. I'd just like to post a little explanation:

"\\n" becomes "\n" once the compiler is done (since you escaped the backslash). The regex engine therefore sees "\n", treats it as a newline, and would remove real newlines (and not the literal "\n" you have).

"\n" is translated into a real newline character by the compiler, so that newline character is what gets sent to the regex engine.

"\\\\n" is ugly, but right. The compiler resolves the escape sequences, so the regex engine sees "\\n". The regex engine sees the two backslashes and knows that the first one escapes the second, so the pattern matches the literal characters '\' and 'n', giving you the desired result.
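A minimal sketch of the cases above, in case it helps to see them run. The class and method names are made up for illustration; only String.replaceAll() comes from the question.

```java
// Hypothetical demo class; the names are illustrative, not from the question.
class LiteralBackslashN {

    // Source literal "\\\\n" -> string \\n -> regex: a literal backslash followed by 'n'.
    static String stripWithRegex(String text) {
        return text.replaceAll("\\\\n", "");
    }

    // Source literal "\\n" -> string \n -> regex escape for a real newline character.
    static String stripRealNewlinesOnly(String text) {
        return text.replaceAll("\\n", "");
    }

    public static void main(String[] args) {
        // This literal holds: a, backslash, n, b, real newline, c
        String input = "a\\nb\nc";

        // The literal backslash-n is gone, the real newline is kept.
        System.out.println(stripWithRegex(input).equals("ab\nc"));

        // The real newline is gone, the literal backslash-n is kept.
        System.out.println(stripRealNewlinesOnly(input).equals("a\\nbc"));
    }
}
```

Both checks should print true, which matches the explanation: two backslashes in the source remove real newlines, four remove the literal text.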

Java is nice (it's the language I work in) but having to remember to double-escape regexes can be a real challenge. For extra fun, it seems StackOverflow likes to try to translate backslashes too.

MBCook answered Nov 16 '22 03:11


I think you need to add a couple more slashies...

String string;
string = string.replaceAll("\\\\n", "");

Explanation: The number of slashies has to do with the fact that "\n" by itself is a control character in Java.

So to get the real characters of "\n" somewhere we need to use "\\n", which if printed out will give us: "\n"

You're looking to replace all "\n" in your file, but you're not looking to replace the control character "\n". So you tried "\\n", which will be converted into the characters "\n". Great, but maybe not so much. My guess is that the replaceAll method will then build a Regular Expression from the "\n" characters, which will be read as the control character "\n".

Whew, almost done.

Using replaceAll("\\\\n", "") will first convert "\\\\n" -> "\\n", which is what the Regular Expression receives. The "\\n" in the Regular Expression then represents your literal text of "\n", which is what you're looking to replace.
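To check what the compiler leaves in each literal before the regex engine ever looks at it, something like the following could be used. This is only a sketch; EscapeLevels and describe() are hypothetical helpers, not part of any library.

```java
// Hypothetical helper for inspecting string literals; names are illustrative.
class EscapeLevels {

    // Lists the characters of s separated by spaces, spelling a real newline as <LF>.
    static String describe(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            char c = s.charAt(i);
            out.append(c == '\n' ? "<LF>" : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("\n"));      // one character: the control newline
        System.out.println(describe("\\n"));     // two characters: backslash, n
        System.out.println(describe("\\\\n"));   // three characters: backslash, backslash, n
    }
}
```

The three-character version is the one that survives a second round of unescaping by the regex engine and still means a literal backslash followed by n.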

hooknc answered Nov 16 '22 02:11


Instead of String.replaceAll(), which uses regular expressions, you might be better off using String.replace(), which does simple string substitution (if you are using at least Java 1.5).

String replacement = string.replace("\\n", "");

should do what you want.
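A short sketch of that approach, assuming Java 1.5 or later. The class and method names are invented for the example; String.replace(CharSequence, CharSequence) and Pattern.quote() are standard library calls. Despite its name, replace() substitutes every occurrence, not just the first.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative.
class PlainReplaceDemo {

    // Plain substring replacement: only the Java compiler's escaping applies,
    // so "\\n" is the two characters backslash and n.
    static String stripLiteral(String text) {
        return text.replace("\\n", "");
    }

    // Equivalent regex form: Pattern.quote() makes the regex engine treat
    // the two characters literally, so no extra backslashes are needed.
    static String stripQuoted(String text) {
        return text.replaceAll(Pattern.quote("\\n"), "");
    }

    public static void main(String[] args) {
        // This literal holds a literal backslash-n between the two phrases.
        String input = "line one\\nline two";
        System.out.println(stripLiteral(input));
        System.out.println(stripQuoted(input));
    }
}
```

Real newline characters are left untouched by both methods, since neither pattern refers to the control character.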

Avi answered Nov 16 '22 03:11


string = string.replaceAll(""+(char)10, " ");

gattsbr answered Nov 16 '22 02:11