
Remove all whitespace from a String but keep ONE newline

I have this input String (containing tabs, spaces, line breaks):


        That      is a test.              
    seems to work       pretty good? working.








    Another test  again.

[Edit]: I should have provided the String itself for better testing, since Stack Overflow strips all special characters (tabs, ...):

String testContent = "\n\t\n\t\t\t\n\t\t\tDas      ist ein Test.\t\t\t  \n\tsoweit scheint das \t\tganze zu? funktionieren.\n\n\n\n\t\t\n\t\t\n\t\t\t      \n\t\t\t      \n    \t\t\t\n    \tNoch ein  Test.\n    \t\n    \t\n    \t";

And I want to reach this state:


That is a test.
seems to work pretty good? working.
Another test again.

String expectedOutput = "Das ist ein Test.\nsoweit scheint das ganze zu? funktionieren.\nNoch ein Test.\n";

Any ideas? Can this be achieved using regexes?

replaceAll("\\s+", " ") is NOT what I'm looking for. If this regex preserved exactly one newline from each run of whitespace that contains newlines, it would be perfect.
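To make the problem concrete, here is a minimal sketch of what that call does. The class and method names (`CollapseAllDemo`, `collapseAll`) are made up for illustration; only `String.replaceAll` comes from the question.

```java
// Hypothetical demo class; the names are illustrative only.
final class CollapseAllDemo {

    // Collapses every run of whitespace, newlines included, into one space.
    static String collapseAll(String input) {
        return input.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String sample = "a \t b\n\n\tc\n";
        // The brackets make leading/trailing spaces visible in the output.
        System.out.println("[" + collapseAll(sample) + "]");
    }
}
```

Because `\s` also matches `\n`, the line structure is lost: everything ends up on a single line.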

I have tried this, but it seems suboptimal to me:

BufferedReader bufReader = new BufferedReader(new StringReader(testContent));
String line = null;
StringBuilder newString = new StringBuilder();
while ((line = bufReader.readLine()) != null) {
    String temp = line.replaceAll("\\s+", " ");
    if (!temp.trim().equals("")) {
        newString.append(temp.trim());
        newString.append("\n");
    }
}
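
For reference, the loop above can be wrapped into a self-contained method so it can be run against the test String. This is only a sketch: the class name `LineByLineNormalizer` and the method `normalize` are assumptions, and the `IOException` is rethrown unchecked purely for convenience.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class; the names are illustrative only.
final class LineByLineNormalizer {

    static String normalize(String input) {
        StringBuilder result = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(input))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Collapse whitespace runs inside the line, then trim both ends.
                String cleaned = line.replaceAll("\\s+", " ").trim();
                // Skip lines that contained nothing but whitespace.
                if (!cleaned.isEmpty()) {
                    result.append(cleaned).append('\n');
                }
            }
        } catch (IOException e) {
            // Not expected for an in-memory StringReader.
            throw new UncheckedIOException(e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String sample = "\n\t  That      is a test.\t \n\n\n   Another test  again.\n \t";
        System.out.print(normalize(sample));
    }
}
```

Every kept line gets a `"\n"` appended, so the result ends with a newline, like `expectedOutput`.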

friesoft asked Mar 19 '13 08:03


2 Answers

In a single regex (plus a small patch for tabs):

input.replaceAll("^\\s+|\\s+$|\\s*(\n)\\s*|(\\s)\\s*", "$1$2")
     .replace("\t"," ");

The regex looks daunting, but in fact decomposes nicely into these parts that are OR-ed together:

  • ^\s+ – match whitespace at the beginning;
  • \s+$ – match whitespace at the end;
  • \s*(\n)\s* – match whitespace containing a newline, and capture that newline;
  • (\s)\s* – match whitespace, capturing the first whitespace character.

Every match has two capture groups, but at most one of them is non-empty at a time (for the first two alternatives both are empty, so the match is simply deleted). This allows me to replace the match with "$1$2", which means "concatenate the two capture groups."

The only remaining problem is that I can't replace a tab with a space using this approach, so I fix that up with a simple non-regex character replacement.
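
A minimal runnable sketch of the single-regex approach described above, applied to the test String from the question. The class name `SingleRegexNormalizer` and the method `normalize` are assumptions made for the example; the regex and the tab patch are taken unchanged from the answer.

```java
// Hypothetical helper class; the names are illustrative only.
final class SingleRegexNormalizer {

    static String normalize(String input) {
        return input
            // Alternatives, tried in order: leading whitespace, trailing whitespace,
            // a whitespace run containing a newline (group 1), any other run (group 2).
            .replaceAll("^\\s+|\\s+$|\\s*(\n)\\s*|(\\s)\\s*", "$1$2")
            // Group 2 may have captured a tab; turn it into a space.
            .replace("\t", " ");
    }

    public static void main(String[] args) {
        String testContent = "\n\t\n\t\t\t\n\t\t\tDas      ist ein Test.\t\t\t  \n\tsoweit scheint das \t\tganze zu? funktionieren.\n\n\n\n\t\t\n\t\t\n\t\t\t      \n\t\t\t      \n    \t\t\t\n    \tNoch ein  Test.\n    \t\n    \t\n    \t";
        System.out.println(normalize(testContent));
    }
}
```

Note that the `\s+$` alternative also removes the final line break, so the result does not end with `"\n"` the way `expectedOutput` in the question does. If the trailing newline matters, it would have to be appended afterwards.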


Marko Topolnik answered Sep 21 '22 13:09


In 4 steps:

text
    // 1. compress each run of non-newline whitespace to a single space
    .replaceAll("[\\s&&[^\\n]]+", " ")
    // 2. remove the space left at the beginning or end of each line
    //    (a literal space: \s would also match the newlines around empty lines)
    .replaceAll("(?m)^ | $", "")
    // 3. compress multiple newlines to a single newline
    .replaceAll("\\n+", "\n")
    // 4. remove a newline from the beginning or end of the string
    .replaceAll("^\n|\n$", "")
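
The four steps can be tried out with a small self-contained sketch. The class name `FourStepNormalizer` and the method `normalize` are assumptions for the example. In this sketch step 2 matches a literal space rather than `\s`: in MULTILINE mode `\s` also matches `\n`, so on input such as `"a\n\nb"` it could remove the newlines around the empty line and join the two lines.

```java
// Hypothetical helper class; the names are illustrative only.
final class FourStepNormalizer {

    static String normalize(String text) {
        return text
            // 1. compress each run of non-newline whitespace to a single space
            .replaceAll("[\\s&&[^\\n]]+", " ")
            // 2. drop the single space left at the start or end of each line
            .replaceAll("(?m)^ | $", "")
            // 3. compress runs of newlines to a single newline
            .replaceAll("\\n+", "\n")
            // 4. drop a newline at the very start or very end of the string
            .replaceAll("^\n|\n$", "");
    }

    public static void main(String[] args) {
        String testContent = "\n\t\n\t\t\t\n\t\t\tDas      ist ein Test.\t\t\t  \n\tsoweit scheint das \t\tganze zu? funktionieren.\n\n\n\n\t\t\n\t\t\n\t\t\t      \n\t\t\t      \n    \t\t\t\n    \tNoch ein  Test.\n    \t\n    \t\n    \t";
        System.out.println(normalize(testContent));
    }
}
```

As with the single-regex variant, step 4 removes the final line break, so the result has no trailing `"\n"`, unlike `expectedOutput` in the question.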

MBO answered Sep 21 '22 13:09