If you're looking for a single space, the pattern is " " (one space). If you're looking for one or more spaces, it's "  *" (that's two spaces and an asterisk) or " +" (one space and a plus).
Therefore, the regular expression \s matches a single whitespace character, while \s+ will match one or more whitespace characters.
The Java regex pattern \\s+ is used to match multiple whitespace characters when applying a regex search to your specified value. The pattern is a modified version of \\s which is used to match a single whitespace character. The difference is easy to see with an example.
\\s - matches a single whitespace character. \\s+ - matches a sequence of one or more whitespace characters.
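To make that difference concrete, here is a minimal sketch. The class and method names are made up for illustration; only String.replaceAll is the real API. Each match is replaced with an underscore so you can count the matches.

```java
class SingleVsRepeatedWhitespace {
    // Replaces every individual whitespace character with an underscore.
    static String markEach(String input) {
        return input.replaceAll("\\s", "_");
    }

    // Replaces every run of one or more whitespace characters with one underscore.
    static String markRuns(String input) {
        return input.replaceAll("\\s+", "_");
    }

    public static void main(String[] args) {
        String input = "a  b\t\tc"; // two spaces, then two tabs
        System.out.println(markEach(input)); // a__b__c
        System.out.println(markRuns(input)); // a_b_c
    }
}
```

With \\s each of the four whitespace characters is its own match; with \\s+ each run is a single match.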
You can’t use \s in Java to match white space on its own native character set, because Java doesn’t support the Unicode white space property, even though doing so is strictly required to meet UTS#18’s RL1.2! What it does have is not standards-conforming, alas.
Unicode defines 26 code points as \p{White_Space}: 20 of them are various sorts of \pZ (GeneralCategory=Separator), and the remaining 6 are \p{Cc} (GeneralCategory=Control).
White space is a pretty stable property, and those same ones have been around virtually forever. Even so, Java has no property that conforms to The Unicode Standard for these, so you instead have to use code like this:
String whitespace_chars = "" /* dummy empty string for homogeneity */
+ "\\u0009" // CHARACTER TABULATION
+ "\\u000A" // LINE FEED (LF)
+ "\\u000B" // LINE TABULATION
+ "\\u000C" // FORM FEED (FF)
+ "\\u000D" // CARRIAGE RETURN (CR)
+ "\\u0020" // SPACE
+ "\\u0085" // NEXT LINE (NEL)
+ "\\u00A0" // NO-BREAK SPACE
+ "\\u1680" // OGHAM SPACE MARK
+ "\\u180E" // MONGOLIAN VOWEL SEPARATOR
+ "\\u2000" // EN QUAD
+ "\\u2001" // EM QUAD
+ "\\u2002" // EN SPACE
+ "\\u2003" // EM SPACE
+ "\\u2004" // THREE-PER-EM SPACE
+ "\\u2005" // FOUR-PER-EM SPACE
+ "\\u2006" // SIX-PER-EM SPACE
+ "\\u2007" // FIGURE SPACE
+ "\\u2008" // PUNCTUATION SPACE
+ "\\u2009" // THIN SPACE
+ "\\u200A" // HAIR SPACE
+ "\\u2028" // LINE SEPARATOR
+ "\\u2029" // PARAGRAPH SEPARATOR
+ "\\u202F" // NARROW NO-BREAK SPACE
+ "\\u205F" // MEDIUM MATHEMATICAL SPACE
+ "\\u3000" // IDEOGRAPHIC SPACE
;
/* A \s that actually works for Java’s native character set: Unicode */
String whitespace_charclass = "[" + whitespace_chars + "]";
/* A \S that actually works for Java’s native character set: Unicode */
String not_whitespace_charclass = "[^" + whitespace_chars + "]";
Now you can use whitespace_charclass + "+" as the pattern in your replaceAll.
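As a rough sketch of how that could be wired up: the class below rebuilds the same 26 code points using ranges for brevity and collapses each run into one plain space. The names UnicodeWhitespaceCollapse and collapse are hypothetical; the regex escapes are the ones from the list above.

```java
import java.util.regex.Pattern;

class UnicodeWhitespaceCollapse {
    // Same 26 code points as the long list, expressed as regex escapes and ranges.
    static final String WHITESPACE_CHARS =
          "\\u0009-\\u000D"   // TAB, LF, VT, FF, CR
        + "\\u0020"           // SPACE
        + "\\u0085"           // NEXT LINE (NEL)
        + "\\u00A0"           // NO-BREAK SPACE
        + "\\u1680"           // OGHAM SPACE MARK
        + "\\u180E"           // MONGOLIAN VOWEL SEPARATOR
        + "\\u2000-\\u200A"   // EN QUAD through HAIR SPACE
        + "\\u2028\\u2029"    // LINE SEPARATOR, PARAGRAPH SEPARATOR
        + "\\u202F"           // NARROW NO-BREAK SPACE
        + "\\u205F"           // MEDIUM MATHEMATICAL SPACE
        + "\\u3000";          // IDEOGRAPHIC SPACE

    // Compiled once: one or more characters from the class above.
    static final Pattern WHITESPACE_RUN =
        Pattern.compile("[" + WHITESPACE_CHARS + "]+");

    // Replaces every run of listed white space with a single ASCII space.
    static String collapse(String input) {
        return WHITESPACE_RUN.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        // NO-BREAK SPACE, EM SPACE and a plain space between the two words.
        System.out.println(collapse("hello\u00A0\u2003 world")); // hello world
    }
}
```

Compiling the pattern once avoids recompiling it on every call, which is what String.replaceAll would do.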
Sorry ’bout all that. Java’s regexes just don’t work very well on its own native character set, and so you really have to jump through exotic hoops to make them work.
And if you think white space is bad, you should see what you have to do to get \w and \b to finally behave properly!
Yes, it’s possible, and yes, it’s a mind-numbing mess. That’s being charitable, even. The easiest way to get a standards-conforming regex library for Java is to JNI over to ICU’s stuff. That’s what Google does for Android, because OraSun’s doesn’t measure up.
If you don’t want to do that but still want to stick with Java, I have a front-end regex rewriting library I wrote that “fixes” Java’s patterns, at least to get them to conform to the requirements of RL1.2a in UTS#18, Unicode Regular Expressions.
Yeah, you need to grab the result of matcher.replaceAll():
String result = matcher.replaceAll(" ");
System.out.println(result);
For Java (not PHP, not JavaScript, not any other language):
txt.replaceAll("\\p{javaSpaceChar}{2,}"," ")
When I sent a question to a RegexBuddy (regex developer application) forum, I got a more exact reply to my \s Java question:
"Message author: Jan Goyvaerts
In Java, the shorthands \s, \d, and \w only include ASCII characters. ... This is not a bug in Java, but simply one of the many things you need to be aware of when working with regular expressions. To match all Unicode whitespace as well as line breaks, you can use [\s\p{Z}] in Java. RegexBuddy does not yet support Java-specific properties such as \p{javaSpaceChar} (which matches the exact same characters as [\s\p{Z}]).
... \s\s will match two spaces, if the input is ASCII only. The real problem is with the OP's code, as is pointed out by the accepted answer in that question."
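Here is a small sketch of the point made in that reply, assuming the default Pattern flags (no UNICODE_CHARACTER_CLASS). The class and method names are invented for illustration. It compares plain \s+ with the suggested [\s\p{Z}]+ on a string containing NO-BREAK SPACE characters.

```java
class AsciiVsUnicodeSpace {
    // Default \s: covers only the ASCII white space characters.
    static String collapseAscii(String input) {
        return input.replaceAll("\\s+", " ");
    }

    // \s plus the Unicode Separator category (Zs, Zl, Zp).
    static String collapseUnicode(String input) {
        return input.replaceAll("[\\s\\p{Z}]+", " ");
    }

    public static void main(String[] args) {
        String input = "a\u00A0\u00A0b"; // two NO-BREAK SPACE characters
        // Default \s does not match U+00A0, so the input is left unchanged.
        System.out.println(collapseAscii(input).equals(input)); // true
        System.out.println(collapseUnicode(input)); // a b
    }
}
```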
Seems to work for me:
String s = " a b c";
System.out.println("\"" + s.replaceAll("\\s\\s", " ") + "\"");
will print:
" a b c"
I think you intended to do this instead of your code:
Pattern whitespace = Pattern.compile("\\s\\s");
Matcher matcher = whitespace.matcher(s);
String result = s; // keep the original string if nothing matches
if (matcher.find()) {
result = matcher.replaceAll(" ");
}
System.out.println(result);
Java has evolved since this issue was first brought up. You can match all manner of Unicode space characters by using the \p{Zs} group.
Thus if you wanted to replace one or more exotic spaces with a plain space you could do this:
String txt = "whatever my string is";
String newTxt = txt.replaceAll("\\p{Zs}+", " ");
Also worth knowing: if you've used the trim() string function, you should take a look at the (relatively new, added in Java 11) strip(), stripLeading(), and stripTrailing() functions on strings. They can help you trim off all sorts of squirrelly white space characters. For more information on what white space is included, see Java's Character.isWhitespace() function.
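A quick sketch of the difference, assuming Java 11 or later. The class and helper names are hypothetical. trim() only removes characters at or below U+0020, while strip() removes anything Character.isWhitespace() accepts, such as EM SPACE (U+2003).

```java
class StripVersusTrim {
    // Wraps a value in brackets so leading and trailing white space is visible.
    static String show(String value) {
        return "[" + value + "]";
    }

    public static void main(String[] args) {
        String padded = "\u2003 hello \u2003"; // EM SPACE + space on both sides

        // trim() stops at the EM SPACE, so nothing is removed here.
        System.out.println(show(padded.trim()).length());  // 11
        // strip() removes the EM SPACE and the plain space on both ends.
        System.out.println(show(padded.strip()));          // [hello]
        // One-sided variants leave the other end untouched.
        System.out.println(show(padded.stripLeading()).length());  // 9
        System.out.println(show(padded.stripTrailing()).length()); // 9
    }
}
```

Note that Character.isWhitespace() deliberately excludes the non-breaking spaces (U+00A0, U+2007, U+202F), so strip() leaves those in place.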