In my Java 8 app, I am scanning for whitespace in the text passed in, but \s in my regular expression doesn't capture all whitespace characters. The one I've found so far in my testing that it misses is the non-breaking space (U+00A0). This was the regular expression that ran into the issue:
Pattern p = Pattern.compile("\\s");
To solve this, I added \h to my regular expression:
Pattern p = Pattern.compile("[\\s\\h]");
Now, are there any other whitespace characters I need to be aware of that won't be captured by [\s\h]?
Yes, for your case a space works. \s matches any whitespace character (spaces, tabs, carriage returns, new lines...)
\s stands for “whitespace character”. Again, which characters this actually includes, depends on the regex flavor. In all flavors discussed in this tutorial, it includes [ \t\r\n\f]. That is: \s matches a space, a tab, a carriage return, a line feed, or a form feed.
Space, tab, line feed (newline), carriage return, form feed, and vertical tab characters are called "white-space characters" because they serve the same purpose as the spaces between words and lines on a printed page — they make reading easier.
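To make the ASCII-only behaviour concrete, here is a minimal sketch that tests individual characters against the default \s in Java. The class and method names (DefaultWhitespaceDemo, isDefaultWhitespace) are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.regex.Pattern;

class DefaultWhitespaceDemo {
    // Default \s, compiled without any Unicode flag, as in the question.
    static final Pattern DEFAULT_WS = Pattern.compile("\\s");

    // Returns true if the single character c is matched by the default \s.
    static boolean isDefaultWhitespace(char c) {
        return DEFAULT_WS.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        // Space, tab, line feed, vertical tab, form feed, carriage return,
        // followed by the non-breaking space (U+00A0).
        char[] samples = {' ', '\t', '\n', (char) 0x0B, '\f', '\r', (char) 0x00A0};
        for (char c : samples) {
            System.out.printf("U+%04X matched by default \\s: %b%n",
                    (int) c, isDefaultWhitespace(c));
        }
    }
}
```

The six ASCII characters should report true, while the non-breaking space should report false, which is the gap described in the question.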
By default, \s only matches ASCII whitespace characters ([ \t\n\x0B\f\r]). There are two ways to overcome this limitation:
1. Use Unicode character properties: Pattern.compile("\\p{IsWhiteSpace}")
2. Make the predefined character class use Unicode properties: Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS)
   This can also be enabled via the embedded flag (?U).
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The same whitespace test expressed four ways.
Pattern[] pattern = {
    Pattern.compile("\\s"),
    Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS),
    Pattern.compile("((?U)\\s)"),
    Pattern.compile("\\p{IsWhiteSpace}")
};
// Three ASCII whitespace characters followed by five non-ASCII ones.
String s = " \t\n\u00A0\u2002\u2003\u2006\u202F";
for (Pattern p : pattern) {
    int count = 0;
    for (Matcher m = p.matcher(s); m.find(); ) count++;
    System.out.printf("%-19s: %d matches%n",
        p.pattern() + ((p.flags() & Pattern.UNICODE_CHARACTER_CLASS) != 0 ? " [(?U) via flags]" : ""),
        count);
}
\s                 : 3 matches
\s [(?U) via flags]: 8 matches
((?U)\s)           : 8 matches
\p{IsWhiteSpace}   : 8 matches
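As for what [\s\h] still misses: going by the Java 8 Pattern documentation, \h is [ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000] and \v is [\n\x0B\f\r\x85\u2028\u2029], so the candidates that fall outside [\s\h] are the next-line character (U+0085), the line separator (U+2028) and the paragraph separator (U+2029). The sketch below checks those three code points against the pattern from the question, against [\h\v], and against \s with UNICODE_CHARACTER_CLASS. The class name WhitespaceGaps and the helper matches are made up for illustration.

```java
import java.util.regex.Pattern;

class WhitespaceGaps {
    // The character class from the question.
    static final Pattern S_OR_H = Pattern.compile("[\\s\\h]");
    // Horizontal plus vertical whitespace, both predefined since Java 8.
    static final Pattern H_OR_V = Pattern.compile("[\\h\\v]");
    // \s with Unicode-aware predefined character classes.
    static final Pattern UNICODE_S =
            Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

    // Returns true if the pattern matches the single given code point.
    static boolean matches(Pattern p, int codePoint) {
        return p.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // Next line (U+0085), line separator (U+2028), paragraph separator (U+2029).
        int[] candidates = {0x0085, 0x2028, 0x2029};
        for (int cp : candidates) {
            System.out.printf("U+%04X  [\\s\\h]=%b  [\\h\\v]=%b  (?U)\\s=%b%n",
                    cp, matches(S_OR_H, cp), matches(H_OR_V, cp), matches(UNICODE_S, cp));
        }
    }
}
```

If the three code points come back false for [\s\h] and true for the other two patterns, either [\h\v] or (?U)\s is a reasonable replacement, depending on whether you want a fixed list or the Unicode White_Space property.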