
\s doesn't actually capture all whitespace characters

In my Java 8 app, I am scanning the text passed in for whitespace, but \s in my regular expression doesn't match every whitespace character. The one I've found so far in my testing that it misses is the non-breaking space (U+00A0). This is the regular expression that ran into the issue:

Pattern p = Pattern.compile("\\s");

To solve this, I added \h to my regular expression:

Pattern p = Pattern.compile("[\\s\\h]");

Now, are there any other whitespace characters I need to be aware of that won't be matched by [\s\h]?

Jack Cole asked Jun 18 '19 16:06




1 Answer

By default, \s matches only the ASCII whitespace characters ([ \t\n\x0B\f\r]). There are two ways to overcome this limitation:

  1. Use Unicode character properties: Pattern.compile("\\p{IsWhiteSpace}")

  2. Make the predefined character class use Unicode properties:
    Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS)
    This can also be enabled via the embedded flag (?U)
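
The embedded flag is mostly useful where a Pattern flag cannot be passed, such as String.replaceAll or String.split. A minimal sketch, assuming a hypothetical helper named normalize and an input made up for illustration:

```java
import java.util.Arrays;

class EmbeddedFlagDemo {
    // Hypothetical helper: collapses each run of Unicode whitespace into one ASCII space.
    // (?U) switches \s to its Unicode definition for this expression only.
    static String normalize(String text) {
        return text.replaceAll("(?U)\\s+", " ");
    }

    public static void main(String[] args) {
        // NBSP (U+00A0) and EM SPACE (U+2003) are not matched by plain \s.
        String text = "a\u00A0b\u2003 c";
        System.out.println(normalize(text));
        System.out.println(Arrays.toString(text.split("(?U)\\s+")));
    }
}
```

When the pattern is reused often, compiling it once with Pattern.UNICODE_CHARACTER_CLASS is probably the better choice, since String.replaceAll compiles the expression on every call.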

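import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceDemo {
    public static void main(String[] args) {
        Pattern[] patterns = {
            Pattern.compile("\\s"),                                   // ASCII whitespace only
            Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS),  // flag passed to compile
            Pattern.compile("((?U)\\s)"),                             // embedded flag
            Pattern.compile("\\p{IsWhiteSpace}")                      // Unicode property
        };
        // Three ASCII whitespace characters followed by five non-ASCII ones
        String s = " \t\n\u00A0\u2002\u2003\u2006\u202F";
        for (Pattern p : patterns) {
            int count = 0;
            for (Matcher m = p.matcher(s); m.find(); ) count++;
            boolean viaFlag = (p.flags() & Pattern.UNICODE_CHARACTER_CLASS) != 0;
            System.out.printf("%-19s: %d matches%n",
                p.pattern() + (viaFlag ? " [(?U) via flags]" : ""), count);
        }
    }
}

Output:

\s                 : 3 matches
\s [(?U) via flags]: 8 matches
((?U)\s)           : 8 matches
\p{IsWhiteSpace}   : 8 matches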
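
To check what [\s\h] from the question still misses, one option is to walk every code point and compare it against \p{IsWhiteSpace}. This is only a sketch: the class name WhitespaceGap and the helper missedBy are made up for illustration, and the exact result depends on the Unicode version shipped with the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class WhitespaceGap {
    // Hypothetical helper: labels ("U+XXXX") of every code point that
    // `wanted` matches but `candidate` does not.
    static List<String> missedBy(Pattern candidate, Pattern wanted) {
        List<String> missed = new ArrayList<>();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            String s = new String(Character.toChars(cp));
            if (wanted.matcher(s).matches() && !candidate.matcher(s).matches()) {
                missed.add(String.format("U+%04X", cp));
            }
        }
        return missed;
    }

    public static void main(String[] args) {
        Pattern asked = Pattern.compile("[\\s\\h]");             // pattern from the question
        Pattern unicode = Pattern.compile("\\p{IsWhiteSpace}");  // Unicode White_Space property
        System.out.println(missedBy(asked, unicode));
    }
}
```

Going by the Pattern documentation, the list should contain U+0085 (next line), U+2028 (line separator) and U+2029 (paragraph separator). These are vertical rather than horizontal whitespace, so \h does not cover them; \v does, which is why [\s\h\v] would likely close the gap if the Unicode flag is not an option.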
Holger answered Nov 11 '22 19:11