The Oracle Pattern documentation describes three different patterns for matching whitespace: \s, \p{Space}, and \p{javaWhitespace}.
I'm wondering what is specific to each of them and how to choose the right one.
I've just noticed that \p{javaWhitespace} includes more kinds of space characters.
Pattern details:\s* - 0+ whitespaces. = - a literal = (\S*) - Group 1 capturing 0+ chars other than whitespace (or \S+ can be used to match 1 or more chars other than whitespace).
The \s metacharacter matches a whitespace character. Whitespace characters include the space character.
[^ ] matches anything but a space character.
In regular expressions, the hyphen ("-") has a special meaning inside a character class: it indicates a range, so [0-9] matches any digit from 0 to 9. As a result, you may need to escape the "-" character with a backslash ("\") when matching the literal hyphens in a social security number.
\s is the shortest, and also the least portable, way to specify a whitespace character. It is rare to port Java code to other languages; the concern is rather about carrying knowledge of one regex engine's syntax over to another. Many regex engines use Perl-like syntax, so differences in how the same construct, such as \s, is interpreted confuse programmers.
Apart from space (ASCII 32), new line (\n, ASCII 10), horizontal tab (\t, ASCII 9), carriage return (\r, ASCII 13) and form feed (\f, ASCII 12), there is no consensus between different engines on what is a space character.
Java, POSIX (ASCII): Also includes vertical tab (ASCII 11). Java seems to follow the POSIX standard here.
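To see Java's default behavior concretely, here is a minimal sketch that tests a few code points against \s with no flags set. The class and method names (DefaultWhitespaceDemo, isDefaultWhitespace) are made up for illustration; only Pattern and Character come from the JDK.

```java
import java.util.regex.Pattern;

final class DefaultWhitespaceDemo {
    // \s compiled without flags, i.e. in the default (ASCII) mode.
    private static final Pattern DEFAULT_SPACE = Pattern.compile("\\s");

    // Hypothetical helper: true if the single code point matches default-mode \s.
    static boolean isDefaultWhitespace(int codePoint) {
        String text = new String(Character.toChars(codePoint));
        return DEFAULT_SPACE.matcher(text).matches();
    }

    public static void main(String[] args) {
        // The 5 common characters, vertical tab, then two non-ASCII candidates.
        int[] samples = {0x20, 0x0A, 0x09, 0x0D, 0x0C, 0x0B, 0xA0, 0x2028};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %b%n", cp, isDefaultWhitespace(cp));
        }
    }
}
```

Vertical tab (U+000B) should be reported as a match, while the no-break space (U+00A0) and the line separator (U+2028) should not, since default mode only considers the six ASCII characters.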
JavaScript (Edition 5.1): According to the specs (word for word), apart from the 5 common ones, it includes:
- Unicode category Zs (Separator/Space), \u2028 (Line Separator), \u2029 (Paragraph Separator). It basically includes all characters under category Z (Separator). Actually, \u2028 is the sole member of category Zl (Separator/Line), and \u2029 is the sole member of category Zp (Separator/Paragraph). By the wording, it is possible that the current version of the specs excludes any future additions to those 2 categories.
- \v
- \ufeff
Perl, PCRE (ASCII mode): Vertical tab \v added from Perl 5.18 as an experiment. Before 5.18, \s only matches the 5 common ones.
Perl (Unicode mode): Apart from the 5 common ones, it includes:
- \v (added from Perl 5.18 as an experiment)
- \u0085
- \u180e
.NET (default): Apart from the 5 common ones, it includes:
- \v
- \u0085
Java (Unicode): From Java 7, the Pattern class includes a new flag, UNICODE_CHARACTER_CLASS, which makes predefined character classes and POSIX character classes conform to Unicode Technical Standard #18: Unicode Regular Expressions. When the flag is active, a predefined character class and the corresponding POSIX character class become equivalent (they match the same thing).
The list of characters is the same as .NET's.
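The effect of the flag can be sketched as follows. This is an illustration only: the class and method names (UnicodeWhitespaceDemo, matchesAscii, matchesUnicode) are hypothetical, and the sample code points are just a few picked to show the difference, not a complete list.

```java
import java.util.regex.Pattern;

final class UnicodeWhitespaceDemo {
    // Same pattern text, compiled with and without UNICODE_CHARACTER_CLASS.
    private static final Pattern ASCII_SPACE = Pattern.compile("\\s");
    private static final Pattern UNICODE_SPACE =
            Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

    private static boolean matches(Pattern pattern, int codePoint) {
        return pattern.matcher(new String(Character.toChars(codePoint))).matches();
    }

    // Hypothetical helpers, one per mode.
    static boolean matchesAscii(int codePoint) {
        return matches(ASCII_SPACE, codePoint);
    }

    static boolean matchesUnicode(int codePoint) {
        return matches(UNICODE_SPACE, codePoint);
    }

    public static void main(String[] args) {
        // Space, NEL, no-break space, line separator.
        int[] samples = {0x20, 0x85, 0xA0, 0x2028};
        for (int cp : samples) {
            System.out.printf("U+%04X  default=%b  unicode=%b%n",
                    cp, matchesAscii(cp), matchesUnicode(cp));
        }
    }
}
```

The same flag can also be switched on inside the pattern itself with the embedded flag expression (?U), which is convenient when only a pattern string can be passed around.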
That is enough to drive one crazy!
\p{Space} is the more "stable" option, since it follows the POSIX standard in default mode and Unicode Technical Standard #18: Unicode Regular Expressions in UNICODE_CHARACTER_CLASS mode.
If you use a POSIX character class, POSIX-compliant implementations will have the same behavior in ASCII mode, and Unicode regex engines that follow the recommendation will have (almost) the same behavior in Unicode mode.
\s and \p{Space} are equivalent in Java, regardless of the flag. If you use \s in Java, you can be sure you are following some standard or recommendation; the syntax just does not make that fact obvious to most programmers.
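One way to check the equivalence claim on your own JDK is a brute-force comparison over every code point, in both modes. The sketch below is only that, a check you can run; SpaceEquivalenceCheck and countDifferences are names invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SpaceEquivalenceCheck {
    // Hypothetical helper: counts code points on which \s and \p{Space}
    // disagree when both are compiled with the given flags.
    static int countDifferences(int flags) {
        // Matchers are created once and reset for each input to avoid churn.
        Matcher shorthand = Pattern.compile("\\s", flags).matcher("");
        Matcher posix = Pattern.compile("\\p{Space}", flags).matcher("");
        int differences = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            String text = new String(Character.toChars(cp));
            if (shorthand.reset(text).matches() != posix.reset(text).matches()) {
                differences++;
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        System.out.println("default mode differences: " + countDifferences(0));
        System.out.println("unicode mode differences: "
                + countDifferences(Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

If the two classes really are equivalent on the JDK in use, both counts come out as zero.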
Use \p{javaWhitespace} to match whitespace according to Java's own definition, that is, Character.isWhitespace(). The name of that function is extremely misleading.
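Why the name misleads becomes clearer with a few samples. Per its Javadoc, Character.isWhitespace() excludes the non-breaking spaces (\u00A0, \u2007, \u202F) but includes the control characters \u001C to \u001F. The sketch below compares it with default-mode \s; the class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

final class JavaWhitespaceDemo {
    private static final Pattern JAVA_WHITESPACE = Pattern.compile("\\p{javaWhitespace}");
    private static final Pattern DEFAULT_SPACE = Pattern.compile("\\s");

    private static boolean matches(Pattern pattern, int codePoint) {
        return pattern.matcher(new String(Character.toChars(codePoint))).matches();
    }

    // Hypothetical helpers for the two classes being compared.
    static boolean matchesJavaWhitespace(int codePoint) {
        return matches(JAVA_WHITESPACE, codePoint);
    }

    static boolean matchesBackslashS(int codePoint) {
        return matches(DEFAULT_SPACE, codePoint);
    }

    public static void main(String[] args) {
        // Space, file separator (control), no-break space, em space.
        int[] samples = {0x20, 0x1C, 0xA0, 0x2003};
        for (int cp : samples) {
            System.out.printf("U+%04X  \\s=%b  javaWhitespace=%b  isWhitespace=%b%n",
                    cp, matchesBackslashS(cp), matchesJavaWhitespace(cp),
                    Character.isWhitespace(cp));
        }
    }
}
```

The last two columns should always agree, since \p{javaWhitespace} is defined in terms of Character.isWhitespace(). The em space (U+2003) is one of the extra characters that \p{javaWhitespace} accepts and default-mode \s does not.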