In the Oracle Pattern documentation there are descriptions of three different patterns for matching whitespace: <ol> <li>\s</li> <li>\p{Space}</li> <li>\p{javaWhitespace}</li> </ol> I'm wondering what the specifics of each are and how to choose the right one. I've just noticed that <code>\p{javaWhitespace}</code> includes more kinds of space characters.


How to choose between whitespace patterns?

1 Answer

\s is the shortest and also the least portable way to specify a space character. Porting Java code to other languages is rare, so the concern is less about the code and more about carrying knowledge of one regex engine's syntax over to another. Many regex engines use Perl-like syntax, so differences in how they interpret the same syntax, such as \s, confuse programmers.

Apart from space (ASCII 32), new line (\n, ASCII 10), horizontal tab (\t, ASCII 9), carriage return (\r, ASCII 13) and form feed (\f, ASCII 12), there is no consensus between different engines of what is a space character.

- Java, POSIX (ASCII): also includes vertical tab (ASCII 11). Java seems to follow the POSIX standard here.
- JavaScript (ECMAScript 5.1): according to the spec (word for word), apart from the 5 common ones, it includes:
  - Unicode category Zs (Separator/Space), \u2028 (Line Separator) and \u2029 (Paragraph Separator). This effectively covers every character in category Z (Separator), since \u2028 is the sole member of category Zl (Separator/Line) and \u2029 is the sole member of category Zp (Separator/Paragraph). Because the spec names these two code points rather than the categories, its wording may exclude any character added to those 2 categories later.
  - Vertical tab \v
  - Byte-Order Mark, a.k.a. ZERO WIDTH NO-BREAK SPACE, \ufeff
- Perl, PCRE (ASCII mode): vertical tab \v was added in Perl 5.18 as an experimental feature. Before 5.18, \s matches only the 5 common ones.
- Perl (Unicode mode): apart from the 5 common ones:
  - Unicode category Z (Separator)
  - Vertical tab \v, added in Perl 5.18 as an experimental feature
  - NEXT LINE (NEL) \u0085
  - MONGOLIAN VOWEL SEPARATOR \u180e
- .NET (default): apart from the 5 common ones:
  - Unicode category Z (Separator)
  - Vertical tab \v
  - NEXT LINE (NEL) \u0085
- Java (Unicode): from Java 7, the Pattern class includes a new flag, UNICODE_CHARACTER_CLASS, which makes the predefined character classes and the POSIX character classes conform to Unicode Technical Standard #18: Unicode Regular Expressions. When the flag is active, each predefined character class and its corresponding POSIX character class become equivalent (they match the same thing). The list of characters matched by \s is then the same as .NET's.
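
The difference between Java's default mode and its Unicode mode can be checked directly. The following is a minimal sketch, not a definitive test; the class name `WhitespaceModes` and its helper methods are made up for illustration, while `Pattern.UNICODE_CHARACTER_CLASS` is the real flag mentioned above (Java 7 or later).

```java
import java.util.regex.Pattern;

// Hypothetical helper class, used only for this illustration.
class WhitespaceModes {
    // Default mode: the shorthand covers space, tab, newline,
    // vertical tab, form feed and carriage return.
    static final Pattern ASCII_SPACE = Pattern.compile("\\s");

    // Unicode mode (Java 7+): the shorthand follows the Unicode
    // White_Space property instead.
    static final Pattern UNICODE_SPACE =
            Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean matchesAscii(char c) {
        return ASCII_SPACE.matcher(String.valueOf(c)).matches();
    }

    static boolean matchesUnicode(char c) {
        return UNICODE_SPACE.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        // Space, vertical tab, NEL, no-break space, line separator, BOM.
        char[] samples = {' ', '\u000B', '\u0085', '\u00A0', '\u2028', '\uFEFF'};
        for (char c : samples) {
            System.out.printf("U+%04X default=%b unicode=%b%n",
                    (int) c, matchesAscii(c), matchesUnicode(c));
        }
    }
}
```

Under these assumptions, NEL, the no-break space and the line separator should match only in Unicode mode, while the byte-order mark should match in neither mode, unlike in JavaScript.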

That is enough to drive one crazy!

\p{Space} is the more "stable" option, since it follows the POSIX standard in default mode and Unicode Technical Standard #18: Unicode Regular Expressions in UNICODE_CHARACTER_CLASS mode.

If you use a POSIX character class, POSIX-compliant implementations will behave the same way in ASCII mode, and Unicode regex engines that follow the recommendation will behave (almost) the same way in Unicode mode.

\s and \p{Space} are equivalent in Java, regardless of the flag. So if you use \s in Java, you can be sure you are following some standard or recommendation. It just does not make that fact obvious to most programmers.
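
The equivalence claim can be checked by brute force. This is only a sketch under the assumption of a Java 7+ runtime; `SpaceEquivalence` and `countDifferences` are hypothetical names introduced for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, used only for this illustration.
class SpaceEquivalence {
    // Counts the code points on which the shorthand class and the
    // POSIX class disagree when both are compiled with the same flags.
    static int countDifferences(int flags) {
        Pattern shorthand = Pattern.compile("\\s", flags);
        Pattern posix = Pattern.compile("\\p{Space}", flags);
        int differences = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            String s = new String(Character.toChars(cp));
            if (shorthand.matcher(s).matches() != posix.matcher(s).matches()) {
                differences++;
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        System.out.println("default mode: " + countDifferences(0));
        System.out.println("unicode mode: "
                + countDifferences(Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

If the two classes really are equivalent, both counts come out as zero. The loop walks every code point, so it takes a moment to run.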

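Use \p{javaWhitespace} to match whitespace according to Java's own definition, that is, the characters for which Character.isWhitespace() returns true. The name of that method is extremely misleading: it excludes the non-breaking spaces (such as \u00a0) and includes the control characters \u001c through \u001f.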
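
As a rough illustration of how Java's own definition differs from the other two classes, the sketch below compares the regex class with `Character.isWhitespace` on a few code points. The class name `JavaWhitespaceDemo` and its helper method are hypothetical; the sample code points are chosen only to show the edge cases.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, used only for this illustration.
class JavaWhitespaceDemo {
    static final Pattern JAVA_WS = Pattern.compile("\\p{javaWhitespace}");

    static boolean regexSaysWhitespace(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return JAVA_WS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Space, file separator, no-break space, figure space,
        // line separator, NEL.
        int[] samples = {0x20, 0x1C, 0xA0, 0x2007, 0x2028, 0x85};
        for (int cp : samples) {
            System.out.printf("U+%04X regex=%b Character.isWhitespace=%b%n",
                    cp, regexSaysWhitespace(cp), Character.isWhitespace(cp));
        }
    }
}
```

The two columns should always agree. The file separator and the line separator should count as whitespace here, while the no-break space, the figure space and NEL should not.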

answered Sep 21 '22 22:09

nhahtdh


Tags:

java

regex

alain.janinm
