How to choose between whitespace patterns?

Tags: java, regex
The Oracle Pattern documentation describes three different patterns for matching whitespace:

  1. \s
  2. \p{Space}
  3. \p{javaWhitespace}

I'm wondering what the specifics of each are and how to choose the right one. So far I've only noticed that \p{javaWhitespace} includes more kinds of space characters.

asked Feb 15 '12 10:02 by alain.janinm


1 Answer

\s is the shortest option, but also the least portable way to specify a whitespace character. The concern is less about porting Java code to other languages, which is rare, and more about carrying knowledge of one regex engine's syntax over to another. Many regex engines use Perl-like syntax, so differences in how the same construct, such as \s, is interpreted confuse programmers.

Apart from space (ASCII 32), newline (\n, ASCII 10), horizontal tab (\t, ASCII 9), carriage return (\r, ASCII 13) and form feed (\f, ASCII 12), there is no consensus among engines on what counts as a whitespace character.

  • Java, POSIX (ASCII): Also includes vertical tab (ASCII 11). Java seems to follow the POSIX standard here.

  • JavaScript (Edition 5.1): According to the specs (word by word), apart from the 5 common ones, it includes:

    • Unicode category Zs (Separator/Space), \u2028 (Line Separator), \u2029 (Paragraph Separator). It basically includes all characters under category Z (Separator).

      Actually \u2028 is the sole member of category Zl (Separator/Line), and \u2029 is the sole member of category Zp (Separator/Paragraph). Given the wording, the current version of the specs may exclude any future additions to those 2 categories.

    • Vertical tab \v
    • Byte-Order Mark a.k.a. ZERO WIDTH NO-BREAK SPACE \ufeff
  • Perl, PCRE (ASCII mode): Vertical tab \v added from Perl 5.18 as an experiment. Before 5.18, it only matches the 5 common ones.

  • Perl (Unicode mode): Apart from the 5 common ones

    • Unicode category Z (Separator)
    • Vertical tab \v added from Perl 5.18 as an experiment.
    • NEXT LINE (NEL) \u0085
    • MONGOLIAN VOWEL SEPARATOR \u180e
  • .NET (default): Apart from the 5 common ones

    • Unicode category Z (Separator)
    • Vertical tab \v
    • NEXT LINE (NEL) \u0085
  • Java (Unicode): Since Java 7, the Pattern class has a flag, UNICODE_CHARACTER_CLASS, which makes the predefined character classes and the POSIX character classes conform to Unicode Technical Standard #18: Unicode Regular Expressions. When the flag is active, a predefined character class and its corresponding POSIX character class become equivalent (they match the same thing).

    The list of characters is the same as .NET's.
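The Java entries in the list above can be checked with a small probe. This is only a sketch: the class name WhitespaceModes, the helper matches and the choice of probe code points are made up for illustration; only Pattern.compile and the UNICODE_CHARACTER_CLASS flag come from the JDK.

```java
import java.util.regex.Pattern;

public class WhitespaceModes {
    // \s with Java's default (ASCII-only) semantics.
    static final Pattern ASCII_S = Pattern.compile("\\s");
    // \s with the Java 7+ UNICODE_CHARACTER_CLASS flag switched on.
    static final Pattern UNICODE_S =
            Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: does the pattern match this single code point?
    static boolean matches(Pattern pattern, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return pattern.matcher(s).matches();
    }

    public static void main(String[] args) {
        // SPACE, VERTICAL TAB, NEXT LINE, NO-BREAK SPACE, LINE SEPARATOR
        int[] probes = {0x0020, 0x000B, 0x0085, 0x00A0, 0x2028};
        for (int cp : probes) {
            System.out.printf("U+%04X  default=%b  unicode=%b%n",
                    cp, matches(ASCII_S, cp), matches(UNICODE_S, cp));
        }
    }
}
```

With the default flags only the space and the vertical tab should be reported as whitespace; with the flag, all five probes should be.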

That is enough to drive one crazy!


\p{Space} is the more "stable" option, since it follows the POSIX standard in default mode and Unicode Technical Standard #18: Unicode Regular Expressions under UNICODE_CHARACTER_CLASS.

If you use a POSIX character class, POSIX-compliant implementations will behave the same way in ASCII mode, and Unicode regex engines that follow the recommendation will behave (almost) the same way in Unicode mode.

\s and \p{Space} are equivalent in Java, regardless of the flag. If you use \s in Java, you can be sure you are following some standard or recommendation. It just does not make that fact obvious to most programmers.
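The equivalence claim can be checked by brute force. The sketch below counts the code points in U+0000..U+FFFF on which two patterns disagree; the class name SpaceEquivalence and the method disagreements are hypothetical names chosen for this example, and the range is limited to the Basic Multilingual Plane to keep the loop simple.

```java
import java.util.regex.Pattern;

public class SpaceEquivalence {
    // Counts the chars in U+0000..U+FFFF that exactly one of the two
    // patterns matches, with both patterns compiled under the same flags.
    static int disagreements(String regexA, String regexB, int flags) {
        Pattern a = Pattern.compile(regexA, flags);
        Pattern b = Pattern.compile(regexB, flags);
        int count = 0;
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            String s = String.valueOf((char) cp);
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("default mode: "
                + disagreements("\\s", "\\p{Space}", 0));
        System.out.println("unicode mode: "
                + disagreements("\\s", "\\p{Space}", Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

If the two classes really are equivalent, both counts should be 0.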


Use \p{javaWhitespace} to match whitespace according to Java's own definition, i.e. whatever Character.isWhitespace() accepts. The name of that method is extremely misleading.
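To see why the name misleads: Character.isWhitespace() rejects the non-breaking spaces (such as U+00A0) but accepts the control characters U+001C to U+001F. A minimal sketch, with the class name JavaWhitespaceProbe and the helper matches invented for this example:

```java
import java.util.regex.Pattern;

public class JavaWhitespaceProbe {
    static final Pattern JAVA_WS = Pattern.compile("\\p{javaWhitespace}");
    static final Pattern POSIX_SPACE = Pattern.compile("\\p{Space}");

    // Hypothetical helper: does the pattern match this single code point?
    static boolean matches(Pattern pattern, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return pattern.matcher(s).matches();
    }

    public static void main(String[] args) {
        // SPACE, FILE SEPARATOR, NO-BREAK SPACE, EM SPACE
        int[] probes = {0x0020, 0x001C, 0x00A0, 0x2003};
        for (int cp : probes) {
            System.out.printf("U+%04X  javaWhitespace=%b  isWhitespace=%b  Space=%b%n",
                    cp,
                    matches(JAVA_WS, cp),
                    Character.isWhitespace(cp),
                    matches(POSIX_SPACE, cp));
        }
    }
}
```

The first two columns should always agree, while \p{Space} in default mode should accept only the plain space among these four probes.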

answered Sep 21 '22 22:09 by nhahtdh