
What are the whitespaces matched by \s in PHP?

Tags: regex, php

What is the complete list of characters matched by the escape sequence \s in PHP? Some regex flavors include vertical space and other characters in this escape sequence.

Stephan asked Mar 29 '11 11:03


3 Answers

From pcrepattern specifications page:

Generic character types

\s     any white space character

For compatibility with Perl, \s did not used to match the VT character (code 11), which made it different from the POSIX "space" class. However, Perl added VT at release 5.18, and PCRE followed suit at release 8.34. The default \s characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space (32), which are defined as white space in the "C" locale. This list may vary if locale-specific matching is taking place. For example, in some locales the "non-breaking space" character (\xA0) is recognized as white space, and in others the VT character is not.

So \s always matches at least 5 characters (HT, LF, FF, CR, and space); whether it also matches VT or other characters depends on:

  1. PCRE library version
  2. Locale setting

This test compares the result of preg_match across various versions of PHP.
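
The dependence on PCRE version and locale can be checked directly on a given installation. The following is a minimal sketch, not part of the quoted specification: the helper name whitespace_bytes is hypothetical, and it only probes single bytes (0 to 255) without the u modifier, so the result reflects whatever PCRE library and locale the running PHP build uses.

```php
<?php
// Hypothetical helper: collect every byte value (0-255) that \s matches
// on this PHP build. No /u modifier, so each byte is tested on its own.
function whitespace_bytes(): array
{
    $matched = [];
    for ($code = 0; $code <= 255; $code++) {
        // The D modifier makes $ match only at the very end of the subject.
        if (preg_match('/^\s$/D', chr($code)) === 1) {
            $matched[] = $code;
        }
    }
    return $matched;
}

// PCRE_VERSION is a predefined PHP constant naming the linked PCRE library.
printf("PCRE version: %s\n", PCRE_VERSION);
echo implode(', ', whitespace_bytes()), "\n";
```

Whether 11 (VT) or 160 (\xA0) appears in the printed list is exactly the version-dependent and locale-dependent part described above, so no fixed output is given here.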

Salman A answered Nov 16 '22 09:11


PHP has \h for horizontal whitespace characters only: http://www.php.net/manual/en/regexp.reference.escape.php

According to http://www.pcre.org/pcre.txt :

For compatibility with Perl, \s does not match the VT character (code 11). This makes it different from the POSIX "space" class. The \s characters are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is included in a Perl script, \s may match the VT character. In PCRE, it never does.

So if "vertical space" refers to the vertical tab, the answer is no in the PCRE version documented here (releases before 8.34).

The sequences \h, \H, \v, and \V are features that were added to Perl
at release 5.10. In contrast to the other sequences, which match only
ASCII characters by default, these always match certain high-valued
codepoints in UTF-8 mode, whether or not PCRE_UCP is set.

The horizontal space characters are:

         U+0009     Horizontal tab
         U+0020     Space
         U+00A0     Non-break space
         U+1680     Ogham space mark
         U+180E     Mongolian vowel separator
         U+2000     En quad
         U+2001     Em quad
         U+2002     En space
         U+2003     Em space
         U+2004     Three-per-em space
         U+2005     Four-per-em space
         U+2006     Six-per-em space
         U+2007     Figure space
         U+2008     Punctuation space
         U+2009     Thin space
         U+200A     Hair space
         U+202F     Narrow no-break space
         U+205F     Medium mathematical space
         U+3000     Ideographic space

The vertical space characters are:

         U+000A     Linefeed
         U+000B     Vertical tab
         U+000C     Formfeed
         U+000D     Carriage return
         U+0085     Next line
         U+2028     Line separator
         U+2029     Paragraph separator
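
To illustrate the two lists above, here is a short sketch that tests a few of the listed code points against \h and \v in UTF-8 mode (the u modifier). The variable names are made up for illustration, and the "\u{...}" string escapes assume PHP 7.0 or later.

```php
<?php
// Sample code points taken from the two lists above.
$nbsp    = "\u{00A0}"; // Non-break space (horizontal list)
$lineSep = "\u{2028}"; // Line separator (vertical list)

// \h should accept horizontal space only, \v vertical space only.
$nbspIsHorizontal = preg_match('/^\h$/u', $nbsp);
$lineSepIsVertical = preg_match('/^\v$/u', $lineSep);
$lfIsHorizontal = preg_match('/^\h$/u', "\n");
$spaceIsVertical = preg_match('/^\v$/u', ' ');

var_dump($nbspIsHorizontal, $lineSepIsVertical, $lfIsHorizontal, $spaceIsVertical);
```

The first two calls test characters from the matching list, the last two test a character from the opposite list.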

Kobi answered Nov 16 '22 08:11


From http://www.pcre.org/pcre.txt:

\s any character that \p{Z} matches, plus HT, LF, FF, CR
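
In pcre.txt this definition belongs to the case where Unicode properties are used for character types (the PCRE_UCP option). As far as I know, PHP turns that option on together with UTF-8 mode when the u modifier is given, on builds whose PCRE library supports it; treat that as an assumption to verify on your own installation. A minimal sketch, with illustrative variable names and PHP 7.0+ string escapes:

```php
<?php
// U+2003 (em space) belongs to the Unicode separator category \p{Z}.
$emSpace = "\u{2003}";

// With /u, \s is expected to follow the \p{Z}-based definition quoted above
// (assumption: this PHP build enables Unicode properties with /u).
$withU = preg_match('/^\s$/u', $emSpace);

// Direct check against the Unicode property itself.
$asSeparator = preg_match('/^\p{Z}$/u', $emSpace);

// Without /u the subject is scanned as three separate bytes, so the
// result depends on the locale tables rather than on Unicode properties.
$withoutU = preg_match('/\s/', $emSpace);

var_dump($withU, $asSeparator, $withoutU);
```

Comparing the first and last values shows whether the u modifier changes what \s matches on the PHP build in use.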

Valentin Jacquemin answered Nov 16 '22 07:11