
Regular expression metacharacters \w and \pL in PHP

Tags: regex, php

In PHP versions 5.3.4 through 5.5.0beta1, are \w and \pL equivalent when the u modifier is used?

<?php
preg_match_all('#\w#u','سیب',$f);
var_dump($f);

preg_match_all('#\pL#u','سیب',$f);
var_dump($f);

array(1) {
  [0]=>
  array(3) {
    [0]=>
    string(2) "س"
    [1]=>
    string(2) "ی"
    [2]=>
    string(2) "ب"
  }
}
array(1) {
  [0]=>
  array(3) {
    [0]=>
    string(2) "س"
    [1]=>
    string(2) "ی"
    [2]=>
    string(2) "ب"
  }
}


Handsome Nerd asked Feb 17 '23 00:02


1 Answer

It looks like when you use the u modifier in PCRE regular expressions, PHP sets the PCRE_UCP flag in addition to the PCRE_UTF8 flag. As a result, \w, the other shorthand classes, and some of the POSIX character classes are classified by Unicode properties instead of matching only the default ASCII characters. From the man page on PCRE:

PCRE_UCP

This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By default, only ASCII characters are recognized, but if PCRE_UCP is set, Unicode properties are used instead to classify characters.
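A minimal sketch of that effect, assuming a PHP build whose PCRE library supports PCRE_UCP and a script running under the default "C" locale. The test string of Arabic-Indic digits is an illustrative example, not part of the question:

```php
<?php
// Arabic-Indic digits U+0661..U+0663; none of them are ASCII.
$digits = '١٢٣';

// With the u modifier, PHP also enables PCRE_UCP, so \d follows \p{Nd}.
$withU = preg_match('#^\d+$#u', $digits);

// Without it, \d only recognizes the ASCII digits 0-9.
$withoutU = preg_match('#^\d+$#', $digits);

var_dump($withU, $withoutU);
```

The same pattern gives different results depending only on the modifier, which is the behaviour the man page describes.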

This is then confirmed in the PHP source code (lines 366-372), where we see this:

        case 'u':   coptions |= PCRE_UTF8;
/* In  PCRE,  by  default, \d, \D, \s, \S, \w, and \W recognize only ASCII
   characters, even in UTF-8 mode. However, this can be changed by setting
   the PCRE_UCP option. */
#ifdef PCRE_UCP
                    coptions |= PCRE_UCP;
#endif

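So \w and \pL are not equivalent under the u modifier: \w matches everything \pL does, plus numbers and the underscore. From the same man page that I linked above, when PCRE_UCP is set, the character classes become: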

\d any character that \p{Nd} matches (decimal digit)

\s any character that \p{Z} matches, plus HT, LF, FF, CR

\w any character that \p{L} or \p{N} matches, plus underscore
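The definitions above suggest that the two patterns only look identical in the question because the test string contains nothing but letters. A small sketch, assuming the same u-modifier behaviour; the sample string is a made-up extension of the question's input:

```php
<?php
// Hypothetical sample: three Persian letters, an underscore,
// and one Arabic-Indic digit (U+0661).
$sample = 'سیب' . '_' . '١';

// \w under PCRE_UCP: \p{L} or \p{N}, plus underscore.
preg_match_all('#\w#u', $sample, $word);

// \pL: letters only.
preg_match_all('#\pL#u', $sample, $letters);

// Characters that \w matched but \pL did not.
$onlyWord = array_values(array_diff($word[0], $letters[0]));

var_dump(count($word[0]), count($letters[0]), $onlyWord);
```

Here \w should pick up all five characters while \pL stops at the three letters, leaving the underscore and the digit as the difference.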

nickb answered Feb 22 '23 22:02