Regular expression metacharacters \w and \pL in PHP

Question

In PHP versions 5.3.4 through 5.5.0beta1, are \w and \pL equivalent when the u modifier is used?

<?php
preg_match_all('#\w#u','سیب',$f);
var_dump($f);

preg_match_all('#\pL#u','سیب',$f);
var_dump($f);

Output:

array(1) {
  [0]=>
  array(3) {
    [0]=>
    string(2) "س"
    [1]=>
    string(2) "ی"
    [2]=>
    string(2) "ب"
  }
}
array(1) {
  [0]=>
  array(3) {
    [0]=>
    string(2) "س"
    [1]=>
    string(2) "ی"
    [2]=>
    string(2) "ب"
  }
}

nickb · Accepted Answer

It looks like when you use the u modifier in a PCRE pattern, PHP sets the PCRE_UCP flag in addition to the PCRE_UTF8 flag. As a result, \w, the other shorthand escapes, and some of the POSIX character classes classify characters by their Unicode properties instead of recognizing only ASCII characters. From the man page on PCRE:

PCRE_UCP

This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By default, only ASCII characters are recognized, but if PCRE_UCP is set, Unicode properties are used instead to classify characters.

This is then confirmed in the PHP source code (lines 366-372), where we see this:

        case 'u':   coptions |= PCRE_UTF8;
/* In  PCRE,  by  default, \d, \D, \s, \S, \w, and \W recognize only ASCII
   characters, even in UTF-8 mode. However, this can be changed by setting
   the PCRE_UCP option. */
#ifdef PCRE_UCP
                    coptions |= PCRE_UCP;
#endif
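
The effect of that flag can be observed from userland by running the same pattern with and without the u modifier. The following is only a minimal sketch: it assumes a PHP build whose bundled PCRE defines PCRE_UCP, and it pins LC_CTYPE to "C" so that the non-UTF-8 case uses PCRE's default ASCII tables. The variable names are illustrative.

```php
<?php
// Pin the ctype locale so the byte-mode match uses PCRE's default ASCII tables.
setlocale(LC_CTYPE, 'C');

$subject = 'سیب';

// Without u: the subject is scanned byte by byte, and none of these
// bytes is an ASCII word character.
$withoutU = preg_match_all('#\w#', $subject, $m1);

// With u: UTF-8 mode, plus Unicode properties for \w when PCRE_UCP is set.
$withU = preg_match_all('#\w#u', $subject, $m2);

var_dump($withoutU, $withU);
```

If PCRE_UCP were not being set, the second call would be expected to find no matches either, since \w would still be limited to ASCII even in UTF-8 mode.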

So, from the same man page that I linked above, you'll see that when PCRE_UCP is set, the character classes become:

\d any character that \p{Nd} matches (decimal digit)

\s any character that \p{Z} matches, plus HT, LF, FF, CR

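\w any character that \p{L} or \p{N} matches, plus underscore

So the two are not equivalent: with the u modifier, \w matches everything that \pL matches, plus any numeric character (\p{N}) and the underscore. They return the same result in your example only because the input consists solely of letters.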
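
To see where \w and \pL diverge, it helps to use a subject that contains more than letters. This is a small sketch under the same assumption as above (the u modifier enables PCRE_UCP on the PHP build in use); the sample string and variable names are made up for illustration.

```php
<?php
// Three Persian letters, an ASCII digit, an Arabic-Indic digit, and an underscore.
$subject = 'سیب1٣_';

preg_match_all('#\w#u', $subject, $word);
preg_match_all('#\pL#u', $subject, $letters);

// Characters that \w matched but \pL did not.
$onlyWord = array_values(array_diff($word[0], $letters[0]));

var_dump(count($word[0]));    // every character in the subject
var_dump(count($letters[0])); // only the letters
var_dump($onlyWord);          // the two digits and the underscore
```

If only letters should be accepted, \pL is the stricter choice; \w is appropriate when digits and the underscore are also acceptable.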

Tags: regex, php

Handsome Nerd
