In PHP versions 5.3.4 through 5.5.0beta1, are \w and \pL equivalent?
<?php
preg_match_all('#\w#u','سیب',$f);
var_dump($f);
preg_match_all('#\pL#u','سیب',$f);
var_dump($f);
array(1) {
[0]=>
array(3) {
[0]=>
string(2) "س"
[1]=>
string(2) "ی"
[2]=>
string(2) "ب"
}
}
array(1) {
[0]=>
array(3) {
[0]=>
string(2) "س"
[1]=>
string(2) "ی"
[2]=>
string(2) "ب"
}
}
It looks like when you use the u modifier in a PCRE regular expression, PHP sets the PCRE_UCP flag in addition to the PCRE_UTF8 flag. As a result, \w and the other shorthand character classes (along with some of the POSIX character classes) are classified by Unicode properties instead of matching only the default ASCII characters. From the PCRE man page:
PCRE_UCP
This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By default, only ASCII characters are recognized, but if PCRE_UCP is set, Unicode properties are used instead to classify characters.
This is then confirmed in the PHP source code (lines 366-372), where we see this:
case 'u': coptions |= PCRE_UTF8;
/* In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
characters, even in UTF-8 mode. However, this can be changed by setting
the PCRE_UCP option. */
#ifdef PCRE_UCP
coptions |= PCRE_UCP;
#endif
So, from the same man page quoted above, you'll see that when PCRE_UCP is set, the character classes become:
\d any character that \p{Nd} matches (decimal digit)
\s any character that \p{Z} matches, plus HT, LF, FF, CR
\w any character that \p{L} or \p{N} matches, plus underscore
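Going by the table above, \w under PCRE_UCP is a superset of \pL rather than an exact equivalent: it also accepts numbers (\p{N}) and the underscore. The two patterns only return the same matches for سیب because that string contains nothing but letters. Here is a minimal sketch of the difference, assuming a PHP build (5.3.4 or later) whose PCRE library supports PCRE_UCP. The test string and variable names are made up for illustration.

<?php
// Hypothetical subject: three Persian letters followed by an underscore and two digits.
$subject = 'سیب' . '_42';

// With the u modifier, \w should match letters, numbers and the underscore.
preg_match_all('#\w#u', $subject, $word);

// \pL matches letters only.
preg_match_all('#\pL#u', $subject, $letters);

// Characters matched by \w but not by \pL.
$onlyInWord = array_values(array_diff($word[0], $letters[0]));

var_dump(count($word[0]), count($letters[0]), $onlyInWord);

If your input can contain digits or underscores and you only want letters, \pL is the safer choice.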