According to the PHP manual, the u modifier of PCRE regular expressions enables UTF-8 support for both the pattern and the subject string.
Considering this, is there any difference between using PCRE expressions with the u modifier and the corresponding mb_* multibyte string functions? (Assuming that all strings are UTF-8 encoded.)
As an example, consider preg_split vs mb_split. Both
preg_split('/' . $pattern . '/u', $string);
and
mb_split($pattern, $string);
seem to return identical results. So, which one should be preferred? Does it even matter?
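A quick way to check this observation is sketched below. The sample string and the pattern are made up for illustration, and the snippet assumes the mbstring extension is loaded and the file is saved as UTF-8.

```php
<?php
// Assumption: mbstring is loaded and this source file is UTF-8 encoded.
mb_regex_encoding('UTF-8'); // make the encoding used by mb_split explicit

$string  = 'äpfel , öl,übung';   // hypothetical sample input
$pattern = '\s*,\s*';            // contains no "/", so it can be wrapped as-is

$viaPreg = preg_split('/' . $pattern . '/u', $string);
$viaMb   = mb_split($pattern, $string);

// Both arrays hold the same three words for this input.
var_dump($viaPreg === $viaMb);
```

Identical results for one simple pattern do not show that the two engines agree on every pattern.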
The main difference is that the preg_ functions use the PCRE library, whereas the mb_ereg_ functions (including mb_split) use the Oniguruma library (used in Ruby before version 2.0).
The main reason for this choice is that Oniguruma can deal with multiple encodings (ASCII, UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE, EUC-JP, EUC-TW, EUC-KR, EUC-CN, Shift_JIS, Big5, GB18030, KOI8-R, CP1251, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14, ISO-8859-15, ISO-8859-16), whereas PCRE can't (in PHP, the preg_ functions only handle byte strings and, with the u modifier, UTF-8).
Note that a lot of the encodings available for mb_ functions like mb_detect_encoding are not in this list (UTF-7, ArmSCII-8 and CP866, for example). This limits the relevance of the mb_ereg_ functions, since you need to convert the string to a supported encoding before working on it and convert it back afterwards.
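The convert, work, convert back round trip could look like the sketch below. CP866 is used only as an example of an encoding that is missing from the list above, and the sample text and variable names are made up for illustration. The snippet assumes mbstring is loaded and the file is saved as UTF-8.

```php
<?php
// Assumption: mbstring is loaded; CP866 stands in for any encoding that
// mbstring can convert but that is missing from the list above.
$utf8  = 'привет мир';                                  // hypothetical sample text
$cp866 = mb_convert_encoding($utf8, 'CP866', 'UTF-8');  // stand-in for real CP866 input

// 1. Convert to an encoding the regex engine supports.
$asUtf8 = mb_convert_encoding($cp866, 'UTF-8', 'CP866');

// 2. Work on the converted string.
mb_regex_encoding('UTF-8');
$parts = mb_split('\s+', $asUtf8);

// 3. Convert each piece back to the original encoding.
$partsCp866 = array_map(
    function ($part) {
        return mb_convert_encoding($part, 'CP866', 'UTF-8');
    },
    $parts
);
```

Once the string is UTF-8 anyway, step 2 could just as well use preg_split with the u modifier.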
The two regex engines share more or less the same features; nevertheless, there are some differences (a non-exhaustive list, in no particular order):
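Oniguruma doesn't support:
- the short form of Unicode properties: \pN is seen as pN, you need to write \p{N}
- unescaped brackets as literals inside a character class: it reads [][] as two empty character classes, whereas PCRE sees one character class that contains ] and [
- the \K feature
- the \R alias for newline sequences
- the Python-style named group syntax (?P<name>...). Only (?<name>...) or (?'name'...) are allowed.
- Perl-style subpattern calls: you must write \g<name> (the Perl syntax (?&name), (?1) or (?R) is not allowed).

PCRE doesn't support:
- duplicate group names by default: you need the (?J) modifier to switch on this feature.
- numbered backreferences with the \k<...> syntax. You can write \k<name> but not \k<1> or \k<-1>.
- backreferences with a nest level, \k<name+n>, where n is the nest level.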
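As a small illustration of the first difference, the sketch below uses the short Unicode property form only with PCRE and the braced form with both engines. The subject string is made up, and the snippet assumes mbstring is loaded and PCRE was built with Unicode property support.

```php
<?php
// Assumption: mbstring is loaded and PCRE has Unicode property support.
mb_regex_encoding('UTF-8');
$subject = 'abc123'; // hypothetical sample input

// PCRE accepts the short form \pN as well as the braced form \p{N}.
preg_match('/\pN+/u', $subject, $short);
preg_match('/\p{N}+/u', $subject, $braced);

// For mb_ereg, stick to the braced form, which both engines understand.
mb_ereg('\p{N}+', $subject, $mb);

var_dump($short[0], $braced[0], $mb[0]);
```

Writing patterns in the spelling both engines accept makes it easier to move a pattern between preg_ and mb_ereg_ functions later.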
To match newlines with the dot, Oniguruma uses the m modifier, whereas PCRE uses the s modifier.
In the mb_ereg_ functions, the dot matches newlines by default (so the m modifier is on by default).
PCRE uses the s modifier to make the dot match newlines. The m modifier behaves differently in PCRE: it changes the meaning of the ^ and $ anchors from "start" and "end" of the string to "start" and "end" of the line.
With Oniguruma, the meaning of these anchors doesn't change: they always match the start and end of a line. To match the limits of the string, use \A and \z, which are also available in PCRE.
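The modifier behaviour described above can be sketched as follows. The subject string is made up, and the snippet assumes mbstring is loaded. The m option is passed to mb_ereg_match explicitly so that the result does not depend on the default options.

```php
<?php
$subject = "a\nb"; // hypothetical two-line input

// PCRE: the dot skips newlines unless the s modifier is set.
$dotPlain = preg_match('/a.b/', $subject);   // 0
$dotS     = preg_match('/a.b/s', $subject);  // 1

// PCRE: the m modifier turns ^ and $ into line anchors.
$lineAnchorsOff = preg_match('/^b$/', $subject);   // 0
$lineAnchorsOn  = preg_match('/^b$/m', $subject);  // 1

// Oniguruma via mbstring: the m option lets the dot match newlines.
mb_regex_encoding('UTF-8');
$mbDotM = mb_ereg_match('a.b', $subject, 'm');

var_dump($dotPlain, $dotS, $lineAnchorsOff, $lineAnchorsOn, $mbDotM);
```

When the same pattern has to behave identically in both engines, \A and \z are the safer anchors for the limits of the string.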
Note that Oniguruma has been forked to give Onigmo (used in current Ruby versions), which implements more Perl features and syntactic elements and is more similar to PCRE.
As long as you're working strictly with UTF-8 you will be fine with either. If you were using another charset, then it would be recommended to use mb_split(), since the u modifier with PCRE does not allow you to specify the charset, instead treating the strings as UTF-8.
With regard to scaling and long-term viability, I would recommend using mb_split() from the start so that you are covered in case something other than UTF-8 is used or needed down the road.