
preg_split vs mb_split

According to the PHP manual, the u modifier of PCRE regular expressions enables UTF-8 support for both the pattern and the subject string.

Considering this, is there any difference between using PCRE expressions with the u modifier and the corresponding mb_* multibyte string functions? (Assuming that all strings are UTF-8 encoded.)


As an example, consider preg_split vs mb_split: Both

preg_split('/' . $pattern . '/u', $string);

and

mb_split($pattern, $string);

seem to return identical results. So, which one should be preferred? Does it even matter?
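A minimal sketch of that comparison, assuming the mbstring extension is loaded. The delimiter pattern and the sample string are made up for illustration and are not from the question.

```php
<?php
// Assumption: mbstring is available and all input is UTF-8.
mb_regex_encoding('UTF-8');

// Hypothetical delimiter: an ideographic comma with optional whitespace around it.
$pattern = '\s*、\s*';
$string  = 'りんご、 みかん 、ぶどう';

$viaPcre = preg_split('/' . $pattern . '/u', $string); // PCRE with the u modifier
$viaMb   = mb_split($pattern, $string);                // Oniguruma via mbstring

// For this simple UTF-8 case both calls produce the same three pieces.
var_dump($viaPcre === $viaMb);
```

Note that mb_split takes the bare pattern, without delimiters or trailing modifiers.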

emkey08 asked Mar 20 '16 15:03


2 Answers

The main difference is that the preg_ functions use the PCRE library, whereas the mb_ereg_ functions (including mb_split) use the Oniguruma library (used by Ruby before version 2.0).

The main reason for using a different library is that Oniguruma can deal with multiple encodings (ASCII, UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE, EUC-JP, EUC-TW, EUC-KR, EUC-CN, Shift_JIS, Big5, GB18030, KOI8-R, CP1251, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14, ISO-8859-15, ISO-8859-16), whereas PCRE can't.
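As a rough illustration of the encoding point, the sketch below splits an ISO-8859-1 string. It assumes mbstring is loaded; the sample bytes are invented for the example.

```php
<?php
// "café;thé" encoded as ISO-8859-1: \xE9 is é, which is not valid UTF-8 on its own.
$latin1 = "caf\xE9;th\xE9";

// With the u modifier PCRE insists on valid UTF-8, so the call fails.
$viaPcre   = preg_split('/;/u', $latin1);
$pcreError = preg_last_error();

// mbstring lets you tell the regex engine which encoding to expect.
$previous = mb_regex_encoding();      // remember the current setting
mb_regex_encoding('ISO-8859-1');
$viaMb = mb_split(';', $latin1);
mb_regex_encoding($previous);         // restore it afterwards

var_dump($viaPcre);                   // false
var_dump($pcreError === PREG_BAD_UTF8_ERROR);
var_dump(count($viaMb));              // 2
```

Dropping the u modifier would also let PCRE split this particular string byte by byte, but then character classes and case-insensitivity no longer know about é.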

Note that many encodings available to other mb_ functions such as mb_detect_encoding are not in this list (UTF-7, ArmSCII-8 and CP866, for example), which limits the relevance of the mb_ereg_ functions: you need to convert the string to a supported encoding before working on it, and convert it back afterwards.
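The convert, split, convert-back round trip could look roughly like this. It is only a sketch: it assumes mbstring is loaded and that this file is saved as UTF-8, and the CP866 input is produced artificially for the demonstration.

```php
<?php
// Build a CP866 string for the demo (in real code it would come from outside).
$cp866 = mb_convert_encoding('да;нет', 'CP866', 'UTF-8');

// 1. Convert to an encoding the regex engine supports.
$utf8 = mb_convert_encoding($cp866, 'UTF-8', 'CP866');

// 2. Work on it with the mb_ereg_ family.
mb_regex_encoding('UTF-8');
$parts = mb_split(';', $utf8);

// 3. Convert every piece back to the original encoding.
$partsCp866 = array_map(
    function ($piece) {
        return mb_convert_encoding($piece, 'CP866', 'UTF-8');
    },
    $parts
);

var_dump(implode(';', $partsCp866) === $cp866);
```

Once the string is UTF-8 anyway, preg_split with the u modifier would work for step 2 as well.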

The two regex engines share more or less the same features, but there are some differences (the lists below are not exhaustive and are in no particular order):

Oniguruma doesn't support:

  • one-letter Unicode shorthand character classes written without curly brackets.
    Example: \pN is seen as pN; you need to write \p{N}
  • the Unicode character classes Xan, Xps, Xsp, Xwd
  • unescaped square brackets in a character class: Oniguruma sees [][] as two empty character classes, whereas PCRE sees one character class that contains ] and [
  • the \K feature
  • the \R alias for newline sequences
  • named groups that use the Python syntax (?P<name>...). Only (?<name>...) or (?'name'...) are allowed.
  • group references in anything other than the Oniguruma syntax \g<name> (the Perl syntaxes (?&name), (?1) and (?R) are not allowed).
  • backtracking control verbs
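For reference, here is how a few of the PCRE-side spellings from the list above look with the preg_ functions. The subjects are made-up sample data, and the sketch deliberately does not run the same patterns through mb_ereg_, since what Oniguruma accepts can depend on the library version bundled with PHP.

```php
<?php
// \pN without curly brackets (needs the u modifier for Unicode properties).
$digits = preg_match('/^\pN+$/u', '2016');

// \R matches any newline sequence, including \r\n as a single unit.
$lines = preg_split('/\R/u', "a\r\nb\nc");

// \K drops everything matched so far from the reported match.
$replaced = preg_replace('/foo\Kbar/', 'X', 'foobar');

// Python-style named groups.
preg_match('/(?P<year>\d{4})-(?P<month>\d{2})/', '2016-03', $m);

// Perl-style recursion into group 1 with (?1): balanced parentheses.
$balanced = preg_match('/^(\((?:[^()]|(?1))*\))$/', '(a(b)c)');

var_dump($digits, $lines, $replaced, $m['year'], $m['month'], $balanced);
```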

PCRE doesn't support:

  • duplicated named groups (by default). You need to use the (?J) modifier to switch on this feature.
  • numbered back-references with \k<...> syntax. You can write \k<name> but not \k<1> or \k<-1>.
  • back-references to a specific nest level. Oniguruma is able to do that using \k<name+n> where n is the nest level.
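The first two PCRE points can be checked quickly with preg_match. This is a small sketch with invented patterns; it only inspects return values, and the @ operator is there to silence the compile warning PHP raises for the rejected pattern.

```php
<?php
$subject = 'b';

// Two groups named "n": PCRE refuses to compile this by default.
$withoutJ = @preg_match('/(?<n>a)|(?<n>b)/', $subject);

// The same pattern with (?J) switched on compiles and matches.
$withJ = preg_match('/(?J)(?<n>a)|(?<n>b)/', $subject);

// A named back-reference with \k<name> is fine in PCRE.
$doubled = preg_match('/(?<ch>\w)\k<ch>/', 'hello');

var_dump($withoutJ, $withJ, $doubled);
```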


To make the dot match newlines, Oniguruma uses the m modifier, whereas PCRE uses the s modifier. In the mb_ereg_ functions, the dot matches newlines by default (so the m modifier is on by default).

In PCRE, the m modifier behaves differently: it changes the meaning of the ^ and $ anchors from "start" and "end" of the string to "start" and "end" of the line.

With Oniguruma, the meaning of these anchors doesn't change: they always match the start and end of a line. To match the boundaries of the string, use \A and \z, which are also available in PCRE.
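The PCRE half of this can be seen in a few lines. The subject string below is invented, and the sketch covers only the preg_ functions, not the mb_ereg_ defaults described above.

```php
<?php
$text = "one\ntwo";

// The dot does not match a newline unless the s modifier is set.
$dotDefault = preg_match('/one.two/', $text);
$dotWithS   = preg_match('/one.two/s', $text);

// Without m, ^ and $ anchor to the whole string; with m, to each line.
$anchorsDefault = preg_match_all('/^\w+$/', $text);
$anchorsWithM   = preg_match_all('/^\w+$/m', $text);

// \A and \z always mean start and end of the string, even with m.
$wholeString = preg_match('/\A\w+\z/m', $text);

var_dump($dotDefault, $dotWithS, $anchorsDefault, $anchorsWithM, $wholeString);
```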

Note that Oniguruma has been forked as Onigmo (used in current Ruby versions), which implements more Perl features and syntactic elements and is more similar to PCRE.

Casimir et Hippolyte answered Oct 01 '22 21:10


As long as you're working strictly with UTF-8 you will be fine with either. If you were using another charset then it would be recommended to use mb_split() since the u modifier with PCRE does not allow you to specify the charset, instead treating the strings as UTF-8.

With regard to scaling and long-term viability, I would recommend using mb_split() from the start, so that you are covered in case something other than UTF-8 is used or needed down the road.

Cale W. Vernon answered Oct 01 '22 20:10