
Convert a String into an Array of Characters - multi-byte

Assuming that in 2019 every solution that is not Unicode-safe is wrong, what is the best way to convert a string to an array of Unicode characters in PHP?

Obviously this means that accessing the bytes with the brace syntax is wrong, as well as using str_split:

$arr = str_split($text);
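As a small illustration of why the byte-based approach fails, the sketch below splits a single accented letter with str_split(). It assumes the string is UTF-8 encoded; the variable name is only illustrative.

```php
<?php
// "\u{00E9}" is é (U+00E9), which UTF-8 encodes as the two bytes 0xC3 0xA9.
$bytes = str_split("\u{00E9}");

// str_split() works on bytes, so one character comes back as two broken pieces.
var_dump(count($bytes));        // int(2)
var_dump(bin2hex($bytes[0]));   // string(2) "c3"
var_dump(bin2hex($bytes[1]));   // string(2) "a9"
```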

From sample input like:

$string = '先éé€𐍈💩👩‍ 👩‍❤️‍👩';

I expect:

array(16) {
  [0]=>
  string(3) "先"
  [1]=>
  string(2) "é"
  [2]=>
  string(1) "e"
  [3]=>
  string(2) "́"
  [4]=>
  string(3) "€"
  [5]=>
  string(4) "𐍈"
  [6]=>
  string(4) "💩"
  [7]=>
  string(4) "👩"
  [8]=>
  string(3) "‍"
  [9]=>
  string(1) " "
  [10]=>
  string(4) "👩"
  [11]=>
  string(3) "‍"
  [12]=>
  string(3) "❤"
  [13]=>
  string(3) "️"
  [14]=>
  string(3) "‍"
  [15]=>
  string(4) "👩"
}
asked Oct 27 '25 15:10 by Dharman


1 Answer

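Pass an empty pattern with the u modifier and the PREG_SPLIT_NO_EMPTY flag; this splits the string between every code point, which is what the question asks for. Alternatively, you can write a pattern with \X (the Unicode "dot", which matches a whole grapheme) and \K (which resets the start of the reported match). Note that \X keeps combining sequences together, so it only gives the same result as the empty pattern when every grapheme is a single code point, as in the sample input used here. I'll include an mb_split() call and a preg_match_all() call for completeness.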

Code: (Demo)

$string='先秦兩漢';
var_export(preg_split('~~u', $string, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_split('~\X\K~u', $string, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_split('~\X\K(?!$)~u', $string));
echo "\n---\n";
var_export(mb_split('\X\K(?!$)', $string));
echo "\n---\n";
var_export(preg_match_all('~\X~u', $string, $out) ? $out[0] : []);

All produce:

array (
  0 => '先',
  1 => '秦',
  2 => '兩',
  3 => '漢',
)

From https://www.regular-expressions.info/unicode.html:

How to Match a Single Unicode Grapheme

Matching a single grapheme, whether it's encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, Boost, Ruby 2.0, Java 9, and the Just Great Software applications: simply use \X.

You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.
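To make the difference between the empty pattern and \X concrete, here is a minimal sketch using a decomposed accented letter. It assumes UTF-8 input and a PCRE build with Unicode support; the variable names are illustrative only.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT, then "a":
// three code points, but only two user-perceived characters (graphemes).
$text = "e\u{0301}a";

// Empty pattern with the u modifier: split between every code point.
$codePoints = preg_split('~~u', $text, -1, PREG_SPLIT_NO_EMPTY);

// \X: match one whole grapheme at a time, keeping the accent with its base letter.
$graphemes = preg_match_all('~\X~u', $text, $m) ? $m[0] : [];

var_dump(count($codePoints)); // int(3)
var_dump(count($graphemes));  // int(2)
```

Which of the two is "right" depends on whether you want code points (as in the question's expected output) or graphemes.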


UPDATE: Dharman has brought to my attention that mb_str_split() is available from PHP 7.4.

The default length parameter of the new function is 1, so the length parameter can be omitted for this case.

https://wiki.php.net/rfc/mb_str_split

Dharman's demo: https://3v4l.org/M85Fi/rfc#output
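For code that may still run on versions older than PHP 7.4, one possible sketch is to prefer mb_str_split() and fall back to the empty-pattern preg_split() call. The helper name utf8_chars() is hypothetical (it is not part of PHP), and UTF-8 input is assumed.

```php
<?php
// Hypothetical helper: returns the code points of a UTF-8 string as an array.
function utf8_chars(string $text): array
{
    // mb_str_split() exists from PHP 7.4 when the mbstring extension is loaded.
    if (function_exists('mb_str_split')) {
        return mb_str_split($text, 1, 'UTF-8');
    }

    // Fallback: split between code points; preg_split() returns false on
    // invalid UTF-8, so normalise that case to an empty array.
    return preg_split('~~u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

$chars = utf8_chars('先秦兩漢');
var_export($chars);
```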


UPDATE (2024-04-10):

The RFC for grapheme_str_split() has passed unanimously and is proposed for inclusion in PHP 8.4. This provides a clean, native solution that preserves bound multi-byte "clusters" (such as emoji joined by zero-width joiners or followed by variation selectors).

$string = '🙇‍♂️';
var_export(grapheme_str_split($string)); // ['🙇‍♂️']

Here is what the result would be if the cluster were not held together (that is, if the string were split into individual code points):

[
    '🙇',
    '‍',   // U+200D Zero Width Joiner
    '♂',
    '️',   // U+FE0F Variation Selector-16
]

I'll add a 3v4l.org demo when possible.
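Until grapheme_str_split() is available everywhere, a possible sketch is to use it when present and fall back to the \X pattern otherwise. The helper name grapheme_chunks() is hypothetical, and UTF-8 input is assumed. PCRE and ICU may not agree on every emoji sequence, depending on the library versions in use, so the example sticks to a combining accent.

```php
<?php
// Hypothetical helper: returns the graphemes of a UTF-8 string as an array.
function grapheme_chunks(string $text): array
{
    // grapheme_str_split() is provided by the intl extension (proposed for PHP 8.4).
    if (function_exists('grapheme_str_split')) {
        $parts = grapheme_str_split($text);
        return $parts === false ? [] : $parts;
    }

    // Fallback: \X matches one extended grapheme cluster per match.
    return preg_match_all('~\X~u', $text, $m) ? $m[0] : [];
}

// "e" + U+0301 forms one grapheme; "a" is the second.
$parts = grapheme_chunks("e\u{0301}a");
var_dump(count($parts)); // int(2)
```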

answered Oct 30 '25 06:10 by mickmackusa


