In PHP, what is the best way to split a string into an array of Unicode characters? What if the input is not necessarily UTF-8? I want to know whether the set of Unicode characters in an input string is a subset of another set of Unicode characters. Why not run straight for the mb_ family of functions, as the first couple of answers didn't?


What is the best way to split a string into an array of Unicode characters in PHP?

1 Answer

You could use the 'u' modifier with a PCRE regex; see Pattern Modifiers (quoting):

u (PCRE8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.

For instance, consider this code:

header('Content-type: text/html; charset=UTF-8');  // So the browser doesn't make our lives harder
$str = "abc 文字化け, efg";

$results = array();
preg_match_all('/./', $str, $results);
var_dump($results[0]);

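You'll get an unusable result, because without the 'u' modifier the dot matches one byte at a time, so each three-byte character is split into three invalid fragments: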

array
  0 => string 'a' (length=1)
  1 => string 'b' (length=1)
  2 => string 'c' (length=1)
  3 => string ' ' (length=1)
  4 => string '�' (length=1)
  5 => string '�' (length=1)
  6 => string '�' (length=1)
  7 => string '�' (length=1)
  8 => string '�' (length=1)
  9 => string '�' (length=1)
  10 => string '�' (length=1)
  11 => string '�' (length=1)
  12 => string '�' (length=1)
  13 => string '�' (length=1)
  14 => string '�' (length=1)
  15 => string '�' (length=1)
  16 => string ',' (length=1)
  17 => string ' ' (length=1)
  18 => string 'e' (length=1)
  19 => string 'f' (length=1)
  20 => string 'g' (length=1)
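
To see where those twelve broken entries come from, it can help to look at the raw bytes of a single character. This is only a minimal sketch, assuming the string is stored as UTF-8; the byte escapes spell out 文 explicitly so the encoding of the source file doesn't matter:

```php
<?php
// "\xE6\x96\x87" is the UTF-8 encoding of 文 (U+6587): one character, three bytes.
$char = "\xE6\x96\x87";

// Without the 'u' modifier, '.' matches one byte at a time.
preg_match_all('/./', $char, $bytes);

// With the 'u' modifier, '.' matches one whole UTF-8 character.
preg_match_all('/./u', $char, $chars);

echo strlen($char), "\n";                                  // 3
echo implode(' ', array_map('bin2hex', $bytes[0])), "\n";  // e6 96 87
echo count($chars[0]), "\n";                               // 1
```

Four such characters times three bytes each gives the twelve unreadable entries in the dump above.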

But with this code:

header('Content-type: text/html; charset=UTF-8');  // So the browser doesn't make our lives harder
$str = "abc 文字化け, efg";

$results = array();
preg_match_all('/./u', $str, $results);
var_dump($results[0]);

(Notice the 'u' at the end of the regex)

You get what you want:

array
  0 => string 'a' (length=1)
  1 => string 'b' (length=1)
  2 => string 'c' (length=1)
  3 => string ' ' (length=1)
  4 => string '文' (length=3)
  5 => string '字' (length=3)
  6 => string '化' (length=3)
  7 => string 'け' (length=3)
  8 => string ',' (length=1)
  9 => string ' ' (length=1)
  10 => string 'e' (length=1)
  11 => string 'f' (length=1)
  12 => string 'g' (length=1)
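
For the subset check and the "not necessarily UTF-8" part of the question, something along these lines might work. It is a sketch under stated assumptions, not a definitive implementation: split_chars() and is_char_subset() are hypothetical names made up for this example, the caller is assumed to know the source encoding (PHP cannot reliably guess it), and the conversion step assumes the mbstring extension is loaded. Note that "character" here means a Unicode code point, so a base letter followed by a combining mark comes out as two entries.

```php
<?php
// Hypothetical helper: split a string into an array of Unicode code points.
// $encoding is the encoding the caller declares for $str (an assumption of this sketch).
function split_chars(string $str, string $encoding = 'UTF-8'): array
{
    if ($encoding !== 'UTF-8') {
        // The 'u' modifier only understands UTF-8, so convert first (needs mbstring).
        $str = mb_convert_encoding($str, 'UTF-8', $encoding);
    }

    // An empty pattern with 'u' splits between every character.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // preg_split() returns false when the subject is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $chars;
}

// Hypothetical helper: true when every character of $input also occurs in $allowed.
function is_char_subset(string $input, string $allowed): bool
{
    return array_diff(split_chars($input), split_chars($allowed)) === [];
}

var_dump(is_char_subset('化け', '文字化け')); // bool(true)
var_dump(is_char_subset('abc', '文字化け'));  // bool(false)
```

Throwing on invalid input is a design choice of this sketch; silently returning the byte-split array would hide exactly the kind of garbage shown in the first dump.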

Hope this helps :-)

133

answered Sep 28 '22 09:09

Pascal MARTIN


Tags:

arrays

php

split

unicode

joeforker
