
How to get each character from a word with special encoding

I need to get an array with all the characters from a word, but the word contains letters with special encoding, like á. When I execute the following code:

$word = 'withá';

$word_arr = array();
for ($i=0;$i<strlen($word);$i++) {
    $word_arr[] = $word[$i];
}

or

$word_arr = str_split($word);

I get:

array(6) { [0]=> string(1) "w" [1]=> string(1) "i" [2]=> string(1) "t" [3]=> string(1) "h" [4]=> string(1) "Ã" [5]=> string(1) "¡" }

How can I obtain each character as follows?

array(5) { [0]=> string(1) "w" [1]=> string(1) "i" [2]=> string(1) "t" [3]=> string(1) "h" [4]=> string(1) "á" }

leticia asked Nov 21 '12 20:11


2 Answers

Because it is a UTF-8 string, just do

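$word = 'withá';
// utf8_decode() converts UTF-8 to ISO-8859-1, where "á" is a single byte.
$word = utf8_decode($word);
$word_arr = array();
// strlen() and $word[$i] work on bytes, which now map one-to-one to characters.
for ($i = 0; $i < strlen($word); $i++) {
    $word_arr[] = $word[$i];
}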

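The reason is that your script file is saved as UTF-8, so á is stored as two bytes even though it looks like one character in your editor. strlen(), str_split() and $word[$i] all work on bytes, which is why that character gets split in two. utf8_decode() converts the string to ISO-8859-1, where every character is exactly one byte, so the byte-based loop works again. Note that the resulting characters are then ISO-8859-1 rather than UTF-8, that this only works for characters that exist in ISO-8859-1, and that utf8_decode() is deprecated as of PHP 8.2. Alternatively, you can use the mb functions, which operate on characters instead of bytes.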
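To make the byte versus character difference concrete, here is a minimal sketch that keeps the string in UTF-8 and walks it by character. It assumes the mbstring extension is loaded; the word is written with explicit byte escapes so the example does not depend on how the file itself is saved.

```php
<?php
// "with" followed by á encoded as UTF-8 (bytes 0xC3 0xA1).
$word = "with\xC3\xA1";

// strlen() counts bytes, so the two-byte á is counted twice.
$byteCount = strlen($word);               // 6

// mb_strlen() counts characters for the given encoding.
$charCount = mb_strlen($word, 'UTF-8');   // 5

// mb_substr() indexes by character, so each iteration yields one whole character.
$word_arr = array();
for ($i = 0; $i < $charCount; $i++) {
    $word_arr[] = mb_substr($word, $i, 1, 'UTF-8');
}

var_dump($byteCount, $charCount, $word_arr);
```

Unlike the utf8_decode() approach, the array elements here stay in UTF-8, so they can be echoed on a UTF-8 page without converting back.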

Tim Withers answered Nov 14 '22 22:11


I think mb_split will do it for you: http://www.php.net/manual/en/function.mb-split.php

If you're using special encodings, you probably want to read up on how PHP handles multibyte encoding in general...

EDIT: Nope, I can't figure out how to make mb_split do it myself, but looking around SO I found some other questions that were answered with preg_split. I tested this and it seems to do exactly what you want:

preg_split('//u',$word,-1,PREG_SPLIT_NO_EMPTY);
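For this to split on characters rather than bytes, the pattern needs the u modifier, which makes PCRE treat the subject as UTF-8. A small sketch of how that could be wrapped up follows; the helper name split_utf8_chars is made up for illustration and is not a PHP built-in.

```php
<?php
// Hypothetical helper name, used only for this example.
function split_utf8_chars($word)
{
    // With the "u" modifier the empty pattern matches between UTF-8
    // characters; without it, it would match between individual bytes.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, e.g. when $word is not valid UTF-8.
    return $chars === false ? array() : $chars;
}

// "with" followed by á encoded as UTF-8 (bytes 0xC3 0xA1).
var_dump(split_utf8_chars("with\xC3\xA1"));
```

On PHP 7.4 and later, mb_str_split($word, 1, 'UTF-8') should give the same result without a regular expression, provided the mbstring extension is available.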

I'd still strongly suggest you read up on multibyte characters in PHP though. It's kind of a mess, IMHO.

Here are some good links: http://www.joelonsoftware.com/articles/Unicode.html and http://akrabat.com/php/utf8-php-and-mysql/ and plenty more can be found.

Aerik answered Nov 14 '22 22:11