Wrong output when using array indexing on UTF-8 string

I have encountered a problem when using a UTF-8 string. I want to read a single character from the string, for example:

$string = "üÜöÖäÄ";
echo $string[0];

I am expecting to see ü, but I get � -- why?

bozd asked Jun 11 '11 11:06

1 Answer

Use mb_substr($string, 0, 1, 'utf-8') to get the character instead.

What happens in your code is that the expression $string[0] gets the first byte of the UTF-8 encoded representation of your string because PHP strings are effectively arrays of bytes (PHP does not internally recognize encodings).

Since the first character in your string is encoded as more than one byte (per the UTF-8 encoding rules), you are effectively only getting part of the character. Furthermore, these rules make the byte you are retrieving invalid as a character on its own, which is why you see the replacement character (�) instead.
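To make the byte-level picture concrete, here is a small sketch that dumps the bytes involved. It assumes the source file is saved as UTF-8 with precomposed characters (so ü is the single code point U+00FC) and that the mbstring extension is loaded; neither is stated in the question.

```php
<?php
// Assumption: this file is saved as UTF-8, so the literal holds UTF-8 bytes.
$string = "üÜöÖäÄ";

// ü is U+00FC, which UTF-8 encodes as the two bytes 0xC3 0xBC.
echo bin2hex($string[0]), "\n";            // first byte only: c3
echo bin2hex(substr($string, 0, 2)), "\n"; // both bytes of ü: c3bc

// Six characters, two bytes each.
echo strlen($string), "\n";                // 12
echo mb_strlen($string, 'UTF-8'), "\n";    // 6

// A lead byte on its own is not a valid UTF-8 sequence.
var_dump(mb_check_encoding($string[0], 'UTF-8')); // bool(false)
```

The lone 0xC3 byte announces a two-byte sequence but has no continuation byte after it, so a browser or terminal has nothing valid to render.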

mb_substr knows the encoding rules, so it will not naively give you back just one byte; it will get as many as needed to encode the first character.

You can see that $string[0] gives you back just one byte with:

$string = "üÜöÖäÄ";
echo strlen($string[0]);

While mb_substr gives you back two bytes:

$string = "üÜöÖäÄ";
echo strlen(mb_substr($string, 0, 1, 'utf-8'));

And these two bytes are in fact just one character (you need to use mb_strlen for this):

$string = "üÜöÖäÄ";
echo mb_strlen(mb_substr($string, 0, 1, 'utf-8'), 'utf-8');

Finally, as Marwelln points out below, the situation becomes more tolerable if you use mb_internal_encoding to get rid of the 'utf-8' redundancy:

$string = "üÜöÖäÄ";
mb_internal_encoding('utf-8');
echo mb_strlen(mb_substr($string, 0, 1));
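If what you want is array-style indexing by character, one option is to split the string into an array of characters first. The sketch below is not part of the original answer: utf8_chars is a hypothetical helper name, and it again assumes the source file is saved as UTF-8. It relies on the "u" modifier of preg_split; on PHP 7.4 or later, mb_str_split($string) should give the same result.

```php
<?php
// Hypothetical helper (not from the answer): split a UTF-8 string
// into an array with one element per character (code point).
function utf8_chars(string $s): array
{
    // The "u" modifier makes PCRE treat the subject as UTF-8, so the empty
    // pattern matches between code points rather than between bytes.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on failure, e.g. for invalid UTF-8 input.
    return $chars === false ? [] : $chars;
}

// Assumption: this file is saved as UTF-8.
$string = "üÜöÖäÄ";
$chars = utf8_chars($string);

echo $chars[0], "\n";     // ü
echo count($chars), "\n"; // 6
```

Note that this splits by code point, so a character written as a base letter plus a combining mark would still come back as two elements.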


Jon answered Sep 18 '22 17:09