I have encountered a problem when using a UTF-8 string. I want to read a single character from the string, for example:
$string = "üÜöÖäÄ";
echo $string[0];
I am expecting to see ü
, but I get � -- why?
UTF-8 actually works quite well in std::string . Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.
UTF-8 is the only text encoding mandated to be supported by the C++ standard for which there is no distinct code unit type. Lack of a distinct type for UTF-8 encoded character and string literals prevents the use of overloading and template specialization in interfaces designed for interoperability with encoded text.
UTF-8 is a variable-width character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit. UTF-8. Standard. Unicode Standard.
Use mb_substr($string, 0, 1, 'utf-8')
to get the character instead.
What happens in your code is that the expression $string[0]
gets the first byte of the UTF-8 encoded representation of your string because PHP strings are effectively arrays of bytes (PHP does not internally recognize encodings).
Since the first character in your string is composed in more than one byte (UTF-8 encoding rules), you are effectively only getting part of the character. Furthermore, these rules make the byte you are retrieving invalid to stand as a character on its own, which is why you see the question mark.
mb_substr
knows the encoding rules, so it will not naively give you back just one byte; it will get as many as needed to encode the first character.
You can see that $string[0]
gives you back just one byte with:
$string = "üÜöÖäÄ";
echo strlen($string[0]);
While mb_substr
gives you back two bytes:
$string = "üÜöÖäÄ";
echo strlen(mb_substr($string, 0, 1, 'utf-8'));
And these two bytes are in fact just one character (you need to use mb_strlen
for this):
$string = "üÜöÖäÄ";
echo mb_strlen(mb_substr($string, 0, 1, 'utf-8'), 'utf-8');
Finally, as Marwelln points out below, the situation becomes more tolerable if you use mb_internal_encoding
to get rid of the 'utf-8'
redundancy:
$string = "üÜöÖäÄ";
mb_internal_encoding('utf-8');
echo mb_strlen(mb_substr($string, 0, 1));
You can see most of the above in action.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With