We have found an issue, that some emoji have two utf-8 codes, such as:
emoji unicode utf-8 another utf-8
😁 U+1F601 \xf0\x9f\x98\x81 \xed\xa0\xbd\xed\xb8\x81
But ios language can't decode the other type of utf-8, so resulting an error when i decode string from utf-8.
In all documents i found, i can just find one type of utf-8 code for a emoji, no where to find the other.
Documents i referenced includes:
emoji code link
whole utf-8 code link
But in a web tool bianma, all the two types of utf-8 code can be converted into emoji correctly.
So, my question is :
Why does there have two types of utf-8 codes for one emoji ?
Where has a document which includes the two types of utf-8 codes?
How to correctly convert string from utf-8, using NSString in ios language?
Emojis look like images, or icons, but they are not. They are letters (characters) from the UTF-8 (Unicode) character set. UTF-8 covers almost all of the characters and symbols in the world.
UTF-8 encodes a character into a binary string of one, two, three, or four bytes. UTF-16 encodes a Unicode character into a string of either two or four bytes. This distinction is evident from their names. In UTF-8, the smallest binary representation of a character is one byte, or eight bits.
0xF0, 0x9F, 0x98, 0x81
Is the correct UTF-8 encoding for U+1F601 😁.
0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x81
Is not a valid UTF-8 sequence(*). It should really be rejected; iOS is correct to do so.
This is a bug in the bianma tool: the convertUtf8BytesToUnicodeCodePoints
function is more lenient about what input it accepts than the specified algorithm in eg RFC 3629.
This happens to return a working string only because the tool is written in JavaScript. Having decoded the above byte sequence to the bogus surrogate code point sequence U+D83D,U+DE01 it then converts that into a JavaScript string using a direct code-point-to-code-unit mapping giving \uD83D\xDE01
. As this is the correct way to encode 😁 in a UTF-16 string it appears to have worked.
(*: It is a valid CESU-8 sequence, but that encoding is just “bogus broken encoding for compatibility with badly-written historical tools” and should generally be avoided.)
You should not usually encounter a sequence like this; it is typically not worth catering for unless you have a specific source of this kind of malformed data which you don't have the power to get fixed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With