Why does an emoji have two different UTF-8 encodings? How do I convert an emoji from UTF-8 using NSString in iOS?

We have found an issue where some emoji have two UTF-8 byte sequences, for example:

emoji   Unicode    UTF-8                another UTF-8
😁      U+1F601    \xf0\x9f\x98\x81     \xed\xa0\xbd\xed\xb8\x81

But iOS can't decode the second type of UTF-8 byte sequence, so I get an error when I decode the string from UTF-8.

(iOS code screenshot)
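
The screenshot is not included here, but the failing call looks roughly like the following minimal Objective-C sketch (variable names are illustrative), using NSString's -initWithBytes:length:encoding::

    #import <Foundation/Foundation.h>

    int main(void) {
        @autoreleasepool {
            // The first byte sequence decodes fine:
            const char good[] = "\xf0\x9f\x98\x81";
            NSString *ok = [[NSString alloc] initWithBytes:good
                                                    length:4
                                                  encoding:NSUTF8StringEncoding];
            NSLog(@"%@", ok);      // 😁

            // The second byte sequence is rejected: the initializer returns nil.
            const char other[] = "\xed\xa0\xbd\xed\xb8\x81";
            NSString *broken = [[NSString alloc] initWithBytes:other
                                                        length:6
                                                      encoding:NSUTF8StringEncoding];
            NSLog(@"%@", broken);  // (null)
        }
        return 0;
    }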


In all the documents I found, I could only find one UTF-8 byte sequence for each emoji; nowhere could I find the other one.

Documents I referenced include:

emoji code link

whole UTF-8 code link

But in a web tool, bianma, both types of UTF-8 byte sequences are converted into the emoji correctly.

(screenshots of the tool's input and output)

So, my questions are:

  1. Why are there two types of UTF-8 byte sequences for one emoji?

  2. Where is there a document that lists both types of UTF-8 byte sequences?

  3. How do I correctly convert a string from UTF-8 using NSString in iOS?

asked Dec 22 '15 by pinchwang

1 Answer

0xF0, 0x9F, 0x98, 0x81

Is the correct UTF-8 encoding for U+1F601 😁.

0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x81

Is not a valid UTF-8 sequence(*). It should really be rejected; iOS is correct to do so.

This is a bug in the bianma tool: the convertUtf8BytesToUnicodeCodePoints function is more lenient about what input it accepts than the algorithm specified in, e.g., RFC 3629.

This happens to return a working string only because the tool is written in JavaScript. Having decoded the above byte sequence into the bogus surrogate code point sequence U+D83D, U+DE01, it then converts that into a JavaScript string using a direct code-point-to-code-unit mapping, giving \uD83D\uDE01. As this is the correct way to encode 😁 in a UTF-16 string, it appears to have worked.
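
You can see the same effect in a small Objective-C sketch, assuming Foundation's +stringWithCharacters:length:. NSString, like a JavaScript string, is a sequence of UTF-16 code units, so handing it the surrogate pair directly does produce the emoji, even though the original bytes were never valid UTF-8:

    #import <Foundation/Foundation.h>

    int main(void) {
        @autoreleasepool {
            // The two bogus "code points" are really a UTF-16 surrogate pair,
            // which is exactly how 😁 is stored in a UTF-16 string.
            unichar units[2] = { 0xD83D, 0xDE01 };
            NSString *s = [NSString stringWithCharacters:units length:2];
            NSLog(@"%@", s);   // 😁
        }
        return 0;
    }

That is effectively all the bianma tool is doing; it does not make the original byte sequence valid UTF-8.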

(*: It is a valid CESU-8 sequence, but that encoding is just “bogus broken encoding for compatibility with badly-written historical tools” and should generally be avoided.)

You should not usually encounter a sequence like this; it is typically not worth catering for unless you have a specific source of this kind of malformed data which you don't have the power to get fixed.
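
If you really are stuck with such a source, one option is to rewrite the CESU-8 surrogate pairs into proper 4-byte UTF-8 before handing the bytes to NSString. A rough sketch of that repair (the repairCESU8 helper is hypothetical and does only minimal validation):

    #import <Foundation/Foundation.h>
    #include <stdint.h>

    // Rewrites CESU-8 surrogate pairs (ED A0..AF xx  ED B0..BF yy) into proper
    // 4-byte UTF-8; every other byte is copied through unchanged.
    // `out` must have room for `len` bytes (the repaired form is never longer).
    static NSUInteger repairCESU8(const uint8_t *in, NSUInteger len, uint8_t *out) {
        NSUInteger i = 0, o = 0;
        while (i < len) {
            if (i + 6 <= len &&
                in[i]   == 0xED && (in[i+1] & 0xF0) == 0xA0 &&   // high surrogate
                (in[i+2] & 0xC0) == 0x80 &&
                in[i+3] == 0xED && (in[i+4] & 0xF0) == 0xB0 &&   // low surrogate
                (in[i+5] & 0xC0) == 0x80) {
                // Decode the two 3-byte sequences to surrogate code units,
                // combine them into a code point, and re-encode as UTF-8.
                unsigned hi = 0xD000 | ((in[i+1] & 0x3F) << 6) | (in[i+2] & 0x3F);
                unsigned lo = 0xD000 | ((in[i+4] & 0x3F) << 6) | (in[i+5] & 0x3F);
                unsigned cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
                out[o++] = 0xF0 |  (cp >> 18);
                out[o++] = 0x80 | ((cp >> 12) & 0x3F);
                out[o++] = 0x80 | ((cp >> 6)  & 0x3F);
                out[o++] = 0x80 |  (cp        & 0x3F);
                i += 6;
            } else {
                out[o++] = in[i++];
            }
        }
        return o;
    }

    int main(void) {
        @autoreleasepool {
            const char cesu[] = "\xed\xa0\xbd\xed\xb8\x81";
            uint8_t repaired[sizeof cesu];
            NSUInteger n = repairCESU8((const uint8_t *)cesu, 6, repaired);
            NSString *s = [[NSString alloc] initWithBytes:repaired
                                                   length:n
                                                 encoding:NSUTF8StringEncoding];
            NSLog(@"%@", s);   // 😁
        }
        return 0;
    }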

answered Sep 29 '22 by bobince