
How can I restore proper encoding of 4 byte emoji characters that have been stored in plain utf8, like this: 😊?

Is it possible to re-encode emoji 3 or 4 byte strings into emoji again?

I inherited a MySQL InnoDB table with the utf8_unicode_ci collation. These emoji 4 byte strings are everywhere. Is it possible to translate them back into emoji?

My first step was to change the character set to utf8mb4. This changed all strings like � to strings like this: 😊.

But what I really want is to translate 😊 into something like a smiley emoji. (I have no idea if 😊 is really a smiley.)

Ryan asked Nov 20 '13 22:11



1 Answer

Inspired by Ignacio Vazquez-Abrams' comment. The following Python snippet shows how the mojibake originates (emoji to mojibake) and how to reverse the process (repair):

# Emoji to mojibake (origin): UTF-8 bytes misread as cp1252
print("\nEmoji to mojibake (origin):")
for emojiChar in ['😊', '😣', '👽', '😎']:
    print(emojiChar, emojiChar.encode('utf8').decode('cp1252'))

# Mojibake to emoji (repair): reverse the two steps
print("\nmojibake to Emoji (repair):")
for mojibakeString in ['😊', '😣', '👽', '😎', '🙇']:
    print(mojibakeString, mojibakeString.encode('cp1252').decode('utf8'))

I know that the question is tagged php rather than python; I hope an analogous PHP solution would be very close.

Output:

==> chcp 65001
Active code page: 65001

==> D:\test\Python\20108312.py

Emoji to mojibake (origin):
😊 😊
😣 😣
👽 👽
😎 😎

mojibake to Emoji (repair):
😊 😊
😣 😣
👽 👽
😎 😎
🙇 🙇

==>
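
To see why each emoji in the output above turns into exactly four odd characters, it may help to look at the bytes. The following is only an illustrative sketch (the variable names are not from the original answer): it encodes one emoji to UTF-8, decodes each byte separately as cp1252, and looks up the character's official Unicode name, which also shows whether it really is a smiley.

```python
import unicodedata

emoji = '\U0001F60A'               # the first emoji from the list, as an escape
utf8_bytes = emoji.encode('utf8')  # four bytes: f0 9f 98 8a

# Decode every byte on its own as cp1252; each one becomes one visible character.
pieces = [bytes([b]).decode('cp1252') for b in utf8_bytes]
mojibake = ''.join(pieces)

print(utf8_bytes.hex())             # hex dump of the UTF-8 bytes
print([ascii(p) for p in pieces])   # the four cp1252 characters, ASCII-escaped
print(unicodedata.name(emoji))      # official Unicode character name
```

Printing with `ascii()` keeps the demo independent of the console code page.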

Python version:

Python 3.5.1 (v3.5.1:37a07cee5969, Dec  6 2015, 01:54:25) [MSC v.1900 64 bit (AMD64)] on win32
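
For use on many values, the repair step could be wrapped in a small helper that leaves already-correct text untouched. This is a minimal sketch, not a definitive implementation: the name `repair_mojibake` is hypothetical, and it assumes the damage came from exactly one round of UTF-8 bytes being misread as cp1252, as in the snippet above.

```python
def repair_mojibake(text, wrong_codec='cp1252'):
    """Undo one round of UTF-8 bytes misread as `wrong_codec`.

    Hypothetical helper: returns `text` unchanged when the round trip
    is not possible, e.g. because the text already holds real emoji.
    """
    try:
        return text.encode(wrong_codec).decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == '__main__':
    # Samples written as escapes: mojibake, plain ASCII, and a real emoji.
    for sample in ['\u00f0\u0178\u02dc\u0160', 'plain ascii', '\U0001F60A']:
        print(ascii(sample), '->', ascii(repair_mojibake(sample)))
```

Text that cannot be encoded with the wrong codec, or whose bytes are not valid UTF-8, is passed through as-is, so running the helper a second time on repaired data should be harmless.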

JosefZ answered Oct 16 '22 03:10