 

Converting emojis to Unicode code points and vice versa in Python 3

I am trying to convert an emoji into its Unicode code point in Python 3. For example, given the emoji 😀, I would like to get the corresponding code point 'U+1F600'. Similarly, I would like to convert 'U+1F600' back to 😀. I have read the documentation and tried several options, but Python's behaviour confuses me here.

>>> x = '😀'
>>> y = x.encode('utf-8')
>>> y
b'\xf0\x9f\x98\x80'

The emoji is converted to a bytes object.

>>> z = y.decode('utf-8')
>>> z
'😀'

Decoding the bytes object gives back the emoji; so far so good.

Now, taking the Unicode escape sequence for the emoji:

>>> c = '\U0001F600'
>>> d = c.encode('utf-8')
>>> d
b'\xf0\x9f\x98\x80'

This prints out the byte encoding again.

>>> d.decode('utf-8')
'😀'

This prints the emoji out again. I really can't figure out how to convert directly between the Unicode code point and the emoji.

imc asked Dec 08 '17 14:12



2 Answers

'😀' is already a Unicode string. UTF-8 is not Unicode; it's a byte encoding for Unicode. To get the code point number of a Unicode character, you can use the ord function, and to print it in the form you want, you can format it as hex, like this:

s = '😀'
print('U+{:X}'.format(ord(s)))

output

U+1F600

If you have Python 3.6+, you can make it even shorter (and more efficient) by using an f-string:

s = '😀'
print(f'U+{ord(s):X}')
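
The question also asks for the reverse direction, and ord only accepts a single character, while some emoji (flags, for instance) are made of several code points. A minimal sketch that handles both, assuming the input notation always has the form 'U+' followed by hex digits; the function names emoji_to_codepoints and codepoints_to_emoji are hypothetical:

```python
def emoji_to_codepoints(text):
    """Return 'U+XXXX' notation for every code point in text."""
    # ord() accepts exactly one character, so iterate over the string;
    # some emoji (flags, skin-tone variants) consist of several code points.
    return [f'U+{ord(ch):04X}' for ch in text]


def codepoints_to_emoji(notations):
    """Rebuild a string from a list of 'U+XXXX' notations."""
    # Strip the 'U+' prefix, parse the rest as hexadecimal, and let chr()
    # turn the resulting integer back into a character.
    return ''.join(chr(int(n[2:], 16)) for n in notations)


print(emoji_to_codepoints('\U0001F600'))            # ['U+1F600']
print(codepoints_to_emoji(['U+1F600']))             # the 😀 character
print(emoji_to_codepoints('\U0001F1EC\U0001F1E7'))  # ['U+1F1EC', 'U+1F1E7']
```

chr is the built-in inverse of ord, so no codec is involved in either direction.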

BTW, if you want to create a Unicode escape sequence like '\U0001F600', there's the 'unicode-escape' codec. However, it returns a bytes object, and you may wish to convert that back to text. You could use the 'UTF-8' codec for that, but you might as well just use the 'ASCII' codec, since the result is guaranteed to contain only valid ASCII.

s = '😀'
print(s.encode('unicode-escape'))
print(s.encode('unicode-escape').decode('ASCII'))

output

b'\\U0001f600'
\U0001f600
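
Going the other way, from the escaped text back to the character, can be sketched by reversing the two steps. This assumes the escaped text is pure ASCII, as it is when produced by the snippet above; the helper name unescape is hypothetical:

```python
def unescape(escaped_text):
    """Turn text such as a backslash-U escape back into the character it names."""
    # Assumption: escaped_text is pure ASCII, as produced by
    # str.encode('unicode-escape').decode('ascii').
    return escaped_text.encode('ascii').decode('unicode-escape')


escaped = '\U0001F600'.encode('unicode-escape').decode('ascii')
print(escaped)            # \U0001f600
print(unescape(escaped))  # the 😀 character
```

Encoding with 'ascii' first makes the assumption explicit: input containing non-ASCII characters raises UnicodeEncodeError instead of being silently misread, since the 'unicode-escape' decoder treats raw non-ASCII bytes as Latin-1.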

I suggest you take a look at this short article by Stack Overflow co-founder Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

PM 2Ring answered Oct 09 '22 18:10


sentence = "Head-Up Displays (HUD)💻 for #automotive🚗 sector\n \nThe #UK-based #startup🚀 Envisics got €42 million #funding💰 from l… "
print("normal sentence - ", sentence)

# Encode to bytes, replacing every non-ASCII character with a \u or \U escape
uc_sentence = sentence.encode('unicode-escape')
print("\n\nunicode represented sentence - ", uc_sentence)

# Decode the escaped bytes back into the original string
decoded_sentence = uc_sentence.decode('unicode-escape')
print("\n\ndecoded sentence - ", decoded_sentence)

output

normal sentence -  Head-Up Displays (HUD)💻 for #automotive🚗 sector
 
The #UK-based #startup🚀 Envisics got €42 million #funding💰 from l… 


unicode represented sentence -  b'Head-Up Displays (HUD)\\U0001f4bb for #automotive\\U0001f697 sector\\n \\nThe #UK-based #startup\\U0001f680 Envisics got \\u20ac42 million #funding\\U0001f4b0 from l\\u2026 '


decoded sentence -  Head-Up Displays (HUD)💻 for #automotive🚗 sector
 
The #UK-based #startup🚀 Envisics got €42 million #funding💰 from l… 
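
To see which code points the escapes in that output stand for, one option is to list every non-ASCII character with its U+ notation and official Unicode name. A minimal sketch; the function name describe_non_ascii is hypothetical, and the '<unnamed>' placeholder is an arbitrary choice for characters the installed Unicode database has no name for:

```python
import unicodedata


def describe_non_ascii(text):
    """Yield (character, 'U+XXXX', name) for each non-ASCII character."""
    for ch in text:
        if ord(ch) > 127:
            # The second argument to name() is the default that is
            # returned for characters without a name in the database.
            yield ch, f'U+{ord(ch):04X}', unicodedata.name(ch, '<unnamed>')


for ch, notation, name in describe_non_ascii('got \u20ac42 million \U0001F600'):
    print(ch, notation, name)
```
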
Pratyush Behera answered Oct 09 '22 17:10