I am trying to convert an emoji into its Unicode code point in Python 3. For example, given the emoji 😀, I would like to get the corresponding code point 'U+1F600'. Similarly, I would like to convert 'U+1F600' back to 😀. I have read the documentation and tried several options, but Python's behaviour confuses me here.
>>> x = '😀'
>>> y = x.encode('utf-8')
>>> y
b'\xf0\x9f\x98\x80'
The emoji is converted to a bytes object.
>>> z = y.decode('utf-8')
>>> z
'😀'
This converts the bytes object back to the emoji, so far so good.
Now, taking the Unicode escape sequence for the emoji:
>>> c = '\U0001F600'
>>> d = c.encode('utf-8')
>>> d
b'\xf0\x9f\x98\x80'
This prints out the byte encoding again.
>>> d.decode('utf-8')
'😀'
This prints the emoji out again. I really can't figure out how to convert directly between the Unicode code point and the emoji.
Emojis also have a CLDR short name, which can also be used. From the list of Unicode code points, replace “+” with “000”. For example, “U+1F600” becomes “U0001F600”. Then prefix it with “\” to get “\U0001F600” and print it.
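Replacing “+” with “000” only yields a valid escape when the code point has exactly five hex digits, because \U must be followed by exactly eight. A minimal sketch that zero-pads instead, assuming the input always has the form U+ followed by hex digits; the function name escape_from_notation is made up for illustration:

```python
def escape_from_notation(notation):
    """Build the backslash-U escape text for a string such as 'U+1F600'."""
    hex_digits = notation[2:]            # strip the leading 'U+'
    return '\\U' + hex_digits.zfill(8)   # \U needs exactly eight hex digits


escaped = escape_from_notation('U+1F600')
print(escaped)                                            # \U0001F600
# The escape text is plain ASCII, so the 'unicode-escape' codec can decode it.
print(escaped.encode('ascii').decode('unicode-escape'))   # 😀
```

The same function also handles shorter code points such as 'U+20AC', which the plain “000” replacement would not.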
Because emoji characters are treated as pictographs, they are encoded in Unicode based primarily on their general appearance, not on an intended semantic.
'😀' is already a Unicode object. UTF-8 is not Unicode, it's a byte encoding for Unicode. To get the code point number of a Unicode character, you can use the ord function. And to print it in the form you want, you can format it as hex, like this:
s = '😀'
print('U+{:X}'.format(ord(s)))
output
U+1F600
If you have Python 3.6+, you can make it even shorter (and more efficient) by using an f-string:
s = '😀'
print(f'U+{ord(s):X}')
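For the other direction the question asks about, from 'U+1F600' back to the character, the built-ins int and chr are enough. A minimal sketch, assuming the input has the form U+ followed by hex digits, with several code points separated by spaces; the helper names to_codepoints and from_codepoints are hypothetical. Iterating per character also covers emoji made of more than one code point, where ord alone would raise a TypeError:

```python
def to_codepoints(text):
    """Return the 'U+XXXX' notation for every code point in text."""
    return ' '.join(f'U+{ord(ch):04X}' for ch in text)


def from_codepoints(notation):
    """Turn 'U+1F600' (or several, space-separated) back into text."""
    chars = []
    for token in notation.split():
        # Drop the 'U+' prefix and parse the rest as a hexadecimal number.
        chars.append(chr(int(token[2:], 16)))
    return ''.join(chars)


print(to_codepoints('😀'))          # U+1F600
print(from_codepoints('U+1F600'))   # 😀
```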
BTW, if you want to create a Unicode escape sequence like '\U0001F600', there's the 'unicode-escape' codec. However, it returns a bytes string, and you may wish to convert that back to text. You could use the 'UTF-8' codec for that, but you might as well just use the 'ASCII' codec, since it's guaranteed to only contain valid ASCII.
s = '😀'
print(s.encode('unicode-escape'))
print(s.encode('unicode-escape').decode('ASCII'))
output
b'\\U0001f600'
\U0001f600
I suggest you take a look at this short article by Stack Overflow co-founder Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
sentence = "Head-Up Displays (HUD)💻 for #automotive🚗 sector\n \nThe #UK-based #startup🚀 Envisics got €42 million #funding💰 from l… "
print("normal sentence - ", sentence)
uc_sentence = sentence.encode('unicode-escape')
print("\n\nunicode represented sentence - ", uc_sentence)
decoded_sentence = uc_sentence.decode('unicode-escape')
print("\n\ndecoded sentence - ", decoded_sentence)
output
normal sentence - Head-Up Displays (HUD)💻 for #automotive🚗 sector
The #UK-based #startup🚀 Envisics got €42 million #funding💰 from l…
unicode represented sentence - b'Head-Up Displays (HUD)\\U0001f4bb for #automotive\\U0001f697 sector\\n \\nThe #UK-based #startup\\U0001f680 Envisics got \\u20ac42 million #funding\\U0001f4b0 from l\\u2026 '
decoded sentence - Head-Up Displays (HUD)💻 for #automotive🚗 sector
The #UK-based #startup🚀 Envisics got €42 million #funding💰 from l…