Converting emojis to Unicode and vice versa in python 3

Tags:

python

formatting

unicode

emoji

I am trying to convert an emoji into its Unicode code point in Python 3. For example, given the emoji 😀, I would like to get the corresponding code point 'U+1F600'. Similarly, I would like to convert 'U+1F600' back to 😀. I have read the documentation and tried several options, but Python's behaviour confuses me here.

>>> x = '😀'
>>> y = x.encode('utf-8')
>>> y
b'\xf0\x9f\x98\x80'

The emoji is converted to a bytes object.

>>> z = y.decode('utf-8')
>>> z
'😀'

This converts the bytes object back to the emoji, so far so good.

Now, taking the Unicode escape for the emoji:

>>> c = '\U0001F600'
>>> d = c.encode('utf-8')
>>> d
b'\xf0\x9f\x98\x80'

This prints out the byte encoding again.

>>> d.decode('utf-8')
'😀'

This prints the emoji out again. I really can't figure out how to convert directly between the Unicode code point and the emoji.

asked Dec 08 '17 14:12

imc

2 Answers

'😀' is already a Unicode string. UTF-8 is not Unicode; it's a byte encoding for Unicode. To get the code point number of a Unicode character, you can use the ord function, and to print it in the form you want you can format it as hex. Like this:

s = '😀'
print('U+{:X}'.format(ord(s)))

output

U+1F600

If you have Python 3.6+, you can make it even shorter (and more efficient) by using an f-string:

s = '😀'
print(f'U+{ord(s):X}')
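
For the reverse direction asked about in the question (from 'U+1F600' back to 😀), one possible approach is chr combined with int(..., 16). The helper names below (to_codepoints, from_codepoints) are made up for illustration and are not part of any library. This is only a sketch, and it assumes the input uses 'U+XXXX' tokens separated by whitespace. It also loops over the string, because ord only accepts a single character and some emojis consist of several code points.

```python
def to_codepoints(text):
    """Return 'U+XXXX' notation for every code point in text."""
    # ord() only accepts a single character, so handle each one separately.
    return ' '.join(f'U+{ord(ch):04X}' for ch in text)


def from_codepoints(notation):
    """Turn whitespace-separated 'U+XXXX' tokens back into a string."""
    chars = []
    for token in notation.split():
        # Strip the 'U+' prefix if present and parse the rest as hex.
        hex_digits = token[2:] if token.upper().startswith('U+') else token
        chars.append(chr(int(hex_digits, 16)))
    return ''.join(chars)


print(to_codepoints('\U0001F600'))            # U+1F600
print(from_codepoints('U+1F600'))             # 😀
# Thumbs up plus a skin tone modifier: two code points, one visible emoji.
print(to_codepoints('\U0001F44D\U0001F3FD'))  # U+1F44D U+1F3FD
```

The :04X format pads short values to four hex digits (so 'A' becomes U+0041), which matches the usual notation; longer values such as 1F600 are left as they are.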

BTW, if you want to create a Unicode escape sequence like '\U0001F600', there's the 'unicode-escape' codec. However, it returns a bytes object, and you may wish to convert that back to text. You could use the 'UTF-8' codec for that, but you might as well just use the 'ASCII' codec, since the result is guaranteed to contain only valid ASCII.

s = '😀'
print(s.encode('unicode-escape'))
print(s.encode('unicode-escape').decode('ASCII'))

output

b'\\U0001f600'
\U0001f600
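
Going the other way, from the escaped text back to the character, should work by encoding the text to ASCII bytes first and then decoding with the same codec. A small sketch (the variable names are only for illustration):

```python
escaped = '\\U0001f600'   # the 10-character text \U0001f600, not the emoji itself

# The unicode-escape codec decodes from bytes, so go through ASCII bytes first.
emoji = escaped.encode('ascii').decode('unicode-escape')

print(len(escaped))            # 10
print(emoji)                   # 😀
print(emoji == '\U0001F600')   # True
```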

I suggest you take a look at this short article by Stack Overflow co-founder Joel Spolsky: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

answered Oct 09 '22 18:10

PM 2Ring

sentence = "Head-Up Displays (HUD)💻 for #automotive🚗 sector\n \nThe #UK-based #startup🚀 Envisics got €42 million #funding💰 from l… "
print("normal sentence - ", sentence)

# Encode to ASCII-only bytes; every non-ASCII character becomes a \u or \U escape
uc_sentence = sentence.encode('unicode-escape')
print("\n\nunicode represented sentence - ", uc_sentence)

# Decode the escaped bytes back into the original string
decoded_sentence = uc_sentence.decode('unicode-escape')
print("\n\ndecoded sentence - ", decoded_sentence)

output

normal sentence -  Head-Up Displays (HUD)💻 for #automotive🚗 sector
 
The #UK-based #startup🚀 Envisics got €42 million #funding💰 from l… 


unicode represented sentence -  b'Head-Up Displays (HUD)\\U0001f4bb for #automotive\\U0001f697 sector\\n \\nThe #UK-based #startup\\U0001f680 Envisics got \\u20ac42 million #funding\\U0001f4b0 from l\\u2026 '


decoded sentence -  Head-Up Displays (HUD)💻 for #automotive🚗 sector
 
The #UK-based #startup🚀 Envisics got €42 million #funding💰 from l…
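
One caveat with this round trip, as far as I know: decoding with 'unicode-escape' reads any raw non-ASCII byte as Latin-1, so it is only safe on bytes that came out of .encode('unicode-escape') or are otherwise pure ASCII. A small sketch of what goes wrong with UTF-8 bytes (the sample string is just an illustration):

```python
text = 'caf\u00e9 \U0001F680'   # 'café 🚀'

# Safe: the escaped form is pure ASCII, so it round-trips exactly.
escaped = text.encode('unicode-escape')
print(escaped)                                   # b'caf\\xe9 \\U0001f680'
print(escaped.decode('unicode-escape') == text)  # True

# Not safe: UTF-8 bytes are read byte by byte as Latin-1, giving mojibake.
garbled = text.encode('utf-8').decode('unicode-escape')
print(garbled == text)                           # False
```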

answered Oct 09 '22 17:10

Pratyush Behera
