I have a set of UTF-8 octets and I need to convert them back to Unicode code points. How can I do this in Python? For example, the UTF-8 octets ['0xc5', '0x81'] should be converted to the code point 0x141.


Convert UTF-8 octets to unicode code points

1 Answer

Python 3.x:

In Python 3.x, str is the class for Unicode text, and bytes is the class for sequences of octets.

If by "octets" you really mean strings in the form '0xc5' (rather than '\xc5') you can convert to bytes like this:

>>> bytes(int(x,0) for x in ['0xc5', '0x81'])
b'\xc5\x81'
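
To unpack that one-liner: passing 0 as the base makes int() infer the base from the literal's prefix, so '0xc5' is read as hexadecimal. A minimal sketch of the two steps, where the variable names octets, values and raw are only illustrative:

```python
# The octets as given in the question: strings such as '0xc5', not raw bytes.
octets = ['0xc5', '0x81']

# Base 0 means "infer the base from the prefix", so '0x..' is parsed as hex.
values = [int(x, 0) for x in octets]
print(values)  # [197, 129]

# bytes() accepts an iterable of ints, each in range(0, 256).
raw = bytes(values)
print(raw)  # b'\xc5\x81'
```

Note that bytes() raises ValueError if any value falls outside 0 to 255, which is a cheap sanity check that the input really is a list of octets.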

You can then convert to str (i.e., Unicode text) using the str constructor...

>>> str(b'\xc5\x81', 'utf-8')
'Ł'

...or by calling .decode('utf-8') on the bytes object:

>>> b'\xc5\x81'.decode('utf-8')
'Ł'
>>> hex(ord('Ł'))
'0x141'
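
Putting the Python 3 steps together, here is a small sketch that goes from the list of octet strings straight to a list of code points. The function name octets_to_code_points is hypothetical, and it assumes the input really is valid UTF-8:

```python
def octets_to_code_points(octets):
    """Decode UTF-8 octets given as strings like '0xc5' into a list of code points."""
    # Parse each string as an integer (base inferred from the '0x' prefix).
    raw = bytes(int(x, 0) for x in octets)
    # Decode the octets; this raises UnicodeDecodeError if they are not valid UTF-8.
    text = raw.decode('utf-8')
    # ord() gives the code point of each decoded character.
    return [ord(ch) for ch in text]


code_points = octets_to_code_points(['0xc5', '0x81'])
print([hex(cp) for cp in code_points])  # ['0x141']
```

Because a str can hold several characters, the result is a list; a single multi-byte sequence like the one in the question yields a one-element list.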

Pre-3.x:

Prior to 3.x, the str type was a byte string, and unicode was the type for Unicode text.

Again, if by "octets" you really mean strings in the form '0xc5' (rather than '\xc5') you can convert them like this:

>>> ''.join(chr(int(x,0)) for x in ['0xc5', '0x81'])
'\xc5\x81'

You can then convert to unicode using the constructor...

>>> unicode('\xc5\x81', 'utf-8')
u'\u0141'

...or by calling .decode('utf-8') on the str:

>>> '\xc5\x81'.decode('utf-8')
u'\u0141'
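
If the same code has to run on both lines of Python, one option is to go through bytearray, which exists in Python 2.6+ as well as 3.x and has a .decode() method in both. This is only a sketch under that assumption, and the helper name decode_octets is made up for illustration:

```python
def decode_octets(octets):
    """Decode UTF-8 octets given as strings like '0xc5' into Unicode text."""
    # bytearray accepts an iterable of ints (0 to 255) on Python 2.6+ and 3.x.
    raw = bytearray(int(x, 0) for x in octets)
    # Returns unicode on Python 2 and str on Python 3.
    return raw.decode('utf-8')


text = decode_octets(['0xc5', '0x81'])
print(hex(ord(text)))  # 0x141
```

One caveat: on narrow Python 2 builds, characters outside the Basic Multilingual Plane are stored as surrogate pairs, so ord() on each element of the result will not give the real code point there.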

137

answered Sep 30 '22 10:09

Laurence Gonsalves


Tags:

python

unicode

utf-8

Sirish
