I have a set of UTF-8 octets and I need to convert them back to unicode code points. How can I do this in python.
e.g. UTF-8 octet ['0xc5','0x81'] should be converted to 0x141 codepoint.
The Difference Between Unicode and UTF-8Unicode is a character set. UTF-8 is encoding. Unicode is a list of characters with unique decimal numbers (code points).
UTF-8 (Unicode Transformation–8-bit) is an encoding defined by the International Organization for Standardization (ISO) in ISO 10646. It can represent up to 2,097,152 code points (2^21), more than enough to cover the current 1,112,064 Unicode code points.
UTF-8 encodes Unicode characters into a sequence of 8-bit bytes. The standard has a capacity for over a million distinct codepoints and is a superset of all characters in widespread use today. By comparison, ASCII (American Standard Code for Information Interchange) includes 128 character codes.
In Python 3.x, str
is the class for Unicode text, and bytes
is for containing octets.
If by "octets" you really mean strings in the form '0xc5' (rather than '\xc5') you can convert to bytes
like this:
>>> bytes(int(x,0) for x in ['0xc5', '0x81'])
b'\xc5\x81'
You can then convert to str
(ie: Unicode) using the str
constructor...
>>> str(b'\xc5\x81', 'utf-8')
'Ł'
...or by calling .decode('utf-8')
on the bytes
object:
>>> b'\xc5\x81'.decode('utf-8')
'Ł'
>>> hex(ord('Ł'))
'0x141'
Prior to 3.x, the str
type was a byte array, and unicode
was for Unicode text.
Again, if by "octets" you really mean strings in the form '0xc5' (rather than '\xc5') you can convert them like this:
>>> ''.join(chr(int(x,0)) for x in ['0xc5', '0x81'])
'\xc5\x81'
You can then convert to unicode
using the constructor...
>>> unicode('\xc5\x81', 'utf-8')
u'\u0141'
...or by calling .decode('utf-8')
on the str
:
>>> '\xc5\x81'.decode('utf-8')
u'\u0141'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With