
Convert UTF-8 octets to unicode code points

I have a set of UTF-8 octets and I need to convert them back to Unicode code points. How can I do this in Python?

For example, the UTF-8 octets ['0xc5', '0x81'] should be converted to the code point 0x141.

asked Dec 08 '09 04:12 by Sirish



1 Answer

Python 3.x:

In Python 3.x, str is the class for Unicode text, and bytes is for containing octets.

If by "octets" you really mean strings in the form '0xc5' (rather than '\xc5'), you can convert them to bytes like this (passing 0 as the base tells int to infer the base from the 0x prefix):

>>> bytes(int(x,0) for x in ['0xc5', '0x81'])
b'\xc5\x81'
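
As a minimal sketch of the same conversion wrapped in a function: the name octets_to_bytes is hypothetical (it is not from the answer above), and it assumes every input string carries a prefix such as 0x so that base 0 can infer the base.

```python
def octets_to_bytes(octets):
    """Convert strings such as '0xc5' into a bytes object (hypothetical helper)."""
    # Base 0 makes int() infer the base from the literal's prefix ('0x' -> hex).
    values = [int(x, 0) for x in octets]
    # bytes() raises ValueError if any value falls outside range(256).
    return bytes(values)


print(octets_to_bytes(['0xc5', '0x81']))
```

If the strings have no prefix (for example 'c5'), int(x, 0) raises ValueError; in that case int(x, 16) or bytes.fromhex('c581') should work instead.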

You can then convert to str (i.e. Unicode text) using the str constructor...

>>> str(b'\xc5\x81', 'utf-8')
'Ł'

...or by calling .decode('utf-8') on the bytes object:

>>> b'\xc5\x81'.decode('utf-8')
'Ł'
>>> hex(ord('Ł'))
'0x141'
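
Putting the steps together, here is a rough sketch that goes from the list of octet strings straight to a list of code points. The function name octets_to_code_points is an assumption made for illustration, and the input is assumed to be well-formed UTF-8 (decode raises UnicodeDecodeError otherwise). The last lines redo the two-byte case by hand as a sanity check on the bit layout.

```python
def octets_to_code_points(octets):
    """Return the Unicode code points encoded by UTF-8 octets like '0xc5' (hypothetical helper)."""
    raw = bytes(int(x, 0) for x in octets)   # '0xc5' -> 0xc5, collected into bytes
    text = raw.decode('utf-8')               # raises UnicodeDecodeError on invalid input
    return [ord(ch) for ch in text]          # one integer code point per character


points = octets_to_code_points(['0xc5', '0x81', '0x41'])
print([hex(p) for p in points])

# Manual check of the two-byte form 110xxxxx 10yyyyyy:
# keep the low 5 bits of the lead byte and the low 6 bits of the continuation byte.
lead, cont = 0xC5, 0x81
manual = ((lead & 0x1F) << 6) | (cont & 0x3F)
print(hex(manual))
```

Returning integers rather than hex strings keeps the result easy to compare or format later; hex() can be applied at the point of display.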

Pre-3.x:

Prior to 3.x, the str type was a byte string, and unicode was the type for Unicode text.

Again, if by "octets" you really mean strings in the form '0xc5' (rather than '\xc5'), you can convert them like this:

>>> ''.join(chr(int(x,0)) for x in ['0xc5', '0x81'])
'\xc5\x81'

You can then convert to unicode using the unicode constructor...

>>> unicode('\xc5\x81', 'utf-8')
u'\u0141'

...or by calling .decode('utf-8') on the str:

>>> '\xc5\x81'.decode('utf-8')
u'\u0141'

answered Sep 30 '22 10:09 by Laurence Gonsalves