
How do I convert an int representing a UTF-8 character into a Unicode code point?

Let us use the character Latin Capital Letter A with Ogonek (U+0104) as an example.

I have an int that represents its UTF-8 encoded form:

my_int = 0xC484
# Decimal: `50308`
# Binary: `0b1100010010000100`

If I use the unichr function, I get \uC484 (U+C484), because unichr treats the int as a code point rather than as UTF-8 bytes.

But I need it to output: Ą

How do I convert my_int to a Unicode code point?

A. K. Tolentino asked Mar 26 '15 08:03


1 Answer

To convert the integer 0xC484 to the bytestring '\xc4\x84' (the UTF-8 representation of the Unicode character Ą), you can use struct.pack():

>>> import struct
>>> struct.pack(">H", 0xC484)
'\xc4\x84'

... where > in the format string represents big-endian byte order, and H represents an unsigned short (2 bytes).
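To see why the byte order matters, here is a small sketch, written for Python 3 (where struct.pack() returns a bytes object rather than a str). The variable names big and little are only illustrative. It packs the same int both ways and shows that only the big-endian form keeps the UTF-8 bytes in their original order:

```python
import struct

my_int = 0xC484

# ">H": big-endian unsigned short, most significant byte first (C4, then 84)
big = struct.pack(">H", my_int)

# "<H": little-endian unsigned short, least significant byte first (84, then C4)
little = struct.pack("<H", my_int)

print(big.hex())     # c484
print(little.hex())  # 84c4
```

The little-endian result starts with the continuation byte 0x84, so it is not valid UTF-8 and would fail to decode.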

Once you have your UTF-8 bytestring, you can decode it to Unicode as usual:

>>> struct.pack(">H", 0xC484).decode("utf8")
u'\u0104'

>>> print struct.pack(">H", 0xC484).decode("utf8")
Ą
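
The sessions above are Python 2, and ">H" only covers two-byte sequences. As a rough sketch for Python 3, assuming the int always holds the bytes of exactly one UTF-8 sequence with the first byte in the most significant position, int.to_bytes() can stand in for struct.pack() and handle one to four bytes. The helper name utf8_int_to_str is made up for this example:

```python
def utf8_int_to_str(value):
    """Decode an int whose bytes are one UTF-8 encoded character."""
    # Number of bytes needed to hold the value (at least 1, so 0 maps to b"\x00").
    length = max(1, (value.bit_length() + 7) // 8)
    # Big-endian keeps the lead byte first, as UTF-8 requires.
    return value.to_bytes(length, "big").decode("utf-8")

ch = utf8_int_to_str(0xC484)
print(hex(ord(ch)))  # 0x104, i.e. U+0104 (Ą)
```

An int that is not a valid UTF-8 sequence raises UnicodeDecodeError here, which is probably what you want for bad input.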

Zero Piraeus answered Sep 22 '22 13:09