How to get a character from its UTF-16 code points in Python 3?

I have a list of UTF-16 code points that I need to convert to the actual characters they represent programmatically. This seems unbelievably hard to do in Python 3.

For example, I have the numbers 55357 and 56501 for one character, which I know is this banknote emoji: 💵. But I have no idea how to convert that in Python. I first tried chr(55357) + chr(56501), but Python seems to assume that it is UTF-8 encoded and thus gives me broken Unicode.

I then tried re-encoding the string, but since it's broken UTF-8, it gives me what seems to be broken UTF-16. If I tell it to leave the surrogates alone with (chr(55357) + chr(56501)).encode('utf-8', 'surrogatepass'), I do get valid bytes for the character, but they're encoded in... CESU-8, for reasons I cannot yet grasp. This is not an encoding Python supports natively, and I can't find a codec to convert it.

I think I could probably write these to the disk and then read them with the right encoding, but that sounds really terrible.

Is there a reasonable way to do this in Python 3?

asked Feb 12 '19 06:02 by Ullallulloo


1 Answer

The trick is not to mess with chr but rather to convert to a byte array, which you can then decode into a string:

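a, b = 55357, 56501
# Pack each 16-bit code unit as two little-endian bytes
x = a.to_bytes(2, 'little') + b.to_bytes(2, 'little')

# Name the byte order explicitly: without a BOM, plain 'UTF-16' uses the platform's native order
print(x.decode('utf-16-le'))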

This can be generalized for any number of integers:

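data = [55357, 56501]
# Flatten every code unit into its two little-endian bytes
b = bytes([x for c in data for x in c.to_bytes(2, 'little')])
# Decode with an explicit byte order, since the bytes carry no BOM
result = b.decode('utf-16-le')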
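
If you need this in more than one place, the same idea can be wrapped in a small helper. The following is only a sketch: the function name utf16_units_to_str is made up for illustration, and it uses struct.pack as an alternative to the to_bytes comprehension above, assuming every input integer is a 16-bit code unit (0 to 65535).

```python
import struct


def utf16_units_to_str(units):
    """Decode an iterable of UTF-16 code units (ints 0..65535) into a str.

    Hypothetical helper for illustration; not part of any library.
    """
    units = list(units)
    # '<' selects little-endian, 'H' is an unsigned 16-bit integer;
    # the repeat count packs one 'H' per code unit.
    raw = struct.pack(f'<{len(units)}H', *units)
    # The byte order is named explicitly, so no BOM is required.
    return raw.decode('utf-16-le')


money = utf16_units_to_str([55357, 56501])
print(len(money), hex(ord(money)))  # one character, U+1F4B5

# Ordinary BMP characters (here 'H' and 'i') mix freely with surrogate pairs.
mixed = utf16_units_to_str([72, 105, 55357, 56501])
print(len(mixed))
```

struct.pack raises struct.error for a value outside the 16-bit range, so bad input fails loudly instead of producing garbage bytes.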

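The reason something like chr(55357) + chr(56501) doesn't work is that chr involves no encoding at all. It maps each integer directly to a Unicode code point, and 55357 (0xD83D) and 56501 (0xDCB5) are surrogate code points that only have meaning as a pair of UTF-16 code units. The result is a string of two separate lone surrogates rather than one character. As the other answer points out, you then have to encode this two-character string and re-decode it, or just get the bytes and decode once as I'm suggesting.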
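
To make the difference concrete, here is a short sketch of what chr actually produces and two ways to recover the real character. The variable names are illustrative only. The first route is the encode and re-decode round trip, using the surrogatepass error handler with the UTF-16 codec; the second applies the standard UTF-16 surrogate-pair arithmetic by hand.

```python
hi, lo = 55357, 56501  # 0xD83D and 0xDCB5: a high and a low surrogate

# chr() maps each integer to its own code point, so this is a
# two-element str of lone surrogates, not a single character.
broken = chr(hi) + chr(lo)
assert len(broken) == 2

# Route 1: write the surrogates out as UTF-16 code units, then decode
# them again so the codec combines the pair into one code point.
fixed = broken.encode('utf-16', 'surrogatepass').decode('utf-16')
assert len(fixed) == 1

# Route 2: combine the pair arithmetically. Each surrogate carries
# 10 bits of the offset above U+10000.
code_point = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
assert fixed == chr(code_point)

print(hex(code_point))  # 0x1f4b5
```

The arithmetic route only applies to a valid high/low pair; for mixed input it is simpler to let the UTF-16 codec do the pairing.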

answered Sep 17 '22 23:09 by Mad Physicist