I have a list of UTF-16 code points that I need to convert to the actual characters they represent programmatically. This seems unbelievably hard to do in Python 3.
For example, I have the numbers 55357 and 56501 for one character, which I know is this banknote emoji: 💵. But I have no idea how to convert that in Python. I first tried chr(55357) + chr(56501), but Python seems to assume that it is UTF-8 encoded and thus gives me broken Unicode.
I then tried re-encoding the string, but since it's broken UTF-8, it gives me what seems to be broken UTF-16. If I tell it to leave it alone with (chr(55357) + chr(56501)).encode('utf-8', 'surrogatepass'), I can actually get valid bytes for the character, but they're encoded in...CESU-8, for reasons I cannot yet grasp. This is not an encoding Python supports natively, and I can't find a codec to convert it.
I think I could probably write these to the disk and then read them with the right encoding, but that sounds really terrible.
Is there a reasonable way to do this in Python 3?
In Python, the built-in functions chr() and ord() are used to convert between Unicode code points and characters. A character can also be represented by writing a hexadecimal Unicode code point with \x, \u, or \U in a string literal.
UTF-8 is a variable-length encoding, so I'll assume you really meant "Unicode code point". Use chr() to convert a code point to a character, and ord() to get the code point back. In Python 2, chr() only covers byte values, i.e. numbers in the [0..255] range; unichr() is the function that handles full Unicode code points.
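A quick sketch of both in CPython 3 (the code point used below is just the banknote emoji from the question):

print(chr(0x1F4B5))      # 💵  -- code point to character
print(ord('💵'))         # 128181, i.e. 0x1F4B5  -- character back to code point
print('\U0001F4B5')      # the same character written as a \U escape in a literal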
UTF-16 is an encoding of Unicode in which each character is composed of either one or two 16-bit code units. Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts.
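For example (a small sketch using the built-in codecs), encoding to UTF-16 shows exactly where the asker's two numbers come from:

print('A'.encode('utf-16-le'))    # b'A\x00'           -- one 16-bit code unit
print('💵'.encode('utf-16-le'))   # b'=\xd8\xb5\xdc'   -- two code units (a surrogate pair)
print(0xD83D, 0xDCB5)             # 55357 56501        -- the numbers from the question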
Since Python 3.0, the language's str type contains Unicode characters, meaning any string created using "unicode rocks!", 'unicode rocks!', or the triple-quoted string syntax is stored as Unicode.
Unicode's roughly 137,000 characters are each represented by a code point, and a code point refers to an actual character that is displayed. These code points are encoded to bytes for storage or transmission and decoded from bytes back to code points.
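A minimal round trip (UTF-8 is just one possible encoding here):

s = 'unicode rocks! 💵'           # str: a sequence of code points
raw = s.encode('utf-8')           # bytes: the encoded form
assert raw.decode('utf-8') == s   # decoding recovers the original code points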
Decode incoming bytes to str as early as possible, run your processing on Unicode code points inside your Python code, and only encode back into bytes (for example with the UTF-8 encoder) when writing to a file at the end. This is called the Unicode sandwich. Read/watch the excellent talk by Ned Batchelder (@nedbat) about this.
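A sketch of that sandwich with files (the file names and the UTF-8 choice are just placeholders):

with open('in.txt', 'rb') as f:
    text = f.read().decode('utf-8')        # bytes -> str at the input edge
processed = text.upper()                   # all processing happens on str
with open('out.txt', 'wb') as f:
    f.write(processed.encode('utf-8'))     # str -> bytes only at the output edge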
When encoded as UTF-8, a given Unicode character can occupy anywhere from one to four bytes. The length of a single Unicode character as a Python str will always be 1, no matter how many bytes it occupies; the length of the same character encoded to bytes will be anywhere between 1 and 4. Here's an example of a single Unicode character taking up four bytes:
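(Any character above U+FFFF will do; the banknote emoji from the question serves as the illustration here.)

c = '\U0001F4B5'                 # 💵
print(len(c))                    # 1 -- always one character as a Python str
print(len(c.encode('utf-8')))    # 4 -- four bytes once encoded to UTF-8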
Since the Python 3 release, it is not necessary to write the prefix u, as all strings are Unicode strings by default. The method chr() is the inverse of the method ord(): chr() gets the character that a Unicode code point corresponds to.
The trick is not to mess with chr, but rather to convert to a byte array, which you can then decode into a string:
a, b = 55357, 56501                                      # the two UTF-16 code units (a surrogate pair)
x = a.to_bytes(2, 'little') + b.to_bytes(2, 'little')    # pack each unit as two little-endian bytes
print(x.decode('utf-16-le'))                             # decode the buffer as little-endian UTF-16 -> 💵
This can be generalized for any number of integers:
data = [55357, 56501]                                          # any sequence of UTF-16 code units
b = bytes([x for c in data for x in c.to_bytes(2, 'little')])  # pack every unit as two little-endian bytes
result = b.decode('utf-16-le')                                 # decode the whole buffer at once -> '💵'
The reason something like chr(55357) + chr(56501) doesn't work is that chr assumes no encoding: it works on raw Unicode code points, so you are combining two distinct characters (here, two lone surrogates). As the other answer points out, you then have to encode that two-character string and re-decode it, or just build the bytes and decode once, as I'm suggesting.
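For illustration, a small sketch (CPython 3) of both the failure mode and the encode-then-decode route the asker was circling around with surrogatepass:

s = chr(55357) + chr(56501)   # two lone surrogates: U+D83D and U+DCB5
print(len(s))                 # 2 -- two code points, not one character
# s.encode('utf-8')           # raises UnicodeEncodeError: surrogates not allowed
print(s.encode('utf-16-le', 'surrogatepass').decode('utf-16-le'))   # 💵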