I've messed in the past with Python 3.2 but now I face a somewhat confusing situation about utf-8 encoding in python.
For example, say I have this piece of code:
'א'.encode()
The result is b'\xd7\x90' (or 0xD790), this, however, is wrong: the utf-8 encoding of the Hebrew character Alef is supposed to be 0x5D0.
However, using utf-16 as the encoding returns the correct hex value, with a prefix of 0xFFFE:
'א'.encode('utf-16')
this returns b'\xff\xfe\xd0\x05'.
I feel as if I'm missing something fundamental in my understanding,
SO users, please help educate me!
The unicode codepoint of א is U+05D0, or 101 1101 0000 in binary. The UTF-8 encoding of an 11-bit codepoint ABCDEFGHIJK is
110A BCDE 10FG HIJK
# i.e.
1101 0111 1001 0000 # binary
d 7 9 0 # hex
or, in Python notation, b'\xd7\x90'.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With