
Python 3 UTF-8 encoding seems to be wrong?

I've worked with Python 3.2 in the past, but I've now run into a somewhat confusing situation with UTF-8 encoding in Python.
For example, say I have this piece of code:

'א'.encode()

The result is b'\xd7\x90' (or 0xD790). This, however, looks wrong to me: I expected the UTF-8 encoding of the Hebrew letter Alef to be 0x5D0.
However, using utf-16 as the encoding returns the correct hex value, with a prefix of 0xFFFE:

'א'.encode('utf-16')

This returns b'\xff\xfe\xd0\x05'.

I feel as if I'm missing something fundamental in my understanding,
SO users, please help educate me!

GZaidman asked May 18 '26 07:05



1 Answer

The Unicode code point of א is U+05D0, or 101 1101 0000 in binary. UTF-8 does not store that number directly; it spreads the bits over a fixed byte template. The UTF-8 encoding of an 11-bit code point ABCDEFGHIJK is

110A BCDE  10FG HIJK
# i.e.
1101 0111  1001 0000 # binary
 d    7     9    0   # hex

or, in Python notation, b'\xd7\x90'.
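
To check the bit layout above yourself, here is a minimal sketch that packs the bits by hand and compares the result with the built-in codec. The helper name encode_two_byte_utf8 is made up for this illustration, and it only handles code points in the two-byte range (U+0080 to U+07FF).

```python
def encode_two_byte_utf8(ch: str) -> bytes:
    """Hand-encode one character whose code point fits in 11 bits.

    Hypothetical helper for illustration; it covers only U+0080..U+07FF.
    """
    cp = ord(ch)
    if not 0x80 <= cp <= 0x7FF:
        raise ValueError("this sketch only covers the two-byte UTF-8 range")
    lead = 0b11000000 | (cp >> 6)          # 110ABCDE: top 5 bits of the code point
    cont = 0b10000000 | (cp & 0b00111111)  # 10FGHIJK: low 6 bits of the code point
    return bytes([lead, cont])


alef = "\u05d0"  # the Hebrew letter Alef

print(hex(ord(alef)))              # 0x5d0 (the code point, not an encoding)
print(encode_two_byte_utf8(alef))  # b'\xd7\x90'
print(alef.encode("utf-8"))        # b'\xd7\x90'
print(alef.encode("utf-16-le"))    # b'\xd0\x05'
print(alef.encode("utf-16-be"))    # b'\x05\xd0'
```

The last two lines also explain the UTF-16 output in the question: for code points below U+10000 (outside the surrogate range), UTF-16 stores the code point as a single 16-bit unit, so 0x05D0 shows up unchanged, just in little-endian byte order (d0 05). The leading b'\xff\xfe' is the byte order mark that the plain 'utf-16' codec prepends.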

phihag answered May 19 '26 21:05
