
Why is the en-dash written as '\xe2\x80\x93' in Python?

Specifically, what does each escape in \xe2\x80\x93 do and why does it need 3 escapes? Trying to decode one by itself leads to an 'unexpected end of data' error.

>>> print(b'\xe2\x80\x93'.decode('utf-8'))
–
>>> print(b'\xe2'.decode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data
kiri asked Apr 30 '15 12:04


1 Answer

You have UTF-8 bytes. UTF-8 is a codec, a standard for representing text as computer-readable data. The U+2013 EN DASH codepoint encodes to those 3 bytes in that codec.
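As a small illustration of the point above, the following sketch encodes the en dash with UTF-8 and, for contrast, with two other codecs. The variable names are only for illustration.

```python
# U+2013 EN DASH as a one-character string.
en_dash = "\u2013"

# Encoding to UTF-8 produces the three bytes from the question.
encoded = en_dash.encode("utf-8")
print(encoded)       # b'\xe2\x80\x93'
print(len(encoded))  # 3

# Other codecs represent the same codepoint with different bytes.
utf16 = en_dash.encode("utf-16-be")
cp1252 = en_dash.encode("cp1252")
print(utf16)   # two bytes: 0x20 0x13
print(cp1252)  # one byte: 0x96
```

The codepoint stays the same in every case; only the byte representation chosen by the codec differs.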

Trying to decode just one such byte as UTF-8 doesn't work because, in the UTF-8 standard, that one byte does not carry meaning on its own. In the UTF-8 encoding scheme, a \xe2 byte is the first byte of every codepoint from U+2000 through U+2FFF in the Unicode standard (each of which is encoded with an additional 2 bytes); that's 4096 codepoints in all.
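One way to check the range of codepoints that share the \xe2 lead byte is to scan for them. This is a brute-force sketch, not how a decoder works; it stops before U+D800 to avoid the surrogate range, which cannot be encoded.

```python
# Collect every codepoint below U+D800 whose UTF-8 encoding starts with 0xE2.
starts_with_e2 = [
    cp for cp in range(0x0000, 0xD800)
    if chr(cp).encode("utf-8")[0] == 0xE2
]

print(hex(starts_with_e2[0]), hex(starts_with_e2[-1]))  # 0x2000 0x2fff
print(len(starts_with_e2))                              # 4096

# Every one of them needs exactly two more bytes after the lead byte.
lengths = {len(chr(cp).encode("utf-8")) for cp in starts_with_e2}
print(lengths)  # {3}
```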

Python represents values in a bytes object in a manner that lets you reproduce the value by copying it back into a Python script or terminal. Anything that isn't printable ASCII is represented by an escape, in most cases a \xhh hex escape (a few bytes get short escapes such as \n). The two hex digits hh form the hexadecimal value of the byte, an integer between 0 and 255.
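A short sketch of that round trip: the repr of a bytes object is itself a valid Python literal, and each element of the object is just an integer. `ast.literal_eval` is used here only to show that the printed form can be read back.

```python
import ast

# One printable ASCII byte followed by the three en dash bytes.
data = bytes([0x41, 0xE2, 0x80, 0x93])

text_form = repr(data)
print(text_form)   # b'A\xe2\x80\x93'
print(list(data))  # [65, 226, 128, 147]

# Pasting the printed form back into Python gives the same bytes.
round_tripped = ast.literal_eval(text_form)
print(round_tripped == data)  # True
```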

Hexadecimal is a very helpful way to represent bytes because a byte splits into 2 groups of 4 bits, and each group can be represented with one character, a digit in the range 0 - F.
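To make the relationship between hex digits and bits concrete, this sketch splits the byte E2 into its two 4-bit halves. The names `high` and `low` are just illustrative.

```python
byte = 0xE2

# Upper and lower 4 bits of the byte.
high = byte >> 4
low = byte & 0x0F

print(format(byte, "08b"))                     # 11100010
print(format(high, "04b"), format(high, "X"))  # 1110 E
print(format(low, "04b"), format(low, "X"))    # 0010 2
```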

\xe2\x80\x93 then means there are three bytes, with the hexadecimal values E2, 80 and 93, or 226, 128 and 147 in decimal, respectively. The UTF-8 standard tells a decoder to take the last 4 bits of the first byte, and the last 6 bits of each of the second and third bytes (the remaining bits are used to signal what type of byte you are dealing with, for error handling). Those 4 + 6 + 6 == 16 bits then encode the hex value 2013 (0010 000000 010011 in binary).
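The bit arithmetic described above can be reproduced by hand. This is a minimal sketch for the 3-byte case only; it checks the marker bits but skips the other validity checks a real UTF-8 decoder performs.

```python
# Unpacking a bytes object yields integers: 226, 128, 147.
b1, b2, b3 = b"\xe2\x80\x93"

# Marker bits: 1110xxxx starts a 3-byte sequence, 10xxxxxx continues one.
assert b1 >> 4 == 0b1110
assert b2 >> 6 == 0b10 and b3 >> 6 == 0b10

# Keep the last 4 bits of the first byte and the last 6 bits of the others,
# then join them into one 16-bit value.
codepoint = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

print(format(codepoint, "016b"))  # 0010000000010011
print(hex(codepoint))             # 0x2013
print(chr(codepoint) == b"\xe2\x80\x93".decode("utf-8"))  # True
```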

You probably want to read up on the difference between codecs (encodings) and Unicode; UTF-8 is a codec that can encode every codepoint in the Unicode standard, but it is not the same thing as Unicode. See:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • Pragmatic Unicode by Ned Batchelder

  • The Python Unicode HOWTO

Martijn Pieters answered Sep 19 '22 14:09