Why is the en-dash written as '\xe2\x80\x93' in Python?

Question

Specifically, what does each escape in \xe2\x80\x93 do and why does it need 3 escapes? Trying to decode one by itself leads to an 'unexpected end of data' error.

>>> print(b'\xe2\x80\x93'.decode('utf-8'))
–
>>> print(b'\xe2'.decode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data

Martijn Pieters · Accepted Answer

You have UTF-8 bytes. UTF-8 is a codec, a standard for representing text as computer-readable data. The U+2013 EN DASH codepoint becomes those 3 bytes when encoded with that codec.
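As a minimal sketch of that round trip (the variable names here are illustrative, not from the question), encoding the single codepoint yields the three bytes, and decoding all three together gives the character back:

```python
# U+2013 EN DASH as a one-character str
en_dash = "\u2013"

# Encoding to UTF-8 produces three bytes
encoded = en_dash.encode("utf-8")
print(encoded)        # b'\xe2\x80\x93'
print(len(encoded))   # 3
print(list(encoded))  # [226, 128, 147]

# Decoding the complete sequence round-trips to the original character
decoded = encoded.decode("utf-8")
print(decoded == en_dash)  # True
```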

Trying to decode just one such byte as UTF-8 doesn't work because in the UTF-8 standard that one byte does not, on its own, carry meaning. In the UTF-8 encoding scheme, a \xe2 byte is the first byte for every codepoint between U+2000 and U+2FFF in the Unicode standard (each of which is encoded with an additional 2 bytes); that's 4096 codepoints in all.
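One way to check which codepoints share the \xe2 lead byte is to encode every codepoint in the 3-byte range and count the matches. This is only a brute-force sketch; the names first, last and matches are made up for the illustration, and the surrogate range is skipped because those codepoints cannot be encoded to UTF-8:

```python
# The two ends of the block that shares the 0xE2 lead byte
first = chr(0x2000).encode("utf-8")
last = chr(0x2FFF).encode("utf-8")
print(first, last)  # b'\xe2\x80\x80' b'\xe2\xbf\xbf'

# Collect every codepoint in the 3-byte UTF-8 range (U+0800 to U+FFFF)
# whose encoding starts with 0xE2. Surrogates (U+D800 to U+DFFF) are
# skipped because str.encode('utf-8') rejects them.
matches = [
    cp
    for cp in range(0x0800, 0x10000)
    if not 0xD800 <= cp <= 0xDFFF
    and chr(cp).encode("utf-8")[0] == 0xE2
]
print(hex(matches[0]), hex(matches[-1]))  # 0x2000 0x2fff
print(len(matches))  # 4096
```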

Python represents values in a bytes object in a manner that lets you reproduce the value by copying it back into a Python script or terminal. Anything that isn't printable ASCII is then represented by a \xhh hex escape. The two characters after the \x form the hexadecimal value of the byte, an integer between 0 and 255.
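A short sketch of that representation, using a made-up data value that mixes printable ASCII with the three en-dash bytes. ast.literal_eval stands in here for pasting the output back into a script:

```python
import ast

# 72, 105 and 32 are printable ASCII ('H', 'i', space);
# 226, 128 and 147 are not, so they show up as \xhh escapes.
data = bytes([72, 105, 32, 226, 128, 147])
print(data)  # b'Hi \xe2\x80\x93'

# The printed form is a valid bytes literal that reproduces the same value
text_form = repr(data)
rebuilt = ast.literal_eval(text_form)
print(rebuilt == data)  # True
```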

Hexadecimal is a very helpful way to represent bytes because a byte splits into 2 groups of 4 bits each, and each group can be written as one character, a digit in the range 0 - F.
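To see the correspondence between hex digits and bits, here is a small sketch that splits the E2 byte into its high and low halves (the names high and low are just for illustration):

```python
byte = 0xE2

# Split the byte into its upper and lower 4 bits
high = byte >> 4
low = byte & 0x0F

print(format(byte, "08b"))                     # 11100010
print(format(high, "04b"), format(high, "x"))  # 1110 e
print(format(low, "04b"), format(low, "x"))    # 0010 2
```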

\xe2\x80\x93 then means there are three bytes, with the hexadecimal values E2, 80 and 93, or 226, 128 and 147 in decimal, respectively. The UTF-8 standard tells a decoder to take the last 4 bits of the first byte, and the last 6 bits of each of the second and third bytes. The remaining bits signal what type of byte you are dealing with, which also helps with error handling: a leading 1110 marks the first byte of a 3-byte sequence, and a leading 10 marks a continuation byte. Those 4 + 6 + 6 == 16 bits then encode the hex value 2013 (0010 000000 010011 in binary).
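The bit extraction described above can be done by hand. This is a sketch for this one 3-byte sequence only, not a general UTF-8 decoder, and the names b0, b1, b2 and codepoint are illustrative:

```python
# Iterating over (or unpacking) a bytes object yields integers
b0, b1, b2 = b"\xe2\x80\x93"

# Marker bits: 1110xxxx for the lead byte of a 3-byte sequence,
# 10xxxxxx for each continuation byte
lead_ok = (b0 >> 4) == 0b1110
cont_ok = (b1 >> 6) == 0b10 and (b2 >> 6) == 0b10
print(lead_ok, cont_ok)  # True True

# Keep the last 4 bits of the first byte and the last 6 bits of the others,
# then join them into one 16-bit value
codepoint = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
print(format(codepoint, "016b"))  # 0010000000010011
print(hex(codepoint))             # 0x2013
print(chr(codepoint) == "\u2013")  # True
```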

You probably want to read up on the difference between codecs (encodings) and Unicode; UTF-8 is a codec that can handle all of the Unicode standard, but it is not the same thing as Unicode. See:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO

Tags:

python

encoding

unicode

utf-8

kiri

1 Answers

Martijn Pieters
