Specifically, what does each escape in \xe2\x80\x93
do and why does it need 3 escapes? Trying to decode one by itself leads to an 'unexpected end of data' error.
>>> print(b'\xe2\x80\x93'.decode('utf-8'))
–
>>> print(b'\xe2'.decode('utf-8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data
You have UTF-8 bytes, which is a codec, a standard to represent text as computer-readable data. The U+2013 EN-DASH codepoint encodes to those 3 bytes when encoded to that codec.
Trying to decode just one such byte as UTF-8 doesn't work because in the UTF-8 standard that one byte does not, on its own, carry meaning. In the UTF-8 encoding scheme, a \xe2
byte is used for all codepoints between U+2000 and U+2FFF in the Unicode standard (which would all be encoded with an additional 2 bytes); thats 4095 codepoints in all.
Python represents values in a bytes
object in a manner that lets you reproduce the value by copying it back into a Python script or terminal. Anything that isn't printable ASCII is then represented by a \xhh
hex escape. The two characters form the hexadecimal value of the byte, an integer number between 0 and 255.
Hexadecimal is a very helpful way to represent bytes because you can represent the 2 pairs of 4 bytes each with one character, a digit in the range 0 - F.
\xe2\x80\x93
then means there are three bytes, with the hexadecimal values E2, 80 and 93, or 226, 128 and 147 in decimal, respectively. The UTF-8 standard tells a decoder to take the last 4 bits of the first byte, and the last 6 bytes of each of the second and third bytes (the remaining bits are used to signal what type of byte you are dealing with for error handling). Those 4 + 6 + 6 == 16 bits then encode the hex value 2013 (0010 000000 010011
in binary).
You probably want to read up about the difference between codecs (encodings) and Unicode; UTF-8 is a codec that can handle all of the Unicode standard, but is not the same thing. See:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With