I've been doing a bunch of reading on Unicode encodings, especially with regard to Python. I think I have a pretty strong understanding of it now, but there's still one small detail I'm a little unsure about.
How does the decoding know the byte boundaries? For example, say I have a Unicode string containing two characters whose UTF-8 byte representations are \xc6\xb4 and \xe2\x98\x82, respectively. I then write this string to a file, so the file now contains the bytes \xc6\xb4\xe2\x98\x82. Now I decide to open and read the file (and Python defaults to decoding the file as UTF-8), which leads me to my main question.

How does the decoding know to interpret the bytes as \xc6\xb4 and not as \xc6\xb4\xe2?
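For concreteness, here is a minimal Python sketch of the scenario (the two byte sequences are the UTF-8 encodings of U+01B4 'ƴ' and U+2602 '☂'; the file name demo.txt is just illustrative):

```python
# Write a string containing the two characters to a file as UTF-8,
# then read the raw bytes back and decode them again.
text = "\u01b4\u2602"                 # 'ƴ' and '☂'

with open("demo.txt", "wb") as f:
    f.write(text.encode("utf-8"))     # writes b'\xc6\xb4\xe2\x98\x82'

with open("demo.txt", "rb") as f:
    raw = f.read()

print(raw)                 # b'\xc6\xb4\xe2\x98\x82'
print(raw.decode("utf-8")) # 'ƴ☂' -- the decoder finds the boundaries itself
```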
The byte boundaries are easily determined from the bit patterns. In your case, \xc6 (binary 1100 0110) starts with the bits 110, and \xe2 (binary 1110 0010) starts with 1110. In UTF-8 this is by design: you can determine the number of bytes in the whole character by looking only at the first byte and counting the number of 1 bits at the start before the first 0. So your first character is 2 bytes long and the second one is 3 bytes long.
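To make the counting rule concrete, here is a small sketch of it in Python (just an illustration of the rule, not how Python's decoder is implemented internally; it assumes the argument really is the first byte of a sequence):

```python
def utf8_length(first_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence that begins with first_byte."""
    if first_byte < 0x80:        # starts with 0: a single ASCII byte
        return 1
    count = 0
    while first_byte & 0x80:     # count the leading 1 bits before the first 0
        count += 1
        first_byte = (first_byte << 1) & 0xFF
    return count

print(utf8_length(0xC6))  # 2  (1100 0110)
print(utf8_length(0xE2))  # 3  (1110 0010)
```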
If a byte starts with 0, it is a regular one-byte ASCII character.
If a byte starts with 10, it is a continuation byte: part of a multi-byte sequence, but not its first byte.
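In fact, the 10 rule on its own is enough to find the boundaries: every byte that does not start with 10 begins a new character. A minimal, non-validating sketch:

```python
def split_utf8(data: bytes) -> list:
    """Split UTF-8 bytes at character boundaries (no validation)."""
    chunks = []
    for byte in data:
        if byte >> 6 == 0b10:              # 10xxxxxx: continuation byte,
            chunks[-1] += bytes([byte])    # so it extends the current character
        else:                              # 0xxxxxxx or 11xxxxxx: starts a new character
            chunks.append(bytes([byte]))
    return chunks

print(split_utf8(b"\xc6\xb4\xe2\x98\x82"))
# [b'\xc6\xb4', b'\xe2\x98\x82']
```

A real decoder does more than this (it rejects overlong encodings, stray continuation bytes, truncated sequences, and so on), but the boundary-finding itself really is this mechanical.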