I've been doing a bunch of reading on Unicode encodings, especially with regard to Python. I think I have a pretty strong understanding of it now, but there's still one small detail I'm a little unsure about.
How does the decoding know the byte boundaries? For example, say I have a Unicode string containing two characters whose UTF-8 byte representations are \xc6\xb4 and \xe2\x98\x82, respectively. I then write this string to a file, so the file now contains the bytes \xc6\xb4\xe2\x98\x82. Now I decide to open and read the file (and Python defaults to decoding the file as UTF-8), which leads me to my main question.

How does the decoding know to interpret the bytes as \xc6\xb4 and not as \xc6\xb4\xe2?
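For concreteness, here is a minimal Python sketch of the scenario (the two byte sequences are the UTF-8 encodings of U+01B4 'ƴ' and U+2602 '☂'; the file name demo.txt is just illustrative):

```python
# Write a string containing the two characters to a file as UTF-8,
# then read the raw bytes back and decode them again.
text = "\u01b4\u2602"                 # 'ƴ' and '☂'

with open("demo.txt", "wb") as f:
    f.write(text.encode("utf-8"))     # writes b'\xc6\xb4\xe2\x98\x82'

with open("demo.txt", "rb") as f:
    raw = f.read()

print(raw)                 # b'\xc6\xb4\xe2\x98\x82'
print(raw.decode("utf-8")) # 'ƴ☂' -- the decoder finds the boundaries itself
```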
The byte boundaries are easily determined from the bit patterns. In your case, \xc6 (binary 1100 0110) starts with the bits 110, and \xe2 (binary 1110 0010) starts with 1110. In UTF-8 this is by design: you can determine the number of bytes in the whole character by looking only at the first byte and counting the number of 1 bits at the start before the first 0. So your first character is 2 bytes long and the second one is 3 bytes long.
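To make the counting rule concrete, here is a small sketch of it in Python (just an illustration of the rule, not how Python's decoder is implemented internally; it assumes the argument really is the first byte of a sequence):

```python
def utf8_length(first_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence that begins with first_byte."""
    if first_byte < 0x80:        # starts with 0: a single ASCII byte
        return 1
    count = 0
    while first_byte & 0x80:     # count the leading 1 bits before the first 0
        count += 1
        first_byte = (first_byte << 1) & 0xFF
    return count

print(utf8_length(0xC6))  # 2  (1100 0110)
print(utf8_length(0xE2))  # 3  (1110 0010)
```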
If a byte starts with 0, it is a regular one-byte ASCII character.
If a byte starts with 10, it is a continuation byte: part of a multi-byte sequence, but not its first byte.
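In fact, the 10 rule on its own is enough to find the boundaries: every byte that does not start with 10 begins a new character. A minimal, non-validating sketch:

```python
def split_utf8(data: bytes) -> list:
    """Split UTF-8 bytes at character boundaries (no validation)."""
    chunks = []
    for byte in data:
        if byte >> 6 == 0b10:              # 10xxxxxx: continuation byte,
            chunks[-1] += bytes([byte])    # so it extends the current character
        else:                              # 0xxxxxxx or 11xxxxxx: starts a new character
            chunks.append(bytes([byte]))
    return chunks

print(split_utf8(b"\xc6\xb4\xe2\x98\x82"))
# [b'\xc6\xb4', b'\xe2\x98\x82']
```

A real decoder does more than this (it rejects overlong encodings, stray continuation bytes, truncated sequences, and so on), but the boundary-finding itself really is this mechanical.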