 

How does decoding in UTF-8 know the byte boundaries?

I've been doing a lot of reading on Unicode encodings, especially with regard to Python. I think I have a pretty solid understanding of it now, but there's one small detail I'm still unsure about.

How does the decoding know the byte boundaries? For example, say I have a Unicode string containing two characters whose UTF-8 byte representations are \xc6\xb4 and \xe2\x98\x82, respectively. I then write this string to a file, so the file now contains the bytes \xc6\xb4\xe2\x98\x82. Later I open and read the file (and Python decodes it as UTF-8 by default), which leads me to my main question.
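For reference, here is roughly how I'm writing and reading the file (the filename is just an example, and I pass the encoding explicitly even though UTF-8 is my system default anyway):

    # The two characters are U+01B4 and U+2602; their UTF-8 encodings
    # are \xc6\xb4 and \xe2\x98\x82 respectively.
    s = '\u01b4\u2602'

    with open('chars.txt', 'wb') as f:
        f.write(s.encode('utf-8'))       # file now holds b'\xc6\xb4\xe2\x98\x82'

    with open('chars.txt', encoding='utf-8') as f:
        print(f.read())                  # prints the two characters back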

How does the decoder know to interpret the first two bytes as \xc6\xb4 and not, say, \xc6\xb4\xe2?

asked Jun 09 '14 by btse

1 Answer

The byte boundaries are easily determined from the bit patterns. In your case, \xc6 starts with the bits 1100, and \xe2 starts with 1110. In UTF-8 (and I'm pretty sure this is not an accident), you can determine the number of bytes in the whole character by looking only at the first byte: count the number of leading 1 bits before the first 0 bit. So your first character is 2 bytes long and the second one is 3 bytes long.

If a byte starts with 0, it is a regular ASCII character.

If a byte starts with 10, it is a continuation byte of a multi-byte UTF-8 sequence (not the first byte of a character).
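Here's a rough sketch of that rule in Python; the helper name is mine, not anything from the standard library:

    def utf8_char_length(first_byte):
        """Number of bytes in the UTF-8 sequence starting with first_byte,
        determined only from its leading bit pattern."""
        if first_byte < 0x80:      # 0xxxxxxx -> ASCII, 1 byte
            return 1
        if first_byte < 0xC0:      # 10xxxxxx -> continuation byte, not a start
            raise ValueError('continuation byte, not the start of a character')
        if first_byte < 0xE0:      # 110xxxxx -> 2-byte sequence
            return 2
        if first_byte < 0xF0:      # 1110xxxx -> 3-byte sequence
            return 3
        return 4                   # 11110xxx -> 4-byte sequence
                                   # (0xF8 and above aren't valid UTF-8; ignored here)

    data = b'\xc6\xb4\xe2\x98\x82'
    i = 0
    while i < len(data):
        n = utf8_char_length(data[i])
        print(data[i:i+n], '->', data[i:i+n].decode('utf-8'))
        i += n

This prints b'\xc6\xb4' -> ƴ and b'\xe2\x98\x82' -> ☂, which is exactly the grouping you were asking about. A real decoder additionally checks that every following byte actually starts with 10.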

answered Nov 14 '22 by Greg Hewgill