A Unicode string encoded in UTF-8 is just bytes in memory.
If a computer wants to convert these bytes to their corresponding Unicode code points (numbers), how can it know where one character ends and the next one begins? Some characters are represented by 1 byte, others by up to 4 bytes. So if you have
00111101 10111001
This could represent 2 characters, or just 1. How does the computer decide which it is so that it interprets the bytes correctly? Is there some convention that lets it tell from the first byte how many bytes the current character uses?
The first byte of a sequence encodes the sequence's length in its number of leading 1-bits:
0xxxxxxx  is a character on its own;
10xxxxxx  is a continuation of a multibyte character;
110xxxxx  is the first byte of a 2-byte character;
1110xxxx  is the first byte of a 3-byte character;
11110xxx  is the first byte of a 4-byte character.

Bytes with more than 4 leading 1-bits don't encode valid characters in UTF-8, because the 4-byte sequences already cover more than the entire Unicode range from U+0000 to U+10FFFF.
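As a rough illustration, here is a minimal Python sketch (the function name is my own, not from any library) that classifies a leading byte by its leading 1-bits to find how long the sequence it starts is:

    def utf8_sequence_length(first_byte: int) -> int:
        """Return how many bytes the UTF-8 sequence starting with
        first_byte occupies, judging only by its leading 1-bits."""
        if first_byte >> 7 == 0b0:       # 0xxxxxxx: ASCII, 1 byte
            return 1
        if first_byte >> 5 == 0b110:     # 110xxxxx: start of a 2-byte character
            return 2
        if first_byte >> 4 == 0b1110:    # 1110xxxx: start of a 3-byte character
            return 3
        if first_byte >> 3 == 0b11110:   # 11110xxx: start of a 4-byte character
            return 4
        # 10xxxxxx is a continuation byte and cannot start a character;
        # 5 or more leading 1-bits are invalid in UTF-8.
        raise ValueError(f"not a valid leading byte: {first_byte:#04x}")

    print(utf8_sequence_length(0b00111101))  # 1 -> the first byte from the question
    print(utf8_sequence_length(0b11100010))  # 3 -> e.g. 0xE2, the first byte of '€'

A real decoder does the same classification, then checks that exactly that many continuation bytes (10xxxxxx) follow before assembling the code point.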
So the example posed in the question starts with one ASCII character (00111101, which is '='), followed by a stray continuation byte that doesn't encode a character on its own.
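Python's built-in decoder reaches the same conclusion (a small demo, not part of the original answer): the first byte decodes fine on its own, while the two bytes together are rejected because the continuation byte has no leading byte in front of it.

    data = bytes([0b00111101, 0b10111001])  # the two bytes from the question

    print(data[:1].decode("utf-8"))         # '=' (U+003D): the ASCII byte stands alone

    try:
        data.decode("utf-8")                # the second byte is a continuation byte
    except UnicodeDecodeError as exc:       # with no leading byte, so strict decoding fails
        print(exc)                          # "... can't decode byte 0xb9 in position 1 ..."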