
When converting a utf-8 encoded string from bytes to characters, how does the computer know where a character ends?

Suppose we have a Unicode string encoded in UTF-8, which is just a sequence of bytes in memory.

If a computer wants to convert these bytes to their corresponding Unicode code points (numbers), how can it know where one character ends and another one begins? Some characters are represented by 1 byte, others by up to 6 bytes. So if you have

00111101 10111001

This could represent 2 characters, or 1. How does the computer decide which it is in order to interpret it correctly? Is there some sort of convention that lets us tell from the first byte how many bytes the current character uses?

asked Mar 28 '13 by Asciiom


1 Answer

The first byte of a multibyte sequence encodes the length of the sequence in the number of leading 1-bits:

  • 0xxxxxxx is a character on its own;
  • 10xxxxxx is a continuation of a multibyte character;
  • 110xxxxx is the first byte of a 2-byte character;
  • 1110xxxx is the first byte of a 3-byte character;
  • 11110xxx is the first byte of a 4-byte character.

Bytes with more than 4 leading 1-bits don't encode valid characters in UTF-8 because the 4-byte sequences already cover more than the entire Unicode range from U+0000 to U+10FFFF.
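For illustration, here is a minimal Python sketch of that first-byte classification (the function name and the use of 0 to mark non-lead bytes are my own; it only mirrors the bit-pattern rule above and does not check for overlong or otherwise invalid sequences):

    def utf8_sequence_length(lead_byte: int) -> int:
        """Return the expected length of the UTF-8 sequence that starts with
        lead_byte, or 0 if the byte cannot start a sequence."""
        if lead_byte < 0x80:   # 0xxxxxxx: single-byte (ASCII) character
            return 1
        if lead_byte < 0xC0:   # 10xxxxxx: continuation byte, not a lead byte
            return 0
        if lead_byte < 0xE0:   # 110xxxxx: first byte of a 2-byte sequence
            return 2
        if lead_byte < 0xF0:   # 1110xxxx: first byte of a 3-byte sequence
            return 3
        if lead_byte < 0xF8:   # 11110xxx: first byte of a 4-byte sequence
            return 4
        return 0               # 5 or more leading 1-bits: invalid in UTF-8

    print(utf8_sequence_length(0x3D))  # 1 -> a character on its own
    print(utf8_sequence_length(0xB9))  # 0 -> a continuation byte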

So, the example posed in the question contains one ASCII character (0x3D, '=') followed by a lone continuation byte (0xB9) that doesn't encode a character on its own, which makes the pair invalid UTF-8 rather than two characters.
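As a quick check, a strict UTF-8 decoder rejects that stray continuation byte, while a lenient one substitutes the replacement character U+FFFD; shown here with Python's built-in decoder as one example:

    data = bytes([0b00111101, 0b10111001])   # the two bytes from the question: 0x3D 0xB9

    print(data[:1].decode('utf-8'))          # '=' : 0x3D is an ASCII character on its own

    try:
        data.decode('utf-8')                 # strict decoding fails on the stray 0xB9
    except UnicodeDecodeError as e:
        print(e)                             # reports an invalid byte at position 1

    print(data.decode('utf-8', errors='replace'))   # '=\ufffd' : 0xB9 becomes U+FFFD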

answered Oct 13 '22 by Joni