
UTF-8 Continuation bytes

Tags: unicode, utf-8

I'm trying to figure out what "continuation bytes" are (for curiosity's sake) in the UTF-8 encoding.

Wikipedia introduces this term in the UTF-8 article without defining it at all.

A Google search returns no useful information either. I'm about to jump into the official specification, but would prefer to read a high-level summary first.

14 revs, 12 users 16% asked Feb 20 '12 04:02



2 Answers

A continuation byte in UTF-8 is any byte where the top two bits are 10.

They are the subsequent bytes in multi-byte sequences. The following table may help:

Unicode code points  Encoding  Binary value
-------------------  --------  ------------
 U+000000-U+00007f   0xxxxxxx  0xxxxxxx

 U+000080-U+0007ff   110yyyxx  00000yyy xxxxxxxx
                     10xxxxxx

 U+000800-U+00ffff   1110yyyy  yyyyyyyy xxxxxxxx
                     10yyyyxx
                     10xxxxxx

 U+010000-U+10ffff   11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                     10zzyyyy
                     10yyyyxx
                     10xxxxxx

Here you can see how the Unicode code points map to UTF-8 multi-byte byte sequences, and their equivalent binary values.
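To make the mapping concrete, here is a minimal Python sketch that encodes a single code point by hand, one branch per row of the table. The function name `encode_code_point` is made up for illustration, and rejecting surrogates (U+D800 to U+DFFF) is an assumption taken from the UTF-8 definition rather than from the table itself. In real code you would simply call `str.encode("utf-8")`.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx: a single byte, identical to ASCII
        return bytes([cp])
    if cp <= 0x7FF:
        # 110yyyxx 10xxxxxx: lead byte plus one continuation byte
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110yyyy 10yyyyxx 10xxxxxx: lead byte plus two continuation bytes
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx: lead byte plus three continuation bytes
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# The euro sign U+20AC falls in the three-byte row.
print(encode_code_point(0x20AC).hex())  # e282ac
```

Every byte after the first is built as `0x80 | (six payload bits)`, which is exactly the 10xxxxxx continuation pattern.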

The basic rules are these:

  1. If a byte starts with a 0 bit, it's a single byte value less than 128.
  2. If it starts with 11, it's the first byte of a multi-byte sequence and the number of 1 bits at the start indicates how many bytes there are in total (110xxxxx has two bytes, 1110xxxx has three and 11110xxx has four).
  3. If it starts with 10, it's a continuation byte.
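The three rules above can be sketched as a small classifier. This is only an illustration: `sequence_length` is a hypothetical name, and returning 0 for a continuation byte is a convention chosen here, not something UTF-8 prescribes.

```python
def sequence_length(byte: int) -> int:
    """Return the total length of the sequence that `byte` starts.

    Returns 0 when `byte` is a continuation byte (it starts nothing).
    """
    if (byte & 0x80) == 0x00:   # 0xxxxxxx: single-byte value below 128
        return 1
    if (byte & 0xC0) == 0x80:   # 10xxxxxx: continuation byte
        return 0
    if (byte & 0xE0) == 0xC0:   # 110xxxxx: first of two bytes
        return 2
    if (byte & 0xF0) == 0xE0:   # 1110xxxx: first of three bytes
        return 3
    if (byte & 0xF8) == 0xF0:   # 11110xxx: first of four bytes
        return 4
    raise ValueError(f"byte {byte:#04x} cannot appear in UTF-8")


# "é" is U+00E9, encoded as the two bytes C3 A9.
print([sequence_length(b) for b in "é".encode("utf-8")])  # [2, 0]
```

Note that this only inspects the bit pattern of one byte. A full validator would also have to reject overlong forms and lead bytes such as C0, C1 and F5 through F7, which this sketch does not attempt.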

This distinction allows quite handy processing such as being able to back up from any byte in a sequence to find the first byte of that code point. Just search backwards until you find one not beginning with the 10 bits.
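A rough sketch of that backward search, assuming the input is valid UTF-8 held in a `bytes` object (the name `code_point_start` is made up for this example):

```python
def code_point_start(data: bytes, index: int) -> int:
    """Return the index of the first byte of the code point covering `index`."""
    # Step backwards while the current byte looks like 10xxxxxx.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


data = "a€b".encode("utf-8")       # bytes: 61 E2 82 AC 62
print(code_point_start(data, 3))   # 1, the E2 lead byte of the euro sign
```

Because a code point takes at most four bytes, the loop backs up at most three positions on well-formed input.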

Similarly, it can also be used for a UTF-8 strlen that counts code points rather than bytes, by only counting non-10xxxxxx bytes.
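As a hedged sketch of such a strlen in Python (the name `utf8_length` is hypothetical, and the result is only meaningful if the input is valid UTF-8):

```python
def utf8_length(data: bytes) -> int:
    """Count code points by counting every byte that is not 10xxxxxx."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


text = "héllo €"
encoded = text.encode("utf-8")
print(len(encoded))          # 10 bytes
print(utf8_length(encoded))  # 7 code points, same as len(text)
```

This counts code points, not user-perceived characters, so a base letter followed by a combining accent still counts as two.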

paxdiablo answered Sep 19 '22 16:09


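In short, continuation bytes are all the bytes of a multi-byte sequence other than the first one; a single-byte character has none. In UTF-8, continuation bytes start with the bits 10 (binary), which means their values lie in the range 0x80 to 0xBF.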
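For reference, the 10xxxxxx bit pattern corresponds to the byte values 0x80 through 0xBF, which can be checked with a mask. The snippet below is a small sketch (the helper name `is_continuation` is made up for illustration); note that the value 0x10 itself is an ordinary single-byte ASCII control character, not a continuation byte.

```python
def is_continuation(byte: int) -> bool:
    """True when the top two bits of the byte are 1 and 0."""
    return (byte & 0xC0) == 0x80


# Collect every byte value that matches the pattern.
continuation_values = [b for b in range(256) if is_continuation(b)]
print(hex(continuation_values[0]), hex(continuation_values[-1]))  # 0x80 0xbf
print(is_continuation(0x10))  # False
```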

rogerz answered Sep 19 '22 16:09