I'm trying to figure out what "continuation bytes" are (for curiosity's sake) in the UTF-8 encoding.
Wikipedia introduces this term in the UTF-8 article without defining it at all.
Google search returns no useful information either. I'm about to jump into the official specification, but would preferably read a high-level summary first.
UTF-8 uses 1 to 4 bytes per character, depending on the Unicode symbol. UTF-8 has the following properties: The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.
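The ASCII compatibility mentioned above can be checked with a small sketch. The sample string and the variable names are arbitrary choices for illustration, not anything from the original posts.

```python
# Illustrative check: pure ASCII text has identical bytes under both encodings.
text = "continuation bytes"  # arbitrary sample containing only ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Both encodings produce the same byte string for ASCII-only input.
same_encoding = ascii_bytes == utf8_bytes

# Every byte has its top bit clear, so there are no lead or continuation bytes.
all_below_0x80 = all(b < 0x80 for b in utf8_bytes)

print(same_encoding, all_below_0x80)
```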
UTF-8 is a variable-width character encoding standard that uses between one and four eight-bit bytes to represent all valid Unicode code points.
UTF-8 is a multibyte encoding able to encode the whole Unicode charset. An encoded character takes between 1 and 4 bytes. The original UTF-8 design allowed longer byte sequences, up to 6 bytes, but the largest Unicode code point (U+10FFFF) needs only 4 bytes, and the current definition (RFC 3629) stops there.
UTF-8 is self-synchronizing. Let's call a byte of the form 10xxxxxx a continuation byte. Every UTF-8 sequence is a byte that is not a continuation byte followed by zero or more continuation bytes.
Since the restriction of the Unicode code-space to 21-bit values in 2003, UTF-8 is defined to encode code points in one to four bytes, depending on the number of significant bits in the numerical value of the code point. The following table shows the structure of the encoding.
Fallback and auto-detection: Only a small subset of possible byte strings is valid UTF-8: the bytes C0, C1, and F5 through FF cannot appear, every byte with the high bit set must belong to a well-formed multi-byte sequence (a lead byte followed by the right number of continuation bytes), and there are further requirements. It is extremely unlikely that readable text in any extended ASCII encoding is also valid UTF-8.
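As a rough illustration of the auto-detection idea, the sketch below leans on Python's built-in strict UTF-8 decoder instead of re-implementing the rules. The helper name `is_valid_utf8` is made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# Text encoded as Latin-1 (an extended ASCII encoding) is rejected here:
# the byte 0xE9 looks like a three-byte lead but no continuation bytes follow.
latin1_text = "café".encode("latin-1")
utf8_text = "café".encode("utf-8")

print(is_valid_utf8(latin1_text), is_valid_utf8(utf8_text))
```

A failed decode is a hint that the text is in some other encoding, not proof of which one.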
In short, continuation bytes are all the bytes of a multi-byte sequence other than its first byte; a single-byte character has none. In UTF-8, continuation bytes start with the bits 10 (binary), so they fall in the range 0x80 to 0xBF.
“Continuation byte” isn’t a term but a normal English word and the term “byte.” If used as a pseudo-term, it may confuse the reader.
A continuation byte in UTF-8 is any byte where the top two bits are 10. They are the subsequent bytes in multi-byte sequences. The following table may help:
Unicode code points  Encoding  Binary value
-------------------  --------  ------------
U+000000-U+00007f    0xxxxxxx  0xxxxxxx

U+000080-U+0007ff    110yyyxx  00000yyy xxxxxxxx
                     10xxxxxx

U+000800-U+00ffff    1110yyyy  yyyyyyyy xxxxxxxx
                     10yyyyxx
                     10xxxxxx

U+010000-U+10ffff    11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                     10zzyyyy
                     10yyyyxx
                     10xxxxxx
Here you can see how the Unicode code points map to UTF-8 multi-byte byte sequences, and their equivalent binary values.
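The mapping can be sketched in a few lines of Python. The function name `encode_code_point` is invented for this example, and rejecting surrogates (U+D800 to U+DFFF) is an added assumption that mirrors what Python's own encoder does; the table itself does not mention it.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp!r}")
    if cp <= 0x7F:
        # 0xxxxxxx: a single byte, no continuation bytes
        return bytes([cp])
    if cp <= 0x7FF:
        # 110yyyxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# The euro sign U+20AC falls in the three-byte row of the table.
print(encode_code_point(0x20AC).hex(" "))
```

Every byte after the first is built as `0x80 | (six payload bits)`, which is exactly the 10xxxxxx continuation-byte pattern.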
The basic rules are these:
1. If a byte starts with a 0 bit, it's a single-byte value less than 128.
2. If it starts with 11, it's the first byte of a multi-byte sequence, and the number of 1 bits at the start indicates how many bytes there are in total (110xxxxx has two bytes, 1110xxxx has three and 11110xxx has four).
3. If it starts with 10, it's a continuation byte.

This distinction allows quite handy processing, such as being able to back up from any byte in a sequence to find the first byte of that code point. Just search backwards until you find a byte that does not begin with the bits 10.
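The backward search could look roughly like this in Python. The names `is_continuation` and `code_point_start` are invented for the example, and the sketch assumes the input is valid UTF-8.

```python
def is_continuation(byte: int) -> bool:
    """True if the top two bits of the byte are 10."""
    return (byte & 0xC0) == 0x80


def code_point_start(data: bytes, index: int) -> int:
    """Back up from index to the first byte of the code point containing it.

    Assumes data is valid UTF-8 and index is within range.
    """
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


data = "a€b".encode("utf-8")  # bytes: 61 | e2 82 ac | 62
# Index 3 is the last byte of the euro sign; its lead byte sits at index 1.
print(code_point_start(data, 3))
```

Because a code point has at most three continuation bytes, the loop steps back at most three times on valid input.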
Similarly, it can also be used for a UTF-8 strlen by only counting non-10xxxxxx bytes.
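A minimal sketch of that counting trick, assuming the input is already valid UTF-8. The name `utf8_strlen` is only illustrative, and the result counts code points, not user-perceived characters.

```python
def utf8_strlen(data: bytes) -> int:
    """Count code points by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


encoded = "héllo €".encode("utf-8")
# 7 code points, but 10 bytes: é takes two bytes and € takes three.
print(utf8_strlen(encoded), len(encoded))
```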