
Can UTF-8 contain a zero byte?

Tags:

unicode

Can a UTF-8 string contain zero bytes? I'm going to send it over an ASCII plaintext protocol; should I encode it with something like Base64?

asked Aug 02 '11 04:08 by einclude



2 Answers

Yes, the zero byte in UTF-8 is code point 0, NUL. No other Unicode code point is encoded in UTF-8 with a zero byte anywhere within it.

The possible code points and their UTF-8 encodings are:

Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000000-U+00007f  0xxxxxxx  0xxxxxxx

U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx

U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx

U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                   10zzyyyy
                   10yyyyxx
                   10xxxxxx

You can see that all the non-zero ASCII characters are represented as themselves, while every byte of a multibyte sequence has its high bit set to 1, so none of those bytes can be zero.
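If you would rather check this than take the table on trust, a brute-force pass over every code point is cheap. The following is only a sketch in Python (the function name is made up for illustration): it encodes each code point, skips the surrogate range because surrogates cannot be encoded as UTF-8, and collects any code point whose encoding contains a 0x00 byte.

```python
def code_points_with_zero_byte():
    """Return every code point whose UTF-8 encoding contains a 0x00 byte."""
    hits = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates have no UTF-8 encoding
        encoded = chr(cp).encode("utf-8")
        if len(encoded) > 1:
            # Every byte of a multibyte sequence has its high bit set.
            assert all(b & 0x80 for b in encoded)
        if 0 in encoded:
            hits.append(cp)
    return hits


print(code_points_with_zero_byte())  # [0]
```

Only U+0000 shows up, which matches the table: the single-byte form is the only one whose bits can all be zero.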

You may need to check that your ASCII plaintext protocol doesn't mishandle bytes outside the ASCII range, since every non-ASCII code point is encoded entirely with such bytes.

answered Sep 17 '22 13:09 by paxdiablo


ASCII text is restricted to byte values between 0 and 127. UTF-8 text has no such restriction: bytes in UTF-8-encoded text may have their high bit set. So it's not safe to send UTF-8 text over a channel that doesn't guarantee safe passage for that high bit.

If you're forced to deal with an ASCII-only channel, Base64 is a reasonable (though not particularly space-efficient) choice. Are you sure you're limited to 7-bit data, though? That's somewhat unusual these days.
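For what it's worth, here is a rough sketch of the Base64 route in Python, using the standard `base64` module. The helper names `to_ascii_safe` and `from_ascii_safe` are invented for this example, and the sample message is just an illustration that includes both a NUL and a non-ASCII character.

```python
import base64


def to_ascii_safe(text: str) -> str:
    """Encode text as UTF-8, then wrap the bytes in Base64 (pure ASCII)."""
    raw = text.encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def from_ascii_safe(payload: str) -> str:
    """Reverse to_ascii_safe: Base64-decode, then decode the bytes as UTF-8."""
    return base64.b64decode(payload).decode("utf-8")


message = "caf\u00e9\x00end"   # contains a non-ASCII character and a NUL
wire = to_ascii_safe(message)

assert wire.isascii() and "\x00" not in wire
assert from_ascii_safe(wire) == message
print(len(message.encode("utf-8")), "bytes ->", len(wire), "Base64 characters")
```

The cost is the usual Base64 overhead: every 3 bytes of UTF-8 become 4 ASCII characters on the wire.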

answered Sep 19 '22 13:09 by Michael Petrotta