Can a UTF-8 string contain zero bytes? I'm going to send it over an ASCII plaintext protocol. Should I encode it with something like Base64?


Can UTF-8 contain zero byte?

2 Answers

Yes, the zero byte in UTF-8 is code point 0, NUL. No other Unicode code point is encoded in UTF-8 with a zero byte anywhere within it.

The possible code points and their UTF-8 encodings are:
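If you'd rather check that claim than take it on faith, a brute-force sketch like the following should do it. It walks every code point that Python's built-in codec can encode and collects the ones whose UTF-8 form contains a 0x00 byte. The function name is made up for illustration, and the surrogate range is skipped because those values are not valid in UTF-8.

```python
def code_points_with_zero_byte():
    """Return every code point whose UTF-8 encoding contains a 0x00 byte."""
    hits = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if 0 in chr(cp).encode("utf-8"):
            hits.append(cp)
    return hits


print(code_points_with_zero_byte())  # prints [0]
```

So if your text never contains U+0000 itself, its UTF-8 form will never contain a zero byte.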

    Range              Encoding  Binary value
    -----------------  --------  --------------------------
    U+000000-U+00007f  0xxxxxxx  0xxxxxxx

    U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                       10xxxxxx

    U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                       10yyyyxx
                       10xxxxxx

    U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                       10zzyyyy
                       10yyyyxx
                       10xxxxxx

You can see that all the non-zero ASCII characters are represented as themselves, while every byte of a multibyte sequence has its high bit set to 1, so none of those bytes can be zero.

You may need to check that your ASCII plaintext protocol doesn't mishandle bytes with the high bit set, since every non-ASCII code point is encoded using only such bytes.
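To make the bit layout concrete, here is a minimal hand-rolled encoder that follows the table row by row. It is only a sketch for illustration (the name `utf8_encode` is my own, and in real code you would just call `str.encode("utf-8")`), but it shows why every lead byte and continuation byte ends up with the high bit set.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand, following the bit layout in the table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        return bytes([cp])                        # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),           # 110yyyxx
                      0x80 | (cp & 0x3F)])        # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),          # 1110yyyy
                      0x80 | ((cp >> 6) & 0x3F),  # 10yyyyxx
                      0x80 | (cp & 0x3F)])        # 10xxxxxx
    return bytes([0xF0 | (cp >> 18),              # 11110zzz
                  0x80 | ((cp >> 12) & 0x3F),     # 10zzyyyy
                  0x80 | ((cp >> 6) & 0x3F),      # 10yyyyxx
                  0x80 | (cp & 0x3F)])            # 10xxxxxx


# Compare against the built-in codec and show the bit patterns.
for ch in ("A", "é", "€", "😃"):
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):06X}", [f"{b:08b}" for b in encoded])
```

Every multibyte branch ORs each byte with at least 0x80, which is why a zero byte can only come out of the first branch, and only for code point 0.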
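One way to decide whether a given message needs special treatment is to inspect its encoded bytes before sending. The helper below is a hypothetical sketch, not part of any protocol library. It assumes your channel has trouble with two things: bytes with the high bit set, and NUL bytes. Adjust the condition if your protocol's restrictions differ.

```python
def needs_transfer_encoding(text: str) -> bool:
    """True if the UTF-8 form contains a high-bit byte or a NUL byte.

    Assumption: the channel is 7-bit and treats 0x00 specially.
    """
    data = text.encode("utf-8")
    return any(b >= 0x80 or b == 0 for b in data)


for sample in ("hello", "café", "a\x00b"):
    print(repr(sample), needs_transfer_encoding(sample))
```

Plain ASCII text without NUL passes through unchanged, so you could send it as-is and only fall back to an encoding for messages that fail the check, at the cost of having to signal which form was used.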

answered Sep 17 '22 13:09

paxdiablo

ASCII text is restricted to byte values between 0 and 127. UTF-8 text has no such restriction: every non-ASCII character is encoded as bytes with the high bit set. So it's not safe to send UTF-8 text over a channel that doesn't guarantee safe passage for that high bit.

If you're forced to deal with an ASCII-only channel, Base64 is a reasonable choice, though not a particularly space-efficient one (the output is about a third larger than the input). Are you sure you're limited to 7-bit data, though? That's somewhat unusual these days.
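If you do go the Base64 route, the round trip might look something like this with Python's standard `base64` module. The function names and the sample message are invented for the example. Note that the standard alphabet includes `+`, `/` and `=`, so check that your protocol doesn't give those characters special meaning.

```python
import base64


def to_ascii_safe(text: str) -> str:
    """Encode text as UTF-8, then as Base64, giving a pure-ASCII string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def from_ascii_safe(payload: str) -> str:
    """Reverse to_ascii_safe; validate=True rejects non-alphabet characters."""
    return base64.b64decode(payload, validate=True).decode("utf-8")


message = "na\u00efve \u0000 caf\u00e9"  # contains non-ASCII text and a NUL
wire = to_ascii_safe(message)

assert wire.isascii() and "\x00" not in wire  # safe for a 7-bit channel
assert from_ascii_safe(wire) == message       # lossless round trip
print(len(message.encode("utf-8")), "bytes in,", len(wire), "characters out")
```

The encoded form carries neither high-bit bytes nor NUL, which covers both concerns from the question.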

answered Sep 19 '22 13:09

Michael Petrotta
