
What's a good terminator byte for UTF-8 data?

Tags: unicode, utf-8

I need to manipulate UTF-8 byte arrays in a low-level environment. The strings will share many prefixes and will be kept in a container that exploits this (a trie). To preserve the shared prefixes as much as possible, I'd prefer to put a terminator at the end of each byte array rather than, say, a byte-length prefix.

What terminator should I use? It seems 0xFF is an illegal byte in every position of any UTF-8 string, but does anyone know for certain?

phs asked Jan 18 '12 20:01


1 Answer

0xFF and 0xFE cannot appear in legal UTF-8 data. The bytes 0xF8-0xFD appear only in the obsolete version of UTF-8 that allowed sequences of up to six bytes.
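If you want to check this rather than take it on trust, here is a minimal Python sketch that encodes every Unicode scalar value and collects the byte values that occur. The function name `bytes_used_by_utf8` is made up for illustration, and it relies on Python's built-in `utf-8` codec following the current four-byte definition of UTF-8.

```python
def bytes_used_by_utf8():
    """Return the set of byte values that occur in the UTF-8 encoding
    of at least one Unicode scalar value."""
    used = set()
    # Surrogates U+D800..U+DFFF are not scalar values and cannot be
    # encoded as UTF-8, so the code point range is split around them.
    for start, stop in ((0x0000, 0xD800), (0xE000, 0x110000)):
        text = "".join(map(chr, range(start, stop)))
        used.update(text.encode("utf-8"))
    return used


# Byte values that never occur in valid UTF-8.
unused = sorted(set(range(256)) - bytes_used_by_utf8())
print([hex(b) for b in unused])
```

Running it should list 0xc0, 0xc1 and 0xf5 through 0xff, so besides 0xFE and 0xFF there are a few other bytes that modern UTF-8 never produces.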

0x00 is legal, but it won't appear anywhere except as the encoding of U+0000. The same is true of other encodings, and being legal in all of them never stopped 0x00 from being used as the terminator in C strings. I'd probably go with 0x00.
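As a rough sketch of how terminated keys might look before they go into the trie (the names `terminated_key` and `common_prefix` are hypothetical, and rejecting strings that contain the terminator is just one possible policy):

```python
def terminated_key(text, terminator=0x00):
    """Encode text as UTF-8 and append a single terminator byte.

    Raises ValueError if the terminator already occurs in the encoded
    data, which for 0x00 means the text contains U+0000.
    """
    data = text.encode("utf-8")
    if terminator in data:
        raise ValueError("text contains the terminator byte")
    return data + bytes([terminator])


def common_prefix(a, b):
    """Return the longest shared leading run of two byte strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return a[:n]


car = terminated_key("car")
cart = terminated_key("cart")
shared = common_prefix(car, cart)
```

Here `shared` is `b"car"`: the terminator sits after the common prefix, so the two keys still share a path in the trie and only branch at the last byte. Passing `terminator=0xFF` works the same way and never needs the check to fire for valid input; the main practical difference is ordering, since with 0x00 a key sorts before its extensions and with 0xFF it sorts after them.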

bames53 answered Oct 11 '22 19:10
