I have a need to manipulate UTF-8 byte arrays in a low-level environment. The strings will be prefix-similar and kept in a container that exploits this (a trie.) To preserve this prefix-similarity as much as possible, I'd prefer to use a terminator at the end of my byte arrays, rather than (say) a byte-length prefix.
What terminator should I use? It seems 0xFF is an illegal byte in all positions of any UTF-8 string, but perhaps someone knows concretely?
0xFF and 0xFE cannot appear in legal UTF-8 data. Also, the bytes 0xF8-0xFD will only appear in the obsolete version of UTF-8 that allows sequences of up to six bytes.
0x00 is legal but won't appear anywhere except in the encoding of U+0000. This is exactly the same as in other encodings, and the fact that it's legal in all of them never stopped it from being used as a terminator in C strings. I'd probably go with 0x00.
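If you do want a byte that can never collide with key data, here is a minimal sketch of the 0xFF variant in C. The names KEY_TERMINATOR, key_is_terminator_free, key_terminate and key_length are made up for illustration and are not from any library. It assumes the input is already well-formed UTF-8 and only checks that the terminator byte does not occur in it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical terminator choice: 0xFF never occurs in well-formed UTF-8. */
#define KEY_TERMINATOR 0xFFu

/* Returns 1 if none of the n bytes equals the terminator, else 0. */
int key_is_terminator_free(const uint8_t *s, size_t n)
{
    return memchr(s, KEY_TERMINATOR, n) == NULL;
}

/* Copies n bytes of src into dst and appends the terminator.
 * cap is the capacity of dst in bytes and must be at least n + 1.
 * Returns the number of bytes written (n + 1), or 0 if dst is too
 * small or src already contains the terminator byte. */
size_t key_terminate(uint8_t *dst, size_t cap, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (n >= cap || !key_is_terminator_free(src, n))
        return 0;
    memcpy(dst, src, n);
    dst[n] = KEY_TERMINATOR;
    return n + 1;
}

/* Length of a terminated key, not counting the terminator.
 * The caller must guarantee that a terminator is present. */
size_t key_length(const uint8_t *key)
{
    size_t n = 0;
    while (key[n] != KEY_TERMINATOR)
        n++;
    return n;
}
```

Swapping KEY_TERMINATOR to 0x00 gives the C-string behaviour instead, at the cost of not being able to store U+0000 inside a key.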