What's a good terminator byte for UTF-8 data?

Question

I need to manipulate UTF-8 byte arrays in a low-level environment. The strings will be prefix-similar and kept in a container that exploits this (a trie). To preserve this prefix-similarity as much as possible, I'd prefer to use a terminator at the end of my byte arrays, rather than (say) a byte-length prefix.

What terminator should I use? It seems 0xFF is an illegal byte in all positions of any UTF-8 string, but can someone confirm this?

bames53 · Accepted Answer

0xFF and 0xFE cannot appear in legal UTF-8 data. The bytes 0xF8-0xFD appear only in the obsolete version of UTF-8 that allowed sequences of up to six bytes; under the current definition (RFC 3629), which stops at four-byte sequences, they are invalid as well.
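As a rough illustration of using 0xFF as the terminator, here is a minimal sketch in C. It assumes the current RFC 3629 definition of UTF-8, under which 0xC0, 0xC1, and 0xF5-0xFF never occur in well-formed data. The names `never_in_utf8`, `KEY_TERMINATOR`, and `key_length` are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if byte b can never occur in well-formed
 * UTF-8 as defined by RFC 3629 (0xC0, 0xC1, and 0xF5 through 0xFF). */
static bool never_in_utf8(unsigned char b)
{
    return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}

/* Assumed terminator choice for this sketch. */
#define KEY_TERMINATOR 0xFFu

/* Returns the number of bytes before the terminator. Scans at most cap
 * bytes and returns cap if no terminator is found in that range. */
static size_t key_length(const unsigned char *key, size_t cap)
{
    assert(key != NULL);
    size_t n = 0;
    while (n < cap && key[n] != KEY_TERMINATOR)
        n++;
    return n;
}
```

For the buffer `{0xC3, 0xA9, 0xFF}` (the encoding of "é" followed by the terminator), `key_length` returns 2. This only holds if the stored keys are validated as well-formed UTF-8 first; arbitrary byte strings could contain 0xFF.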

0x00 is legal but won't appear anywhere except in the encoding of U+0000. This is exactly the same as other encodings, and the fact that it's legal in all these encodings never stopped it from being used as a terminator in C strings. I'd probably go with 0x00.
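If you go with 0x00, the one case to guard against is input that contains U+0000 itself, since its encoding is the single byte 0x00 and would cut the key short. A minimal sketch of that check follows; `terminate_key` is a hypothetical name, and returning 0 on failure is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies len bytes of UTF-8 from src into dst and
 * appends a 0x00 terminator. Returns the total bytes written (len + 1),
 * or 0 if dst is too small or src contains an embedded 0x00 (U+0000). */
static size_t terminate_key(unsigned char *dst, size_t cap,
                            const unsigned char *src, size_t len)
{
    assert(dst != NULL);
    assert(src != NULL || len == 0);

    if (len >= cap)
        return 0; /* no room for the data plus the terminator */
    if (len > 0 && memchr(src, 0x00, len) != NULL)
        return 0; /* an embedded NUL would truncate the key */

    if (len > 0)
        memcpy(dst, src, len);
    dst[len] = 0x00;
    return len + 1;
}
```

After a successful call, the result works with the usual C string functions such as `strlen`, which is the main practical advantage of 0x00 over 0xFF.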

Tags:

unicode

utf-8

phs
