The JavaDoc says "The null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls."
But what does this even mean? What's an embedded null in this context? I am trying to convert from a Java saved UTF-8 string to "real" UTF-8.
UTF-8 encodes a character into one, two, three, or four bytes; UTF-16 encodes a character into either two or four bytes. The numbers in the names refer to the size of a single code unit: in UTF-8 the smallest representation of a character is one byte (eight bits), in UTF-16 it is two bytes (sixteen bits).
The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of sixteen-bit UTF-16 code units (that is, sequences of chars) and sequences of bytes.
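To make those 16-bit code units concrete, here is a small sketch (the class name is just for the example): a character outside the Basic Multilingual Plane occupies two chars in a Java String even though it is a single code point.

```java
public class Utf16CodeUnits {
    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so UTF-16 represents it as a
        // surrogate pair, i.e. two Java chars (two 16-bit code units).
        String s = new String(Character.toChars(0x1F600));

        System.out.println(s.length());                      // 2 -> UTF-16 code units (chars)
        System.out.println(s.codePointCount(0, s.length())); // 1 -> one Unicode code point
    }
}
```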
UTF-8 requires 8, 16, 24, or 32 bits (one to four bytes) to encode a Unicode character; UTF-16 requires either 16 or 32 bits; and UTF-32 always requires 32 bits.
UTF-16 can be more compact where ASCII is not predominant, since it uses just 2 bytes for most characters. UTF-8 needs 3 or more bytes for characters above U+07FF, where UTF-16 still needs only 2.
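If you want to see those sizes directly, something along these lines should work (the sample strings are arbitrary; UTF_16BE is used so the byte count is not inflated by the byte-order mark that UTF_16 prepends):

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String ascii = "hello";    // ASCII only
        String cjk = "日本語";      // characters above U+07FF

        // ASCII: 1 byte per character in UTF-8, 2 bytes per character in UTF-16
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);     // 5
        System.out.println(ascii.getBytes(StandardCharsets.UTF_16BE).length);  // 10

        // CJK: 3 bytes per character in UTF-8, still 2 bytes in UTF-16
        System.out.println(cjk.getBytes(StandardCharsets.UTF_8).length);       // 9
        System.out.println(cjk.getBytes(StandardCharsets.UTF_16BE).length);    // 6
    }
}
```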
In C a string is terminated by the byte value 00.
The point here is that Java strings may contain the character '\u0000', but to avoid confusion when the string is handed to C code (in which native methods are written), that character is encoded differently, namely as the two bytes
11000000 10000000 (0xC0 0x80 in hex)
(according to the javadoc), neither of which is actually 00.
This is a hack to work around something you cannot change easily.
Also note that a lenient decoder will decode this pair to 00, but it is an "overlong" encoding that strict UTF-8 decoders reject, which is why the format is called modified UTF-8 rather than plain UTF-8.
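Putting it together, here is a minimal sketch (the class name and sample string are mine) that writes a string containing '\u0000' with DataOutputStream.writeUTF, prints the modified UTF-8 bytes, and then produces "real" UTF-8 by reading the data back with readUTF and re-encoding the resulting String with the standard UTF-8 charset:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        String s = "A\u0000B";

        // writeUTF emits a 2-byte length followed by the modified UTF-8 bytes
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        byte[] stored = bos.toByteArray();

        // Payload after the length prefix: 41 C0 80 42 -- no 00 byte anywhere
        for (int i = 2; i < stored.length; i++) {
            System.out.printf("%02X ", stored[i]);
        }
        System.out.println();

        // To get "real" UTF-8, read the data back with readUTF (which understands
        // modified UTF-8) and re-encode the String with a standard charset.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(stored));
        byte[] realUtf8 = in.readUTF().getBytes(StandardCharsets.UTF_8);

        // Now the null really is a single 00 byte: 41 00 42
        for (byte b : realUtf8) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

If the bytes were produced by something other than writeUTF (so there is no length prefix), the same idea applies: decode with a reader that understands modified UTF-8, or replace each 0xC0 0x80 pair with a single 0x00 byte before handing the data to a standard UTF-8 decoder. Keep in mind that modified UTF-8 also encodes supplementary characters differently (as two 3-byte surrogate sequences), so a plain byte replacement only covers the null case.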