
Java UTF-8 differences

Tags: java, utf-8

The JavaDoc says "The null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls."

But what does this even mean? What's an embedded null in this context? I am trying to convert a string that Java saved in its own UTF-8 format to "real" UTF-8.

Prof. Falken asked Jun 22 '11 12:06




1 Answer

In C, a string is terminated by a byte with the value 00.

Java strings can contain the character '\u0000', but a C function would treat the resulting 00 byte as the end of the string. To avoid that confusion when a string is passed to native code (which is typically written in C), the character is encoded in another way, namely as the two bytes

11000000 10000000

(according to the javadoc) neither of which is actually 00.
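To see the difference for yourself, here is a minimal sketch that encodes the same string both ways. It assumes the string was saved with DataOutputStream.writeUTF, which writes a 2-byte length prefix followed by the modified UTF-8 bytes. The class and helper names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Encodes with writeUTF and strips the 2-byte length prefix,
    // leaving only the modified UTF-8 payload.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Encodes with the standard UTF-8 charset.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "A\u0000B";
        System.out.println(hex(modifiedUtf8(s))); // 41 C0 80 42
        System.out.println(hex(standardUtf8(s))); // 41 00 42
    }
}
```

The modified form contains no 00 byte, so a C function that stops at the first 00 would still see the whole string.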

This is a hack to work around something you cannot change easily: C's convention that a 00 byte ends a string.

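Also note that this two-byte form is an "overlong" encoding. Java's own readers for this format decode it correctly to 00, but it is not valid in standard UTF-8, and strict UTF-8 decoders reject it.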
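For the conversion to "real" UTF-8, one approach is to let Java decode its own format into a String and then re-encode that with the standard charset. The sketch below assumes the bytes were written by DataOutputStream.writeUTF, so they start with the 2-byte length prefix. If your data came from somewhere else (JNI, a class file), the framing will differ. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Converter {
    // Decodes bytes written by writeUTF (length prefix included)
    // and re-encodes the resulting String as standard UTF-8.
    static byte[] toStandardUtf8(byte[] writeUtfBytes) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(writeUtfBytes))) {
            String s = in.readUTF();
            return s.getBytes(StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Length prefix 00 04, then 'A', the two-byte null (C0 80), 'B'.
        byte[] saved = {0x00, 0x04, 0x41, (byte) 0xC0, (byte) 0x80, 0x42};
        byte[] real = toStandardUtf8(saved);
        for (byte b : real) {
            System.out.printf("%02X ", b & 0xFF); // 41 00 42
        }
        System.out.println();
    }
}
```

Going through a String rather than patching bytes by hand also takes care of the other place where Java's format differs from standard UTF-8, namely characters outside the Basic Multilingual Plane, which Java writes as two 3-byte surrogates instead of one 4-byte sequence.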

Thorbjørn Ravn Andersen answered Oct 14 '22 16:10
