public class UTF8 {
    public static void main(String[] args) {
        String s = "ョ"; // U+FF6E, HALFWIDTH KATAKANA LETTER SMALL YO
        System.out.println(s.getBytes().length); // number of bytes in the encoded string
        System.out.println(s.charAt(0));         // first character in the string
    }
}
output:
3
ョ
Please help me understand this. I'm trying to understand how UTF-8 encoding works in Java. According to the Java docs, the definition of char is: "The char data type is a single 16-bit Unicode character."
Does that mean the char type in Java can only support those Unicode characters that can be represented with 2 bytes, and not more than that?
In the above program, the number of bytes allocated for that string is 3, but the third line, which returns the first character (2 bytes in Java), can hold a character that is 3 bytes long? I'm really confused here.
Any good references regarding this concept in Java, or in general, would be really appreciated.
String objects in Java are encoded in UTF-16 internally. The Java platform is also required to support other character encodings, or charsets, such as US-ASCII, ISO-8859-1, and UTF-8, and errors may occur when converting between differently encoded character data.
UTF-8 encodes a character as a sequence of one, two, three, or four bytes. UTF-16 encodes a Unicode character as either two or four bytes. The numbers in their names refer to the size of the code unit: in UTF-8, the smallest representation of a character is one byte, or eight bits; in UTF-16, it is two bytes, or sixteen bits.
UTF-8 supports any Unicode character, which pragmatically means any natural language (Coptic, Sinhala, Phoenician, Cherokee, etc.), as well as many non-spoken notations (music notation, mathematical symbols, APL).
Each UTF can represent any Unicode character that you need to represent. UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8.
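To make those ranges concrete, here is a small sketch (the class name is just for illustration) that prints the UTF-8 byte count for characters in each of the four length tiers:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // U+0041 'A': one of the first 128 code points, so 1 byte in UTF-8
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1
        // U+00E9 'é': 2 bytes in UTF-8
        System.out.println("é".getBytes(StandardCharsets.UTF_8).length);  // 2
        // U+FF6E 'ョ': 3 bytes in UTF-8
        System.out.println("ョ".getBytes(StandardCharsets.UTF_8).length); // 3
        // U+1F600 '😀': outside the BMP, 4 bytes in UTF-8
        System.out.println("😀".getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```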
Nothing in your code example is directly using UTF-8. Java strings are encoded in memory using UTF-16 instead. Unicode codepoints that do not fit in a single 16-bit char will be encoded using a 2-char pair known as a surrogate pair.
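You can see the surrogate-pair mechanism directly: a code point outside the Basic Multilingual Plane (here U+1F600, chosen as an example) occupies two chars in the String, even though it is one logical character:

```java
public class SurrogatePairDemo {
    public static void main(String[] args) {
        String bmp = "ョ";   // U+FF6E fits in a single 16-bit char
        String emoji = "😀"; // U+1F600 needs a surrogate pair

        System.out.println(bmp.length());   // 1 char
        System.out.println(emoji.length()); // 2 chars (the surrogate pair)
        // Still just one Unicode code point:
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1
        // The two chars are the high and low surrogates for U+1F600:
        System.out.printf("%04X %04X%n",
                (int) emoji.charAt(0), (int) emoji.charAt(1)); // D83D DE00
    }
}
```

This is why String.length() counts UTF-16 code units, not characters; use codePointCount() when you need the latter.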
If you do not pass a parameter to String.getBytes(), it returns a byte array containing the String contents encoded using the platform's default charset. If you want to guarantee a UTF-8 encoded array, use getBytes(StandardCharsets.UTF_8) (or getBytes("UTF-8")) instead.
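A quick sketch of the difference (class name is illustrative). Passing an explicit charset makes the result identical on every platform, whereas the no-argument overload depends on the platform default:

```java
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    public static void main(String[] args) {
        String s = "ョ"; // U+FF6E

        // No argument: uses the platform default charset, so the length
        // can vary between machines (note: JDK 18+ defaults to UTF-8).
        System.out.println(s.getBytes().length);

        // Explicit charsets: portable, predictable results.
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 3
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 2
    }
}
```

The StandardCharsets constants are preferable to the string "UTF-8" because they avoid the checked UnsupportedEncodingException and typo-prone charset names.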
Calling String.charAt() returns an original UTF-16 encoded char straight from the String's in-memory storage.
So in your example, the Unicode character ョ (U+FF6E) is stored in the String's in-memory storage using two bytes that are UTF-16 encoded (0x6E 0xFF or 0xFF 0x6E, depending on endianness), but is stored in the byte array from getBytes() using three bytes encoded with whatever the platform default charset is. In UTF-8, that particular Unicode character happens to use three bytes as well (0xEF 0xBD 0xAE).
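Tying this together, a small hex-dump sketch (helper names are illustrative) shows the exact bytes the answer describes, in both encodings:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDump {
    // Print a label followed by the bytes in hex.
    static void dump(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label);
        for (byte b : bytes) sb.append(String.format(" %02X", b));
        System.out.println(sb);
    }

    public static void main(String[] args) {
        String s = "ョ"; // U+FF6E
        dump("UTF-16BE:", s.getBytes(StandardCharsets.UTF_16BE)); // FF 6E
        dump("UTF-16LE:", s.getBytes(StandardCharsets.UTF_16LE)); // 6E FF
        dump("UTF-8:   ", s.getBytes(StandardCharsets.UTF_8));    // EF BD AE
    }
}
```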