
java utf8 encoding - char, string types


    public class UTF8 {
        public static void main(String[] args) {
            String s = "ヨ"; //0xFF6E
            System.out.println(s.getBytes().length); //number of bytes returned by getBytes()
            System.out.println(s.charAt(0)); //first character in the string
        }
    }

output:

    3
    ヨ

Please help me understand this. I am trying to understand how UTF-8 encoding works in Java. The Java documentation defines char as follows: "The char data type is a single 16-bit Unicode character."

Does this mean that the char type in Java can only hold Unicode characters that can be represented in 2 bytes, and no more?

In the above program, the number of bytes allocated for the string is 3, yet the third line, which returns the first character (2 bytes in Java), can hold a character that is 3 bytes long? I am really confused here.

Any good references on this concept, in Java or in general, would be really appreciated.

akd asked Aug 29 '12 22:08




1 Answer

Nothing in your code example uses UTF-8 directly. Java strings are encoded in memory using UTF-16. Unicode code points that do not fit in a single 16-bit char (those above U+FFFF) are encoded as a pair of chars known as a surrogate pair.
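To make the surrogate-pair point concrete, here is a minimal sketch. The class and method names (SurrogateDemo, charCount, codePointCount) are made up for illustration, and U+1F600 is just one arbitrary code point above U+FFFF; only the String and Character calls are standard library API.

```java
public class SurrogateDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    public static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points, counting each surrogate pair once.
    public static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+FF6E fits in a single 16-bit char.
        String bmp = "\uFF6E";
        // U+1F600 is above U+FFFF, so it needs two chars (illustrative choice).
        String supplementary = new String(Character.toChars(0x1F600));

        System.out.println(charCount(bmp) + " " + codePointCount(bmp));                     // 1 1
        System.out.println(charCount(supplementary) + " " + codePointCount(supplementary)); // 2 1
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0)));             // true
        System.out.println(Character.isLowSurrogate(supplementary.charAt(1)));              // true
    }
}
```

So a single char can hold any code point up to U+FFFF, and anything beyond that occupies two chars in the String.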

If you do not pass a parameter value to String.getBytes(), it returns a byte array that has the String contents encoded using the platform's default charset, which is typically derived from the underlying OS settings. If you want to ensure a UTF-8 encoded array, then you need to use getBytes("UTF-8") instead.
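As a rough sketch of how the byte count depends on the charset you ask for, the following uses the getBytes(Charset) overload with the constants from java.nio.charset.StandardCharsets (available since Java 7), which avoids the checked UnsupportedEncodingException that getBytes("UTF-8") declares. The class name ExplicitCharset and the helper encodedLength are hypothetical names for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Number of bytes the string occupies when encoded in the given charset.
    public static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "\uFF6E";

        // What the no-argument getBytes() would use on this machine.
        System.out.println(Charset.defaultCharset());

        System.out.println(encodedLength(s, StandardCharsets.UTF_8));      // 3
        System.out.println(encodedLength(s, StandardCharsets.UTF_16BE));   // 2
        // ISO-8859-1 cannot represent U+FF6E, so it is replaced by a single '?' byte.
        System.out.println(encodedLength(s, StandardCharsets.ISO_8859_1)); // 1
    }
}
```

The same String yields a different number of bytes for each charset, which is why the 3 in the question says something about the encoding used by getBytes() and nothing about the size of a char.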

Calling String.charAt() simply returns one UTF-16 code unit (a char) from the String's in-memory storage. No conversion to bytes or to another encoding is involved.

So in your example, the Unicode character is stored in the String's in-memory storage using two bytes that are UTF-16 encoded (0x6E 0xFF or 0xFF 0x6E, depending on byte order), but is stored in the byte array from getBytes() using three bytes that are encoded using whatever the default charset is.

In UTF-8, that particular Unicode character happens to use 3 bytes as well (0xEF 0xBD 0xAE).
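If you want to check those byte values yourself, a small sketch along these lines should do. Utf8Hex, hex, and encodeThreeByte are illustrative names, and encodeThreeByte is a hand-rolled helper that only handles the three-byte range (U+0800 to U+FFFF, excluding surrogates); it is not a general UTF-8 encoder.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Hex {
    // Formats bytes as space-separated uppercase hex, e.g. "EF BD AE".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Manual UTF-8 encoding for the three-byte form: 1110xxxx 10xxxxxx 10xxxxxx.
    // Assumes the code point is in U+0800..U+FFFF and is not a surrogate.
    public static byte[] encodeThreeByte(int codePoint) {
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),         // top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)), // middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))         // low 6 bits
        };
    }

    public static void main(String[] args) {
        String s = "\uFF6E";

        // The char itself is just the 16-bit value 0xFF6E.
        System.out.println(Integer.toHexString(s.charAt(0)).toUpperCase()); // FF6E

        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // FF 6E
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // EF BD AE
        System.out.println(hex(encodeThreeByte(0xFF6E)));               // EF BD AE
    }
}
```

The 16 bits of 0xFF6E are split 4 + 6 + 6 across the three UTF-8 bytes, which is where the extra byte comes from: the code point is the same, only the packaging differs.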

Remy Lebeau answered Sep 29 '22 20:09