public class UTF8 {
    public static void main(String[] args) {
        String s = "ョ"; // U+FF6E, HALFWIDTH KATAKANA LETTER SMALL YO
        System.out.println(s.getBytes().length); // number of bytes in the encoded string
        System.out.println(s.charAt(0));         // first character in the string
    }
}
output:
3
ョ
Please help me understand this. I'm trying to understand how UTF-8 encoding works in Java. According to the Java docs, the definition of char is: "The char data type is a single 16-bit Unicode character."
Does that mean the char type in Java can only support those Unicode characters that can be represented with 2 bytes, and not more than that?
In the above program, the number of bytes allocated for that string is 3, but the third line, which returns the first character (2 bytes in Java), can hold a character that is 3 bytes long? I'm really confused here.
Any good references regarding this concept in Java, or in general, would be really appreciated.
String objects in Java are encoded in UTF-16 internally. The Java platform is also required to support other character encodings, or charsets, such as US-ASCII, ISO-8859-1, and UTF-8, and errors may occur when converting between differently encoded character data.
UTF-8 encodes a character as a sequence of one, two, three, or four bytes. UTF-16 encodes a Unicode character as either two or four bytes. The numbers in their names refer to the size of the code unit: in UTF-8, the smallest representation of a character is one byte, or eight bits; in UTF-16, it is two bytes, or sixteen bits.
UTF-8 supports any Unicode character, which pragmatically means any natural language (Coptic, Sinhala, Phoenician, Cherokee, etc.), as well as many non-spoken notations (music notation, mathematical symbols, APL).
Each UTF can represent any Unicode character that you need to represent. UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8.
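To make those ranges concrete, here is a small sketch (the class name is just for illustration) that prints the UTF-8 byte count for characters in each of the four length tiers:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // U+0041 'A': one of the first 128 code points, so 1 byte in UTF-8
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1
        // U+00E9 'é': 2 bytes in UTF-8
        System.out.println("é".getBytes(StandardCharsets.UTF_8).length);  // 2
        // U+FF6E 'ョ': 3 bytes in UTF-8
        System.out.println("ョ".getBytes(StandardCharsets.UTF_8).length); // 3
        // U+1F600 '😀': outside the BMP, 4 bytes in UTF-8
        System.out.println("😀".getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```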
Nothing in your code example is directly using UTF-8. Java strings are encoded in memory using UTF-16 instead. Unicode codepoints that do not fit in a single 16-bit char will be encoded using a 2-char pair known as a surrogate pair.
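You can see the surrogate-pair mechanism directly: a code point outside the Basic Multilingual Plane (here U+1F600, chosen as an example) occupies two chars in the String, even though it is one logical character:

```java
public class SurrogatePairDemo {
    public static void main(String[] args) {
        String bmp = "ョ";   // U+FF6E fits in a single 16-bit char
        String emoji = "😀"; // U+1F600 needs a surrogate pair

        System.out.println(bmp.length());   // 1 char
        System.out.println(emoji.length()); // 2 chars (the surrogate pair)
        // Still just one Unicode code point:
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1
        // The two chars are the high and low surrogates for U+1F600:
        System.out.printf("%04X %04X%n",
                (int) emoji.charAt(0), (int) emoji.charAt(1)); // D83D DE00
    }
}
```

This is why String.length() counts UTF-16 code units, not characters; use codePointCount() when you need the latter.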
If you do not pass a parameter to String.getBytes(), it returns a byte array containing the String contents encoded using the platform's default charset. If you want to guarantee a UTF-8 encoded array, use getBytes(StandardCharsets.UTF_8) (or getBytes("UTF-8")) instead.
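A quick sketch of the difference (class name is illustrative). Passing an explicit charset makes the result identical on every platform, whereas the no-argument overload depends on the platform default:

```java
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    public static void main(String[] args) {
        String s = "ョ"; // U+FF6E

        // No argument: uses the platform default charset, so the length
        // can vary between machines (note: JDK 18+ defaults to UTF-8).
        System.out.println(s.getBytes().length);

        // Explicit charsets: portable, predictable results.
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 3
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 2
    }
}
```

The StandardCharsets constants are preferable to the string "UTF-8" because they avoid the checked UnsupportedEncodingException and typo-prone charset names.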
Calling String.charAt() returns an original UTF-16 encoded char straight from the String's in-memory storage.
So in your example, the Unicode character ョ (U+FF6E) is stored in the String's in-memory storage using two bytes that are UTF-16 encoded (0x6E 0xFF or 0xFF 0x6E, depending on endianness), but is stored in the byte array from getBytes() using three bytes encoded with whatever the platform default charset is. In UTF-8, that particular Unicode character happens to use three bytes as well (0xEF 0xBD 0xAE).
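Tying this together, a small hex-dump sketch (helper names are illustrative) shows the exact bytes the answer describes, in both encodings:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDump {
    // Print a label followed by the bytes in hex.
    static void dump(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label);
        for (byte b : bytes) sb.append(String.format(" %02X", b));
        System.out.println(sb);
    }

    public static void main(String[] args) {
        String s = "ョ"; // U+FF6E
        dump("UTF-16BE:", s.getBytes(StandardCharsets.UTF_16BE)); // FF 6E
        dump("UTF-16LE:", s.getBytes(StandardCharsets.UTF_16LE)); // 6E FF
        dump("UTF-8:   ", s.getBytes(StandardCharsets.UTF_8));    // EF BD AE
    }
}
```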