I am trying to understand character encoding in Java. Java stores characters in 16 bits using UTF-16 encoding. So when I convert a string containing 6 characters to bytes, I expect to get 12 bytes, but I get 6, as shown below. Is there a concept I am missing?
package learn.java;

public class CharacterTest {
    public static void main(String[] args) {
        String str = "Hadoop";
        // No charset given, so the platform default charset is used
        byte[] bt = str.getBytes();
        System.out.println("the length of character array is " + bt.length);
    }
}
Output: the length of character array is 6
As suggested by @Darshan, I tried getting the bytes with UTF-16 encoding, but the result is still not what I expected.
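package learn.java;

public class CharacterTest {
    public static void main(String[] args) {
        String str = "Hadoop";
        try {
            // Encode explicitly as UTF-16 instead of the platform default
            byte[] bt = str.getBytes("UTF-16");
            System.out.println("the length of character array is " + bt.length);
        } catch (Exception e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
    }
}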
Output: the length of character array is 14
In the UTF-16 version, you get 14 bytes because a 2-byte byte-order mark (BOM) is inserted at the start to distinguish between big-endian (the default) and little-endian. If you specify UTF-16LE, you will get 12 bytes (little-endian, no byte-order mark added).
See http://www.unicode.org/faq/utf_bom.html#gen7
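To make the byte counts concrete, here is a minimal sketch that encodes the same string with the three UTF-16 variants. The class name BomLengths and the helper encodedLength are made up for this illustration; the charset constants come from java.nio.charset.StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomLengths {
    // Hypothetical helper: number of bytes produced when encoding text with charset.
    public static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String str = "Hadoop";
        // Plain UTF-16 prepends a 2-byte byte-order mark: 2 + 6 * 2 = 14
        System.out.println("UTF-16:   " + encodedLength(str, StandardCharsets.UTF_16));
        // The explicit-endianness variants write no mark: 6 * 2 = 12
        System.out.println("UTF-16BE: " + encodedLength(str, StandardCharsets.UTF_16BE));
        System.out.println("UTF-16LE: " + encodedLength(str, StandardCharsets.UTF_16LE));
    }
}
```

The two-byte difference between the first line and the other two is the byte-order mark.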
EDIT - Use this program to look into the actual bytes generated by different encodings:
public class Test {
    public static void main(String args[]) throws Exception {
        // bytes in the first argument, encoded using second argument
        byte[] bs = args[0].getBytes(args[1]);
        System.err.println(bs.length + " bytes:");
        // print hex values of bytes and (if printable), the char itself
        char[] hex = "0123456789ABCDEF".toCharArray();
        for (int i = 0; i < bs.length; i++) {
            // map the signed byte to its unsigned value 0..255
            int b = (bs[i] < 0) ? bs[i] + 256 : bs[i];
            System.err.print(hex[b >> 4] + "" + hex[b & 0xf]
                + (!Character.isISOControl((char) b) ? "" + (char) b : ".")
                + ((i % 4 == 3) ? "\n" : " "));
        }
        System.err.println();
    }
}
For example, when running with UTF-8 as the JVM default encoding (under other default encodings, the characters shown for FE and FF would differ), the output is:
$ javac Test.java && java -cp . Test hello UTF-16
12 bytes:
FEþ FFÿ 00. 68h
00. 65e 00. 6Cl
00. 6Cl 00. 6Fo
And
$ javac Test.java && java -cp . Test hello UTF-16LE
10 bytes:
68h 00. 65e 00.
6Cl 00. 6Cl 00.
6Fo 00.
And
$ javac Test.java && java -cp . Test hello UTF-16BE
10 bytes:
00. 68h 00. 65e
00. 6Cl 00. 6Cl
00. 6Fo
As per the String.getBytes() method's documentation, the string is encoded into a sequence of bytes using the platform's default charset.
I assume your platform default charset is ISO-8859-1 (or a similar one-byte-per-character charset), which encodes each character into one byte. A UTF-8 default would give the same result here, since UTF-8 encodes each ASCII character in a single byte.
If you want to specify the encoding, use the method String.getBytes(Charset) or String.getBytes(String).
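As a rough sketch of the difference between the default and an explicit charset, the program below prints the platform default and then encodes the same string both ways. The class name ExplicitCharset and the helper byteCount are invented for this example; the default charset it reports depends on your machine, so no output is shown for that line.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Hypothetical helper: byte count of text under an explicitly chosen charset.
    public static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String str = "Hadoop";
        // Platform dependent: this is what the no-argument getBytes() uses
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default bytes:   " + str.getBytes().length);
        // Explicit charsets give the same answer on every platform
        System.out.println("ISO-8859-1: " + byteCount(str, StandardCharsets.ISO_8859_1)); // 6
        System.out.println("UTF-8:      " + byteCount(str, StandardCharsets.UTF_8));      // 6
        System.out.println("UTF-16BE:   " + byteCount(str, StandardCharsets.UTF_16BE));   // 12
    }
}
```

Unlike String.getBytes(String), the String.getBytes(Charset) overload does not throw a checked UnsupportedEncodingException, so no try/catch is needed.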
About the 16-bit storage: this is how Java represents characters internally, and therefore strings as well. It is based on the original Unicode specification, in which every character fit in 16 bits.
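The 12 bytes expected in the question correspond to this 16-bit char view of the string, not to what getBytes() returns. A small sketch, assuming a made-up class CharUnits and helper utf16CodeUnitBytes:

```java
public class CharUnits {
    // Hypothetical helper: size in bytes of the string seen as 16-bit chars.
    public static int utf16CodeUnitBytes(String text) {
        // length() counts chars; Character.BYTES is 2 (bytes per char)
        return text.length() * Character.BYTES;
    }

    public static void main(String[] args) {
        String str = "Hadoop";
        System.out.println("chars: " + str.length());                    // 6
        System.out.println("bytes as chars: " + utf16CodeUnitBytes(str)); // 12
    }
}
```

This number is independent of any charset, whereas the result of getBytes() depends on the charset used for encoding.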