
Unknown bytes returned by getBytes()



import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class Main {
 public static void main(String[] args)
 {
  try 
  {
   String s = "s";
   System.out.println( Arrays.toString( s.getBytes("utf8") ) );
   System.out.println( Arrays.toString( s.getBytes("utf16") ) );
   System.out.println( Arrays.toString( s.getBytes("utf32") ) );
  }  
  catch (UnsupportedEncodingException e) 
  {
   e.printStackTrace();
  }
 }
}

Console:


[115]
[-2, -1, 0, 115]
[0, 0, 0, 115]

What are the leading bytes [-2, -1] in the UTF-16 output?

Also, I noticed that if I do this:


String s = new String(new char[]{'\u1251'});
System.out.println( Arrays.toString( s.getBytes("utf8") ) );
System.out.println( Arrays.toString( s.getBytes("utf16") ) );
System.out.println( Arrays.toString( s.getBytes("utf32") ) );

Console:


[-31, -119, -111]
[-2, -1, 18, 81]
[0, 0, 18, 81]
mr. Vachovsky asked Nov 16 '10 13:11



2 Answers

Don't forget that bytes are signed in Java. So -2, -1 really means 0xFE 0xFF... and U+FEFF is the Unicode byte order mark (BOM)... that's what you're seeing here in the UTF-16 version.
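To make the signed-byte point concrete, here is a minimal sketch that prints the same bytes as unsigned hex by masking each one with 0xFF. The class name `SignedBytesDemo` and the helper `toHex` are made up for illustration, and `StandardCharsets` assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

public class SignedBytesDemo {
    // Hypothetical helper: formats each byte as two unsigned hex digits.
    // Masking with 0xFF turns the signed byte -2 into the int 254 (0xFE).
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf16 = "s".getBytes(StandardCharsets.UTF_16);
        System.out.println(toHex(utf16)); // FE FF 00 73
    }
}
```

The first two bytes, FE FF, are the BOM; 00 73 is the letter "s" in big-endian order.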

To avoid getting the BOM when encoding, use UTF-16BE or UTF-16LE explicitly. (I would also suggest using the names which are guaranteed by the platform rather than just "utf8" etc. Admittedly the name is guaranteed to be found case-insensitively, but the lack of a hyphen makes it less reliable, and there are no downsides to using the canonical name.)
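As a rough illustration of that advice, the sketch below encodes the same string with the plain UTF-16 charset and with the two explicit byte orders. It assumes Java 7 or later for the `StandardCharsets` constants, which also remove the need to catch `UnsupportedEncodingException`; the class name `ExplicitUtf16Demo` and the helper `encode` are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitUtf16Demo {
    // Hypothetical helper: encodes text and renders the signed byte values.
    public static String encode(String text, Charset charset) {
        return Arrays.toString(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String s = "s";
        // Plain UTF-16 writes a big-endian BOM before the data.
        System.out.println(encode(s, StandardCharsets.UTF_16));   // [-2, -1, 0, 115]
        // The explicit byte orders write no BOM.
        System.out.println(encode(s, StandardCharsets.UTF_16BE)); // [0, 115]
        System.out.println(encode(s, StandardCharsets.UTF_16LE)); // [115, 0]
    }
}
```

On older Java versions the same result should be available through `getBytes("UTF-16BE")` and `getBytes("UTF-16LE")`, using the canonical names.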

Jon Skeet answered Oct 26 '22 12:10


The -2, -1 is a byte order mark (BOM, U+FEFF) that indicates that the following text is encoded in UTF-16 and which byte order it uses.

You are probably getting this because, while UTF-8 has only one byte order, UTF-16 has two variants, UTF-16LE and UTF-16BE, in which the 2 bytes of each 16-bit value are stored in little-endian or big-endian order. (UTF-32 also has little-endian and big-endian variants, but as your output shows, Java's UTF-32 encoder writes big-endian bytes without a BOM.)

As the values that come back are 0xFE 0xFF, the bytes are in big-endian order, i.e. UTF-16BE.
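To see the BOM doing its job on the decoding side, here is a small sketch, assuming Java 7 or later. The class name `BomDecodeDemo` and the helper `decode` are made up for illustration. The plain UTF-16 charset reads the BOM to pick the byte order and then drops it, whereas UTF-16BE treats the same two bytes as an ordinary character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    // Hypothetical helper: decodes raw bytes with the given charset.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "s" with a big-endian BOM (FE FF) and with a little-endian BOM (FF FE).
        byte[] bigEndian = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x73};
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x73, 0x00};

        // UTF-16 uses the BOM to choose the byte order and removes it.
        System.out.println(decode(bigEndian, StandardCharsets.UTF_16));    // s
        System.out.println(decode(littleEndian, StandardCharsets.UTF_16)); // s

        // UTF-16BE keeps the BOM as the character U+FEFF, so the length is 2.
        System.out.println(decode(bigEndian, StandardCharsets.UTF_16BE).length()); // 2
    }
}
```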

Simon Callan answered Oct 26 '22 12:10