Java bug? Why extra zero byte in utf8 encoding?

The following code

import java.nio.charset.Charset;

public class CharsetProblem {
    public static void main(String[] args) {
        //String str = "aaaaaaaaa";
        String str = "aaaaaaaaaa";
        Charset cs1 = Charset.forName("ASCII");
        Charset cs2 = Charset.forName("utf8");

        // Print the encoded bytes for each charset as hex
        System.out.println(toHex(cs1.encode(str).array()));
        System.out.println(toHex(cs2.encode(str).array()));
    }

    // Formats every byte of the given array as two lowercase hex digits
    public static String toHex(byte[] outputBytes) {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < outputBytes.length; ++i) {
            builder.append(String.format("%02x", outputBytes[i]));
        }
        return builder.toString();
    }
}

returns

61616161616161616161
6161616161616161616100

i.e. the UTF-8 encoding returns an extra zero byte. With fewer a's there are no extra bytes; with more a's, more and more extra bytes appear.

Why?

How can one work around this?

Dims asked Jul 03 '12 21:07


3 Answers

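You can't just get the backing array and use it. ByteBuffers have a capacity, a position and a limit: array() hands back the whole backing array, while only the bytes between the position and the limit are the encoded data.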

System.out.println(cs1.encode(str).remaining());
System.out.println(cs2.encode(str).remaining());

produces:

10
10

Try this instead:

// Requires java.nio.ByteBuffer and java.nio.charset.Charset
public static void main(String[] args) {
  //String str = "aaaaaaaaa";
  String str = "aaaaaaaaaa";
  Charset cs1 = Charset.forName("ASCII");
  Charset cs2 = Charset.forName("utf8");

  System.out.println(toHex(cs1.encode(str)));
  System.out.println(toHex(cs2.encode(str)));
}

// Reads from the current position up to the limit; this consumes the buffer
public static String toHex(ByteBuffer buff) {
  StringBuilder builder = new StringBuilder();
  while (buff.hasRemaining()) {
    builder.append(String.format("%02x", buff.get()));
  }
  return builder.toString();
}

It produces the expected:

61616161616161616161
61616161616161616161
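
To see where the extra byte comes from, it can help to print the buffer's state next to the length of its backing array. The following is only a minimal sketch: the class name BufferStateDemo and the helper describe are made up for illustration, and the exact backing array length is an implementation detail of the encoder, so no particular value is promised here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question or answers.
public class BufferStateDemo {
    // Describes the readable region of a buffer next to the size of its backing array.
    public static String describe(ByteBuffer buf) {
        return "position=" + buf.position()
                + " limit=" + buf.limit()
                + " remaining=" + buf.remaining()
                + " backingArrayLength=" + buf.array().length;
    }

    public static void main(String[] args) {
        String str = "aaaaaaaaaa";
        Charset[] charsets = {StandardCharsets.US_ASCII, StandardCharsets.UTF_8};
        for (Charset cs : charsets) {
            // remaining() is the number of encoded bytes; the array may be longer.
            System.out.println(cs + ": " + describe(cs.encode(str)));
        }
    }
}
```

Whenever backingArrayLength is larger than remaining, the difference is unused capacity, which is what shows up as trailing zero bytes when the raw array is printed.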
Greg Kopff answered Oct 03 '22 12:10


You're assuming that the backing array for a ByteBuffer is precisely the correct size to hold the contents, but it isn't necessarily. In fact, the contents don't even need to start at the first byte of the array! Study the API for ByteBuffer and you'll understand what's going on: the buffer's data starts at the array index returned by arrayOffset(), and the readable contents run from position() to limit(), both measured relative to that offset.
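
As a rough illustration of that point, the sketch below copies only the readable bytes out of an array-backed buffer by combining arrayOffset(), position() and remaining(). The class name BackingArrayDemo and the helper readableBytes are hypothetical, and the sample array is invented to force a non-zero array offset.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
public class BackingArrayDemo {
    // Copies only the bytes between position and limit out of an array-backed buffer.
    public static byte[] readableBytes(ByteBuffer buf) {
        if (!buf.hasArray()) {
            throw new IllegalArgumentException("buffer has no accessible backing array");
        }
        int start = buf.arrayOffset() + buf.position();
        return Arrays.copyOfRange(buf.array(), start, start + buf.remaining());
    }

    public static void main(String[] args) {
        byte[] raw = {0x00, 0x00, 0x61, 0x62, 0x63, 0x00};
        // Slicing a wrapped sub-range gives a buffer whose contents do not start
        // at index 0 of the backing array, so arrayOffset() is non-zero.
        ByteBuffer view = ByteBuffer.wrap(raw, 2, 3).slice();
        System.out.println("arrayOffset=" + view.arrayOffset());
        System.out.println(Arrays.toString(readableBytes(view)));
    }
}
```

Reading view.array() directly would return all six bytes of raw, including the zeros on both sides of the content.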

Ernest Friedman-Hill answered Oct 03 '22 13:10


The answer has already been given, but as I ran into the same problem, I think it might be useful to provide more details:

Invoking cs1.encode(str).array() or cs2.encode(str).array() returns a reference to the whole array allocated to the ByteBuffer at that time. The capacity of that array may be greater than what is actually used. To retrieve only the used portion, you should do something like the following:

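// Requires java.nio.ByteBuffer and java.util.Arrays; toHex(byte[]) is the method from the question
ByteBuffer bf1 = cs1.encode(str);
ByteBuffer bf2 = cs2.encode(str);
// Copy only the first limit() bytes of each backing array
System.out.println(toHex(Arrays.copyOf(bf1.array(), bf1.limit())));
System.out.println(toHex(Arrays.copyOf(bf2.array(), bf2.limit())));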

This yields the result you expect.
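
Note that copying from index 0 up to limit() relies on the buffer starting at the beginning of its array with position 0, which holds for a freshly encoded buffer. A sketch of two alternatives that avoid touching the backing array at all is shown below; the class name EncodeToBytes and the helper toBytes are made up for illustration, and String.getBytes(Charset) is offered as a swapped-in shortcut rather than something the answers above proposed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
public class EncodeToBytes {
    // Copies the readable bytes into a fresh array of exactly the right size.
    public static byte[] toBytes(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        // Read through a duplicate so the caller's position is left untouched.
        buf.duplicate().get(out);
        return out;
    }

    // Formats every byte of the given array as two lowercase hex digits.
    public static String toHex(byte[] bytes) {
        StringBuilder builder = new StringBuilder();
        for (byte b : bytes) {
            builder.append(String.format("%02x", b));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        String str = "aaaaaaaaaa";
        // Option 1: drain the ByteBuffer returned by Charset.encode
        System.out.println(toHex(toBytes(StandardCharsets.UTF_8.encode(str))));
        // Option 2: let String produce an exactly sized array directly
        System.out.println(toHex(str.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Both lines print 61616161616161616161 for the ten-character input, with no trailing zero byte.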

Jan David answered Oct 03 '22 13:10