The following code
import java.nio.charset.Charset;

public class CharsetProblem {
    public static void main(String[] args) {
        //String str = "aaaaaaaaa";
        String str = "aaaaaaaaaa";
        Charset cs1 = Charset.forName("ASCII");
        Charset cs2 = Charset.forName("utf8");
        System.out.println(toHex(cs1.encode(str).array()));
        System.out.println(toHex(cs2.encode(str).array()));
    }

    public static String toHex(byte[] outputBytes) {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < outputBytes.length; ++i) {
            builder.append(String.format("%02x", outputBytes[i]));
        }
        return builder.toString();
    }
}
returns
61616161616161616161
6161616161616161616100
i.e. the UTF-8 encoding returns an excess byte. With fewer a's there are no excess bytes; with more a's we get more and more excess bytes.
Why?
How can one work around this?
Because it used to be UCS-2, which was a nice fixed-length 16-bit encoding. Of course, 16 bits turned out not to be enough, so UTF-16 was retrofitted on top. Here is a quote from the Unicode FAQ: "Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts."
NULL is a valid character in UTF-8. If specific languages and their standard libraries choose to treat it as a string terminator (C, I'm looking at you), well, then fine. But it's still valid Unicode.
You can't just get the backing array and use it directly. A ByteBuffer has a capacity, a position, and a limit, and only the bytes between position and limit are meaningful.
System.out.println(cs1.encode(str).remaining());
System.out.println(cs2.encode(str).remaining());
produces:
10
10
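The remaining() values above can be seen alongside the buffer's other bounds. A minimal sketch (the class name BufferBounds is my own, and the exact capacity the encoder allocates is implementation-dependent, so it isn't asserted here):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferBounds {
    public static void main(String[] args) {
        ByteBuffer buf = StandardCharsets.UTF_8.encode("aaaaaaaaaa");
        // encode() returns a flipped buffer: position = 0, limit = number
        // of encoded bytes. capacity() is the size of the backing array,
        // which may be larger than the limit.
        System.out.println("position  = " + buf.position());  // 0
        System.out.println("limit     = " + buf.limit());     // 10
        System.out.println("remaining = " + buf.remaining()); // 10
        System.out.println("capacity  = " + buf.capacity());
    }
}
```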
Try this instead:
public static void main(String[] args) {
    //String str = "aaaaaaaaa";
    String str = "aaaaaaaaaa";
    Charset cs1 = Charset.forName("ASCII");
    Charset cs2 = Charset.forName("utf8");
    System.out.println(toHex(cs1.encode(str)));
    System.out.println(toHex(cs2.encode(str)));
}

public static String toHex(ByteBuffer buff) {
    StringBuilder builder = new StringBuilder();
    while (buff.remaining() > 0) {
        builder.append(String.format("%02x", buff.get()));
    }
    return builder.toString();
}
It produces the expected:
61616161616161616161
61616161616161616161
You're assuming that the backing array for a ByteBuffer is precisely the correct size to hold the contents, but it isn't necessarily. In fact, the contents don't even need to start at the first byte of the array! Study the API for ByteBuffer and you'll understand what's going on: the contents start at the index returned by arrayOffset() and end at the one returned by limit().
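To illustrate, here is a sketch of slicing out exactly the valid region using arrayOffset() and limit(); the class and method names are my own, and it assumes the buffer has an accessible backing array (hasArray() returns true, as it does for the heap buffers that Charset.encode produces):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ArrayOffsetDemo {
    // Copies exactly the bytes between position and limit out of a
    // buffer with an accessible backing array.
    static byte[] validBytes(ByteBuffer buf) {
        int start = buf.arrayOffset() + buf.position();
        int end = buf.arrayOffset() + buf.limit();
        return Arrays.copyOfRange(buf.array(), start, end);
    }

    public static void main(String[] args) {
        ByteBuffer buf = StandardCharsets.UTF_8.encode("aaaaaaaaaa");
        // 10 bytes, even if the backing array is longer
        System.out.println(validBytes(buf).length); // 10
    }
}
```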
The answer has already been given, but as I ran into the same problem, I think it might be useful to provide more details:
The byte array returned by invoking cs1.encode(str).array() or cs2.encode(str).array() is a reference to the whole array allocated for the ByteBuffer at that time. The capacity of that array may be greater than what's actually used. To retrieve only the used portion, do something like the following:
ByteBuffer bf1 = cs1.encode(str);
ByteBuffer bf2 = cs2.encode(str);
System.out.println(toHex(Arrays.copyOf(bf1.array(), bf1.limit())));
System.out.println(toHex(Arrays.copyOf(bf2.array(), bf2.limit())));
This yields the result you expect.
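An alternative sketch that avoids touching the backing array at all is a bulk get into a correctly sized array; the class and method names here are my own. This also works for direct or read-only buffers, where array() would throw:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BulkGetDemo {
    // Transfers exactly the remaining bytes (position..limit) into a
    // new array, advancing the buffer's position to its limit.
    static byte[] drain(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = drain(StandardCharsets.UTF_8.encode("aaaaaaaaaa"));
        System.out.println(bytes.length); // 10
    }
}
```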