The following code
import java.nio.charset.Charset;

public class CharsetProblem {
    public static void main(String[] args) {
        //String str = "aaaaaaaaa";
        String str = "aaaaaaaaaa";
        Charset cs1 = Charset.forName("ASCII");
        Charset cs2 = Charset.forName("utf8");
        System.out.println(toHex(cs1.encode(str).array()));
        System.out.println(toHex(cs2.encode(str).array()));
    }

    public static String toHex(byte[] outputBytes) {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < outputBytes.length; ++i) {
            builder.append(String.format("%02x", outputBytes[i]));
        }
        return builder.toString();
    }
}
returns
61616161616161616161
6161616161616161616100
i.e. the UTF-8 encoding returns an excess byte. With fewer a's there are no excess bytes; with more a's we get more and more excess bytes.
Why?
How can one work around this?
Because it used to be UCS-2, which was a nice fixed-length 16-bit encoding. Of course, 16 bits turned out not to be enough, so UTF-16 was retrofitted on top. Here is a quote from the Unicode FAQ: "Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts."
NULL is a valid character in UTF-8. If specific languages and their standard libraries choose to treat it as a string terminator (C, I'm looking at you), well, then fine. But it's still valid Unicode.
You can't just get the backing array and use it directly. A ByteBuffer has a capacity, a position, and a limit, and only the bytes between position and limit are meaningful.
System.out.println(cs1.encode(str).remaining());
System.out.println(cs2.encode(str).remaining());
produces:
10
10
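The remaining() values above can be seen alongside the buffer's other bounds. A minimal sketch (the class name BufferBounds is my own, and the exact capacity the encoder allocates is implementation-dependent, so it isn't asserted here):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferBounds {
    public static void main(String[] args) {
        ByteBuffer buf = StandardCharsets.UTF_8.encode("aaaaaaaaaa");
        // encode() returns a flipped buffer: position = 0, limit = number
        // of encoded bytes. capacity() is the size of the backing array,
        // which may be larger than the limit.
        System.out.println("position  = " + buf.position());  // 0
        System.out.println("limit     = " + buf.limit());     // 10
        System.out.println("remaining = " + buf.remaining()); // 10
        System.out.println("capacity  = " + buf.capacity());
    }
}
```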
Try this instead:
public static void main(String[] args) {
    //String str = "aaaaaaaaa";
    String str = "aaaaaaaaaa";
    Charset cs1 = Charset.forName("ASCII");
    Charset cs2 = Charset.forName("utf8");
    System.out.println(toHex(cs1.encode(str)));
    System.out.println(toHex(cs2.encode(str)));
}

public static String toHex(ByteBuffer buff) {
    StringBuilder builder = new StringBuilder();
    while (buff.remaining() > 0) {
        builder.append(String.format("%02x", buff.get()));
    }
    return builder.toString();
}
It produces the expected:
61616161616161616161
61616161616161616161
You're assuming that the backing array for a ByteBuffer is precisely the correct size to hold the contents, but it isn't necessarily. In fact, the contents don't even need to start at the first byte of the array! Study the API for ByteBuffer and you'll understand what's going on: the contents start at the index returned by arrayOffset() and end at the one returned by limit().
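To illustrate, here is a sketch of slicing out exactly the valid region using arrayOffset() and limit(); the class and method names are my own, and it assumes the buffer has an accessible backing array (hasArray() returns true, as it does for the heap buffers that Charset.encode produces):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ArrayOffsetDemo {
    // Copies exactly the bytes between position and limit out of a
    // buffer with an accessible backing array.
    static byte[] validBytes(ByteBuffer buf) {
        int start = buf.arrayOffset() + buf.position();
        int end = buf.arrayOffset() + buf.limit();
        return Arrays.copyOfRange(buf.array(), start, end);
    }

    public static void main(String[] args) {
        ByteBuffer buf = StandardCharsets.UTF_8.encode("aaaaaaaaaa");
        // 10 bytes, even if the backing array is longer
        System.out.println(validBytes(buf).length); // 10
    }
}
```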
The answer has already been given, but as I ran into the same problem, I think it might be useful to provide more details:
The byte array returned by invoking cs1.encode(str).array() or cs2.encode(str).array() is a reference to the whole array allocated for the ByteBuffer at that time. The capacity of that array may be greater than what's actually used. To retrieve only the used portion, do something like the following:
ByteBuffer bf1 = cs1.encode(str);
ByteBuffer bf2 = cs2.encode(str);
System.out.println(toHex(Arrays.copyOf(bf1.array(), bf1.limit())));
System.out.println(toHex(Arrays.copyOf(bf2.array(), bf2.limit())));
This yields the result you expect.
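An alternative sketch that avoids touching the backing array at all is a bulk get into a correctly sized array; the class and method names here are my own. This also works for direct or read-only buffers, where array() would throw:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BulkGetDemo {
    // Transfers exactly the remaining bytes (position..limit) into a
    // new array, advancing the buffer's position to its limit.
    static byte[] drain(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = drain(StandardCharsets.UTF_8.encode("aaaaaaaaaa"));
        System.out.println(bytes.length); // 10
    }
}
```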