
What is the length of a string encoded in a ByteBuffer

byte[] byteArray = Charset.forName("UTF-8").encode("hello world").array();
System.out.println(byteArray.length);

Why does the above code print 12? Shouldn't it print 11 instead?

Umesh asked Sep 19 '14 19:09



2 Answers

The length of the array is the capacity of the ByteBuffer's backing array, which is derived from, but not necessarily equal to, the number of characters you are encoding. Let's take a look at how memory is allocated for the ByteBuffer.

If you drill into the encode() method, you'll find that CharsetEncoder#encode(CharBuffer) looks like this:

public final ByteBuffer encode(CharBuffer in)
    throws CharacterCodingException
{
    int n = (int)(in.remaining() * averageBytesPerChar());
    ByteBuffer out = ByteBuffer.allocate(n);
    ...

According to my debugger, the averageBytesPerChar of a UTF_8$Encoder is 1.1, and the input String has 11 characters. 11 * 1.1 = 12.1, and the cast to int truncates that to 12, so the ByteBuffer is allocated with a capacity of 12, one byte more than the 11 bytes actually written.
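To see that arithmetic outside a debugger, here is a minimal sketch that recomputes the same sizing expression by hand. The class and method names (AllocationEstimate, estimatedCapacity) are made up for illustration, and the 1.1 figure is what the answer above observed, so the printed values may differ on other JDK implementations.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK.
class AllocationEstimate {
    // Same expression as the quoted source:
    // remaining chars * averageBytesPerChar, truncated by the int cast.
    static int estimatedCapacity(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        return (int) (CharBuffer.wrap(s).remaining() * encoder.averageBytesPerChar());
    }

    public static void main(String[] args) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        // Implementation-specific value; the answer above reports 1.1.
        System.out.println("averageBytesPerChar = " + encoder.averageBytesPerChar());
        System.out.println("estimated capacity  = " + estimatedCapacity("hello world"));
    }
}
```

This only reproduces the initial allocation estimate. The real encoder may still grow the buffer if the estimate turns out to be too small.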

azurefrog answered Nov 15 '22 19:11


Because encode() returns a ByteBuffer, and array() hands back its entire backing array. The array's length reflects the buffer's capacity (and not necessarily even that, because of possible slicing), not how many bytes are actually used. It's a bit like how malloc(10) is free to return 32 bytes of memory.

System.out.println(Charset.forName("UTF-8").encode("hello world").limit());

That prints 11, as expected.
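If you need the encoded bytes themselves rather than just the count, a rough sketch along these lines copies only the readable region of the buffer. The class and method names (EncodedLength, encodedLength, encodedBytes) are made up for illustration; the ByteBuffer and String calls are standard JDK methods.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK.
class EncodedLength {
    // Number of encoded bytes: the readable region of the buffer,
    // not the length of its backing array.
    static int encodedLength(String s) {
        ByteBuffer buf = StandardCharsets.UTF_8.encode(s);
        return buf.remaining();
    }

    // Copies only the readable bytes into a right-sized array.
    static byte[] encodedBytes(String s) {
        ByteBuffer buf = StandardCharsets.UTF_8.encode(s);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength("hello world"));        // 11
        System.out.println(encodedBytes("hello world").length);  // 11
        // If a ByteBuffer is not needed at all, getBytes returns an exact-size array.
        System.out.println("hello world".getBytes(StandardCharsets.UTF_8).length); // 11
    }
}
```

For the original question, "hello world".getBytes(StandardCharsets.UTF_8) is probably the simplest fix, since it never exposes the oversized backing array.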

David Ehrmann answered Nov 15 '22 19:11