Differing sizes of String representation in Java

Tags:

java

I'm comparing the various ways of storing a String in Java by breaking a String down into its constituent parts. I have this code snippet:

final String message = "ABCDEFGHIJ";
System.out.println("As String " + RamUsageEstimator.humanSizeOf(message));
System.out.println("As byte[] " + RamUsageEstimator.humanSizeOf(message.getBytes()));
System.out.println("As char[] " + RamUsageEstimator.humanSizeOf(message.toCharArray()));

This is using sizeof to measure the size of the objects. The results of the above show:

As String 64 bytes
As byte[] 32 bytes
As char[] 40 bytes

Given that a byte is 8 bits and a char is 16 bits, why are the results not 10 bytes and 20 bytes respectively?

Also, what overhead in the String object makes it twice the size of the underlying byte[]?

This is using

java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)

On OSX

imrichardcole asked Feb 11 '16 11:02



2 Answers

The data below is for HotSpot / Java 8 - numbers will vary for other JVMs/Java versions (for example, in Java 7, String has two additional int fields).

A new Object() has a 12-byte object header (used for JVM-internal bookkeeping); padded to a multiple of 8 bytes, the instance occupies 16 bytes of memory.

A String has (number of bytes in brackets):

  • an object header (12),
  • a reference to a char[] (4 - assuming compressed OOP in 64 bit JVM),
  • an int hash (4).

That's 20 bytes but objects get padded to multiples of 8 bytes => 24. So that's already 24 bytes on top of the actual content of the array.

The char[] has a header (12), a length (4) and each char (10 x 2 = 20) = 36, padded to the next multiple of 8 = 40.

The byte[] has a header (12), a length (4) and each byte (10 x 1 = 10) = 26, padded to the next multiple of 8 = 32.

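So we get to your numbers: 24 (String) + 40 (its char[]) = 64 bytes for the String, 40 bytes for the char[] and 32 bytes for the byte[].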
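The arithmetic above can be reproduced with a small sketch. The constants are assumptions taken from this answer (64-bit HotSpot, Java 8, compressed OOPs); the class and method names are hypothetical and only illustrate the padding rule, they do not measure anything on a live JVM.

```java
public class LayoutArithmetic {
    // Assumed sizes for 64-bit HotSpot / Java 8 with compressed OOPs.
    static final int HEADER = 12;       // object header
    static final int ARRAY_LENGTH = 4;  // length field of an array
    static final int REFERENCE = 4;     // compressed reference
    static final int INT = 4;           // int field
    static final int ALIGNMENT = 8;     // objects are padded to multiples of 8

    // Round a size up to the next multiple of ALIGNMENT.
    static long align(long size) {
        return (size + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    // Header + length field + elements, padded.
    static long arraySize(int length, int bytesPerElement) {
        return align(HEADER + ARRAY_LENGTH + (long) length * bytesPerElement);
    }

    // Header + reference to char[] + int hash, padded (array not included).
    static long stringShallowSize() {
        return align(HEADER + REFERENCE + INT);
    }

    public static void main(String[] args) {
        long charArray = arraySize(10, 2);
        long byteArray = arraySize(10, 1);
        System.out.println("char[] " + charArray);                          // 40
        System.out.println("byte[] " + byteArray);                          // 32
        System.out.println("String " + (stringShallowSize() + charArray));  // 64
    }
}
```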

Also note that the number of bytes depends on the encoding you use - if you retry with message.getBytes(StandardCharsets.UTF_16) for example, you will see that the byte array uses 40 bytes instead of 32 (the encoded data is 22 bytes: 2 for the byte-order mark plus 10 x 2, so 16 + 22 = 38, padded to 40).
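A quick way to see the effect of the charset is to print the length of the encoded array for a few standard charsets. This is only a sketch; the class and helper names are hypothetical, and the sizes in the comments are the raw array lengths, before the 16 bytes of header and length field and the padding are added.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedLengths {
    // Number of bytes the string occupies once encoded with the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        final String message = "ABCDEFGHIJ";
        System.out.println("UTF-8    " + encodedLength(message, StandardCharsets.UTF_8));    // 10
        System.out.println("UTF-16LE " + encodedLength(message, StandardCharsets.UTF_16LE)); // 20
        System.out.println("UTF-16   " + encodedLength(message, StandardCharsets.UTF_16));   // 22 (includes BOM)
    }
}
```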


You can use JOL (Java Object Layout) to visualise the memory usage and confirm the calculation above. The output for the char[] is:

 OFFSET  SIZE  TYPE DESCRIPTION                    VALUE
      0     4       (object header)                01 00 00 00 (00000001 00000000 00000000 00000000) (1)
      4     4       (object header)                00 00 00 00 (00000000 00000000 00000000 00000000) (0)
      8     4       (object header)                41 00 00 f8 (01000001 00000000 00000000 11111000) (-134217663)
     12     4       (object header)                0a 00 00 00 (00001010 00000000 00000000 00000000) (10)
     16    20  char [C.<elements>                  N/A
     36     4       (loss due to the next object alignment)
Instance size: 40 bytes (reported by Instrumentation API)

So you can see the header of 12 (first 3 lines), the length (line 4), the chars (line 5) and the padding (line 6).

Similarly for the String (note that this excludes the size of the array itself):

 OFFSET  SIZE   TYPE DESCRIPTION                    VALUE
      0     4        (object header)                01 00 00 00 (00000001 00000000 00000000 00000000) (1)
      4     4        (object header)                00 00 00 00 (00000000 00000000 00000000 00000000) (0)
      8     4        (object header)                da 02 00 f8 (11011010 00000010 00000000 11111000) (-134216998)
     12     4 char[] String.value                   [A, B, C, D, E, F, G, H, I, J]
     16     4    int String.hash                    0
     20     4        (loss due to the next object alignment)
Instance size: 24 bytes (reported by Instrumentation API)

assylias answered Sep 29 '22 13:09


Each of your tests estimates the size of an object: a String object in the first case, a byte array object in the second, and a char array object in the third. Every object, as an instance of a class, may contain private fields and other bookkeeping data, so the best you can expect is a lower bound: a String of 10 chars contains at least those 10 chars at 2 bytes each, so its total size should be ≥ 20 bytes, which is consistent with your results.

For the byte/char comparison, your assumption is wrong: the byte array obtained from a String contains all the bytes for a given encoding, and your current encoding may use more than one byte per char.
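As an illustration of one char needing several bytes, the sketch below encodes single-character strings in UTF-8. The class and method names are hypothetical; the characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

public class MultiByteChars {
    // Number of UTF-8 bytes needed for the given string.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Each string below is exactly one char long (length() == 1).
        System.out.println("A      -> " + utf8Length("A"));      // 1 byte
        System.out.println("U+00E9 -> " + utf8Length("\u00e9")); // 2 bytes (e with acute accent)
        System.out.println("U+20AC -> " + utf8Length("\u20ac")); // 3 bytes (euro sign)
    }
}
```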

You may have a look at the Java source code for the Object and String classes, and at the array support in the JVM, to understand exactly what happens.

Jean-Baptiste Yunès answered Sep 29 '22 13:09