
ArrayList and HashSet memory allocation strange test results

I was inspired by this topic: Performance and Memory allocation comparison between List and Set, to actually run some tests and measure the performance difference between ArrayList and HashSet.

The most upvoted answer in the mentioned topic (link), which intrigued me a lot, says:

HashSet consumes about 5.5 times more memory than ArrayList for the same number of elements

With the help of ScalaMeter I wanted to verify this claim.

I made two simple tests, adding from 10,000 to 100,000 elements to both ArrayList and HashSet. Setting the initial size to the maximum did not change the results. I tested those collections with two element types:

  • Int (putting consecutive numbers 0 to 100000)
  • String (putting random String using Apache RandomStringUtils)
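The benchmark in the repository is written in Scala with ScalaMeter, but the filling step it measures can be sketched in plain Java for reference (class and variable names here are illustrative, not taken from the repository):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FillCollections {
    public static void main(String[] args) {
        int n = 100_000;
        // Pre-sizing the ArrayList, as in the test, avoids intermediate resizes.
        List<Integer> list = new ArrayList<>(n);
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            list.add(i);   // consecutive numbers, as in the Int test
            set.add(i);
        }
        System.out.println(list.size() + " " + set.size()); // 100000 100000
    }
}
```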

The code is available on my repository here.

And running those gave me these results:

  • X-axis (size) - size of the collection
  • Y-axis (value) - amount of kB used

For collections holding Int: [chart: Integer results]

For collections holding String of size 10: [chart: String results, size 10]

For collections holding String of size 50: [chart: String results, size 50]

The question:

What happened to the theory mentioned in the quoted answer? Is it false? Or is there some mistake on my side?

Thanks :)!

Update: after @andrzej's answer I have updated the code (and the repository) once again. The results are getting better, but they are still not 5.5x apart. I am checking something more now.

Atais asked Jul 28 '16

3 Answers

Please return the measured object from the benchmark snippet, so it is still alive when ScalaMeter measures the footprint:

measure method "Int" in {
  using(sizes) curve listS in { i =>
    val c = new util.ArrayList[Int](i)
    (0 until i).foreach(t => c.add(t))
    c // return c so its footprint can be measured
  }

  using(sizes) curve setS in { i =>
    val c = new util.HashSet[Int]()
    (0 until i).foreach(t => c.add(t))
    c // return c so its footprint can be measured
  }
}
Andrzej Jozwik answered Nov 03 '22


What happened to the theory mentioned in the quoted answer? Is it false?

We can do some calculations to get an estimate:

Let's look at the OpenJDK sources for ArrayList and HashMap (HashSet is just a wrapper around HashMap) for hints.

Assume you have n elements to store.

ArrayList

Elements are stored in the field transient Object[] elementData;, so the length of elementData must be at least n.
Suppose you instantiated the list with new ArrayList<>(n), so elementData.length is exactly n. Then the size of your list is n*c bytes (where c is the size of an object reference). Here I ignore the size field and the object header of the list itself.
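As a quick sanity check of this arithmetic (the value of c = 8 is the assumption made later in the summary: a 64-bit JVM with compressed oops ignored):

```java
public class ArrayListEstimate {
    public static void main(String[] args) {
        long n = 100_000; // number of elements
        long c = 8;       // assumed size of one object reference
        // A list created with new ArrayList<>(n) has elementData.length == n,
        // so the backing array alone costs about n * c bytes.
        System.out.println(n * c); // 800000
    }
}
```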

HashMap

HashMap stores elements in transient Node<K,V>[] table; where a node has fields

final int hash;
final K key;
V value;
Node<K,V> next;

Then for storing n elements you need n nodes, or n*(3*c + 4) bytes, i.e. each node has 3 object references (3*c bytes) and an int (4 bytes).
According to HashMap javadoc:

When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.

Based on this I'll estimate that table.length == 2*n.
Summing up, a HashMap requires n*2*c + n*(3*c + 4) = n*5*c + n*4 bytes.

Summary

Now let's suppose you have a 64-bit JVM and the size of an object reference is 8 bytes, i.e. c = 8 (let's ignore things like compressed oops). Then n*5*c + n*4 = n*5*8 + n*4 = n*44 and n*c = n*8.
Finally, n*44 / (n*8) = 5.5.
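The whole estimate can be checked with a few lines of Java; the constants mirror the assumptions above (c = 8, table.length == 2*n):

```java
public class FootprintRatio {
    public static void main(String[] args) {
        long n = 100_000;
        long c = 8; // assumed reference size: 64-bit JVM, compressed oops ignored
        long arrayListBytes = n * c;          // backing Object[] of length n
        long hashMapBytes = 2 * n * c         // table with ~2n buckets
                          + n * (3 * c + 4);  // n nodes: 3 references + one int
        System.out.println(arrayListBytes);                         // 800000
        System.out.println(hashMapBytes);                           // 4400000
        System.out.println((double) hashMapBytes / arrayListBytes); // 5.5
    }
}
```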

So the original theory, that HashSet consumes about 5.5 times more memory than ArrayList, seems quite plausible, and it seems likely that something is not right with your measurements.

binoternary answered Nov 03 '22


I think there are two problems here:

  1. As Andrzej mentioned, you don't return your collections from the benchmark snippets. ScalaMeter measures footprint by performing GC before and after the benchmark execution (find details here). If you don't return the collection, it is simply removed from memory by the after-test GC and the test results are useless. That explains why the memory footprints in your tests remain small (around four bytes per object) and don't differ. But it doesn't explain why the footprint grows as the collection size grows, which brings us to the second problem.

  2. Some garbage collectors (CMS and G1 in particular) don't guarantee that all dead objects are removed from memory after a garbage collection. If your JVM chooses one of these collectors (or if you specify one manually), that would explain the memory footprint uptrend. You can check which collector is in use by passing the -XX:+PrintFlagsFinal option to your test and looking at the values of the UseG1GC and UseConcMarkSweepGC flags.
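Besides -XX:+PrintFlagsFinal, you can also query the selected collector from inside the JVM via the standard management API; the reported names depend on the collector (e.g. G1 shows up as "G1 Young Generation" / "G1 Old Generation"):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class ShowGc {
    public static void main(String[] args) {
        // Prints one name per registered collector in the running JVM.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName());
        }
    }
}
```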

Andrew Lygin answered Nov 03 '22