
collection.mutable.OpenHashMap vs collection.mutable.HashMap

For put and get operations, OpenHashMap outperforms HashMap by about 5 times: https://gist.github.com/1423303

Are there any cases when HashMap should be preferred over OpenHashMap?

Asked Dec 07 '11 by Andriy Plokhotnyuk

1 Answer

Your code exactly matches one of the use cases for OpenHashMap. Your code:

println("scala OpenHashMap: " + time(warmup) {
  val m = new scala.collection.mutable.OpenHashMap[Int, Int]
  var i = 0
  val start = System.currentTimeMillis()
  while (i < 100000) { m.put(i, i); i += 1 }
})

The explanation for OpenHashMap (scaladoc):

A mutable hash map based on an open hashing scheme. The precise scheme is undefined, but it should make a reasonable effort to ensure that an insert with consecutive hash codes is not unnecessarily penalised. In particular, mappings of consecutive integer keys should work without significant performance loss.

My emphasis is on the last sentence, which explains your findings. When should you use OpenHashMap rather than HashMap? See the Wikipedia article on hash tables. From there:

Chained hash tables with linked lists are popular because they require only basic data structures with simple algorithms, and can use simple hash functions that are unsuitable for other methods.

The cost of a table operation is that of scanning the entries of the selected bucket for the desired key. If the distribution of keys is sufficiently uniform, the average cost of a lookup depends only on the average number of keys per bucket—that is, on the load factor.

Chained hash tables remain effective even when the number of table entries n is much higher than the number of slots. Their performance degrades more gracefully (linearly) with the load factor. For example, a chained hash table with 1000 slots and 10,000 stored keys (load factor 10) is five to ten times slower than a 10,000-slot table (load factor 1); but still 1000 times faster than a plain sequential list, and possibly even faster than a balanced search tree.
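The load-factor arithmetic in that paragraph can be checked directly. A minimal sketch (the `loadFactor` helper is mine, not from any library):

```scala
// Load factor = stored entries / table slots; under separate chaining the
// average bucket holds that many entries, so average lookup cost grows
// linearly with it.
def loadFactor(entries: Int, slots: Int): Double = entries.toDouble / slots

val crowded = loadFactor(10000, 1000)  // the quoted 1000-slot table
val roomy   = loadFactor(10000, 10000) // the quoted 10,000-slot table
println(s"crowded: $crowded, roomy: $roomy") // crowded: 10.0, roomy: 1.0
```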

For separate chaining, the worst-case scenario is when all entries are inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure. If the latter is a linear list, the lookup procedure may have to scan all its entries, so the worst-case cost is proportional to the number n of entries in the table.
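That worst case is easy to provoke: give every key the same hash code and the whole table degenerates into a single bucket. A sketch (`BadKey` is a made-up type for illustration):

```scala
import scala.collection.mutable

// Every BadKey hashes to the same value, so a chained HashMap places all
// entries in one bucket; lookups still succeed, but each one scans the
// whole chain (O(n) instead of O(1)).
final case class BadKey(i: Int) {
  override def hashCode: Int = 42 // deliberately terrible hash
}

val m = mutable.HashMap.empty[BadKey, Int]
(0 until 1000).foreach(i => m.put(BadKey(i), i))
println(m(BadKey(500))) // prints 500, found by scanning the single chain
```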

The Wikipedia extracts above are a generic explanation. As ever with these things, performance will vary with the use case; if you care about it, you need to measure it.
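One way to measure it for the consecutive-integer-key case is a side-by-side timing of the two maps. A minimal sketch (timings vary with machine and JVM warm-up, so take single runs with a grain of salt; note also that OpenHashMap is deprecated as of Scala 2.13):

```scala
import scala.collection.mutable

// Fill a mutable map with n consecutive Int keys; return elapsed nanos.
def fill(m: mutable.Map[Int, Int], n: Int): Long = {
  val start = System.nanoTime()
  var i = 0
  while (i < n) { m.put(i, i); i += 1 }
  System.nanoTime() - start
}

val openNanos    = fill(new mutable.OpenHashMap[Int, Int], 100000)
val chainedNanos = fill(new mutable.HashMap[Int, Int], 100000)
println(s"OpenHashMap: ${openNanos}ns, HashMap: ${chainedNanos}ns")
```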

Answered Nov 02 '22 by Matthew Farwell