 

How can I evaluate a hash table implementation? (Using HashMap as reference)

Problem:

  • I need to compare 2 hash table implementations (well basically HashMap with another one) and make a reasonable conclusion.

  • I am not interested in 100% accuracy but just being in the right direction in my estimation.

  • I am interested in the difference not only per operation but mainly on the hashtable as a "whole".

  • I don't have a strict requirement on speed, so if the other implementation is reasonably slower I can accept it, but I do expect/require that the memory usage be better (since one of the hash tables is backed by a primitive table).

What I did so far:

Originally I created my own custom "benchmark" with loops and many calls hinting the GC, to get a feeling for the difference, but I am reading online that using a standard tool is more reliable/appropriate.
Example of my approach (HashMapInterface is just a wrapper so I can switch among implementations):

Integer[] keys = new Integer[10000000];
String[] values = new String[10000000];
for(int i = 0; i < keys.length; ++i) {
   keys[i] = i;
   values[i] = String.valueOf(i);
}

if(operation.equals("put")) {
   runOperation(map, keys, values);
}

public long[] runOperation(HashMapInterface<String> map, Integer[] keys, String[] values) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    long run = 0;
    for(int r = 0; r < 10; ++r) {
        long start = System.currentTimeMillis();
        for(int i = 0; i < keys.length; ++i) {
            map.put(keys[i], values[i]);
        }
        long total = System.currentTimeMillis() - start;
        System.out.println(total / 1000d + " seconds");
        if(total < min) {
            min = total;
        }
        if(total > max) {
            max = total;
        }
        run += total;
        map = null;
        map = createNewHashMap();
        hintsToGC();
    }
    return new long[] {min, max, run};
}


public void hintsToGC() {
    for(int i = 0; i < 20; ++i) {
        System.out.print(". ");
        System.gc();
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}


private HashMapInterface<String> createNewHashMap() {  
    if(jdk) {  
        return new JDKHashMapWrapper<String>();  
    }  
    else {
        return new AlternativeHashMapWrapper<String>();   
    }  
 }  



public class JDKHashMapWrapper implements HashMapInterface<String>  {
    HashMap<Integer, String> hashMap;         
    JDKHashMapWrapper() {   
       hashMap = new HashMap<Integer, String>();  
    }  
    public String put(Integer key, String value)  {
       return hashMap.put(key, value);  
    }  
 //etc  
}

(I want to test put, get, containsKey and the memory utilization.)
Can I be sure, using my approach, that I get reasonable measurements?
If not, what would be the most appropriate tool to use, and how?
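
For reference, the "standard tool" usually recommended for JVM micro-benchmarks is JMH. Below is a minimal, hypothetical sketch of what the put test could look like with it (it assumes the org.openjdk.jmh dependencies are available and uses the raw HashMap rather than my wrappers, so treat it as a starting point rather than my exact harness):

import java.util.HashMap;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

// Hypothetical JMH benchmark: JMH takes care of warm-up, iteration timing
// and dead-code elimination, which my hand-rolled loops do not.
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class MapPutBenchmark {

    @Param({"1000000", "10000000"})   // number of entries per run
    int size;

    Integer[] keys;
    String[] values;

    @Setup(Level.Trial)
    public void prepare() {
        keys = new Integer[size];
        values = new String[size];
        for (int i = 0; i < size; ++i) {
            keys[i] = i;
            values[i] = String.valueOf(i);
        }
    }

    @Benchmark
    public HashMap<Integer, String> putAll() {
        HashMap<Integer, String> map = new HashMap<Integer, String>();
        for (int i = 0; i < keys.length; ++i) {
            map.put(keys[i], values[i]);
        }
        return map;   // return the map so JMH cannot optimize the loop away
    }
}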

Update:
- I also test with random numbers (also ~10M random numbers) using SecureRandom.
- When the hash table resizes I print the ratio of the logical size of the hash table to the size of the actual backing table, to get the load factor

Update:
For my specific case, where I am also interested in integer keys, what kind of pitfalls are there with my approach?

UPDATE after @dimo414 comments:

Well at a minimum the hashtable as a "whole" isn't meaningful

I mean how the hash table behaves under various loads, both in running time and in memory consumption.

Every data structure is a tradeoff of different methods

I agree. My trade-off is an acceptable access penalty for memory improvement

You need to identify what features you're interested in verifying

1) put(key, value);
2) get(key);
3) containsKey(key);
4) all the above when having many entries in the hash table
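
Since memory usage is my main requirement, I am also thinking of a rough heap-delta check in addition to the timing runs. This is only a sketch (it reuses my HashMapInterface wrapper and the hintsToGC()/createNewHashMap() helpers above, and is just an approximation; an instrumentation agent or a heap dump would be more precise):

// Rough estimate of the heap consumed by a populated map:
// measure used heap before and after building it, with GC hints in between.
public long approximateRetainedBytes(int entries) {
    Runtime rt = Runtime.getRuntime();
    hintsToGC();
    long before = rt.totalMemory() - rt.freeMemory();

    HashMapInterface<String> map = createNewHashMap();
    for (int i = 0; i < entries; ++i) {
        map.put(i, String.valueOf(i));
    }

    hintsToGC();
    long after = rt.totalMemory() - rt.freeMemory();
    // touch the map after the second measurement so it stays strongly reachable
    map.put(0, "0");
    return after - before;
}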

Asked Oct 31 '22 by Cratylus


1 Answer

Some key considerations for using hash tables are the number of buckets allocated, the collision resolution strategy, and the shape of your data. Essentially, a hash table takes the key supplied by the application and hashes it to an index into the allocated buckets. When two key values hash to the same bucket, the implementation has to resolve the collision and return the right value. For example, one could keep a sorted linked list for each bucket and search that list.

If your data happens to have a lot of collisions, then your performance will suffer, because the Hash table implementation will spend too much time resolving the collision. On the other hand, if you have a very large number of buckets, you solve the collision problem at the expense of memory. Also, Java's built-in HashMap implementation will "rehash" if the number of entries gets larger than a certain amount - I imagine this is an expensive operation that is worth avoiding.

Since your key data is the integers from 0 to ~10M, your test data looks good. I would also ensure that the different hash table implementations were initialized to the same bucket size for a given test, otherwise it's not a fair comparison. Finally, I would vary the bucket size over a pretty significant range and rerun the tests to see how the implementations changed their behavior.
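
As a concrete illustration of the "same bucket size" point, the JDK map lets you fix both the initial capacity and the load factor up front (just a sketch; it assumes the alternative implementation exposes a comparable constructor):

int expectedEntries = 10000000;
float loadFactor = 0.75f;
// size the table so it never needs to resize/rehash while the test inserts expectedEntries
int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
HashMap<Integer, String> jdkMap = new HashMap<Integer, String>(initialCapacity, loadFactor);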

Answered Nov 15 '22 by schtever