Is there a way to test the quality of a hash function? I want a good spread when the function is used in a hash table, and it would be great if this were verifiable in a unit test.
EDIT: For clarification, my problem was that I have used long values in Java in such a way that the first 32 bits encode one ID and the second 32 bits encode another ID. Unfortunately, Java's hash of a long value just XORs the first 32 bits with the second 32 bits, which in my case led to very poor performance when used in a HashMap. So I need a different hash, and I would like to have a unit test so that this problem cannot creep back in.
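For illustration, here is a minimal sketch of that collision behaviour, assuming the two IDs are packed into the high and low 32 bits of the long (the packing scheme and the sample ID values are just placeholders):

    // Sketch of the problem described above: Long.hashCode() is
    // (int)(value ^ (value >>> 32)), so packed ID pairs collide whenever the
    // two halves cancel out in the same way.
    public class XorCollisionDemo {
        static long pack(int idA, int idB) {
            return ((long) idA << 32) | (idB & 0xFFFFFFFFL);
        }

        public static void main(String[] args) {
            Long k1 = pack(1, 2);
            Long k2 = pack(2, 1);    // same IDs, swapped
            Long k3 = pack(100, 100);
            Long k4 = pack(7, 7);    // any pair with equal halves hashes to 0

            System.out.println(k1.hashCode() + " == " + k2.hashCode()); // 3 == 3
            System.out.println(k3.hashCode() + " == " + k4.hashCode()); // 0 == 0
        }
    }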
Hashing is the cryptographic term for generating a fixed-size fingerprint from specific contents; for a good hash function it is computationally infeasible to find two different inputs with the same fingerprint. In forensic work the specific contents can be a single file or an entire drive.
In cryptography, collision resistance is a property of cryptographic hash functions: a hash function H is collision-resistant if it is hard to find two inputs that hash to the same output; that is, two inputs a and b where a ≠ b but H(a) = H(b).
Suppose you take each x in a set S and map it into [0, 1000000) by computing x % 1000000. You will find that many numbers in S map to the same value: every number of the form k * 1000000 + y lands on y, because (k * 1000000 + y) % 1000000 = y. That is a hash collision.
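A quick sketch of that collision pattern, assuming a table of 1,000,000 buckets and modulo hashing as above:

    // Every value of the form k * 1_000_000 + y falls into the same bucket y,
    // because (k * 1_000_000 + y) % 1_000_000 == y.
    public class ModuloCollisionDemo {
        public static void main(String[] args) {
            int buckets = 1_000_000;
            long y = 42;
            for (long k = 0; k < 5; k++) {
                long x = k * buckets + y;
                System.out.println(x + " -> bucket " + (x % buckets)); // always 42
            }
        }
    }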
A hash procedure must be deterministic—meaning that for a given input value it must always generate the same hash value. In other words, it must be a function of the data to be hashed, in the mathematical sense of the term.
You have to test your hash function using data drawn from the same (or similar) distribution that you expect it to work on. When looking at hash functions on 64-bit longs, the default Java hash function is excellent if the input values are drawn uniformly from all possible long values.
However, you've mentioned that your application uses the long to store essentially two independent 32-bit values. Try to generate a sample of values similar to the ones you expect to actually use, and then test with that.
For the test itself, take your sample input values, hash each one, and put the results into a set. Compare the size of the resulting set to the size of the input set; the difference is the number of collisions your hash function is generating.
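Here is a minimal sketch of such a test; sampleKeys(), the hash under test, and the acceptable collision threshold are placeholders you would replace with your own data and function:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Collision-counting check: hash every distinct sample key and compare the
    // number of distinct hashes to the number of distinct keys.
    public class HashSpreadTest {

        static int hash(long key) {
            // the hash function under test; Long.hashCode shown as an example
            return Long.hashCode(key);
        }

        public static void main(String[] args) {
            Set<Long> keys = new HashSet<>(sampleKeys()); // representative sample
            Set<Integer> hashes = new HashSet<>();
            for (long key : keys) {
                hashes.add(hash(key));
            }
            int collisions = keys.size() - hashes.size();
            System.out.println("collisions: " + collisions);
            // in a unit test: assertTrue(collisions <= acceptableThreshold);
        }

        static List<Long> sampleKeys() {
            // placeholder: generate or load keys drawn from your real distribution
            return List.of((1L << 32) | 2L, (2L << 32) | 1L, (3L << 32) | 3L);
        }
    }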
For your particular application, instead of simply XORing them together, try combining the 32-bit values the way a typical good hash function would combine two independent ints: multiply by a prime, then add.
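A sketch of such a combiner, assuming the first ID sits in the high 32 bits and the second in the low 32 bits (the multiplier 31 is the conventional choice, used for example by String.hashCode):

    // Prime-multiply-and-add combiner for a long that packs two independent
    // 32-bit IDs (high bits = first ID, low bits = second ID).
    public final class PackedIdHash {

        static int hash(long packed) {
            int hi = (int) (packed >>> 32);   // first ID
            int lo = (int) packed;            // second ID
            return 31 * hi + lo;              // multiply by a prime, then add
        }

        public static void main(String[] args) {
            long a = (1L << 32) | 2L;   // IDs (1, 2)
            long b = (2L << 32) | 1L;   // IDs (2, 1) -- collided under plain XOR
            System.out.println(hash(a) + " vs " + hash(b)); // 33 vs 63: no collision
        }
    }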
First, I think you have to define for yourself what you mean by a good spread. Do you mean a good spread for all possible input, or just a good spread for likely input?
For example, if you're hashing strings that represent proper full (first + last) names, you're not likely to care about how strings of numeric ASCII characters hash.
As for testing, your best bet is probably to take a large, representative set of the input data you expect, push it through the hash function, and see how the spread ends up. There's not likely to be a magic program that can say "Yes, this is a good hash function for your use case." However, if you can programmatically generate the input data, you should easily be able to create a unit test that generates a significant amount of it and then verifies that the spread is within your definition of good.
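As a sketch of such a unit test, the following generates a sample and checks the worst bucket load; the sample generator, bucket count, hash under test, and tolerance are all placeholders for your own input model and definition of "good":

    import java.util.Random;

    // Spread check: hash a generated sample, map each hash onto a fixed number
    // of buckets (power-of-two masking, as HashMap does for indexing), and
    // report how overloaded the worst bucket is relative to a uniform spread.
    public class SpreadCheck {
        public static void main(String[] args) {
            int bucketCount = 1 << 16;
            int samples = 1_000_000;
            int[] buckets = new int[bucketCount];
            Random rnd = new Random(42);

            for (int i = 0; i < samples; i++) {
                // placeholder input model: two independent IDs in [0, 100_000)
                long key = ((long) rnd.nextInt(100_000) << 32) | rnd.nextInt(100_000);
                int h = 31 * (int) (key >>> 32) + (int) key;   // hash under test
                buckets[h & (bucketCount - 1)]++;
            }

            int max = 0;
            for (int count : buckets) max = Math.max(max, count);
            double expected = (double) samples / bucketCount;
            System.out.printf("expected per bucket: %.1f, worst bucket: %d%n", expected, max);
            // in a unit test: assertTrue(max < expected * tolerance);
        }
    }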
Edit: In your case, with a 64-bit long, is there even really a reason to use a hash map? Why not just use a balanced tree, with the long as the key directly rather than rehashing it? You pay a small penalty in overall node size (roughly 2x for the key value), but may end up gaining it back in performance.
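If you go that route, a minimal sketch using java.util.TreeMap, which orders on the long key directly and involves no hashing at all (the keys and values are illustrative):

    import java.util.TreeMap;

    // Balanced-tree alternative: TreeMap keys on the long's natural ordering,
    // so the quality of any hash function is irrelevant.
    public class TreeMapAlternative {
        public static void main(String[] args) {
            TreeMap<Long, String> byPackedId = new TreeMap<>();
            byPackedId.put((1L << 32) | 2L, "first record");
            byPackedId.put((2L << 32) | 1L, "second record");
            System.out.println(byPackedId.get((2L << 32) | 1L)); // second record
        }
    }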