Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

hashCode, implementation, and relation to HashMap

So I asked another related question here: java string hash function with avalanche effect, but I have a different, related question now.

What I established in that question was that the hashCode() function for String does not have an avalanche effect. This means, for example, that if I have strings "k1", "k2", "k3", and I call hashCode() on each, the values returned will be contiguous.

Now, based on my recollection of data structures 101, I was under the impression that this is a bad thing. Because assuming that HashMap chooses buckets by an algorithm something like:

class HashMap {
    private int capacity;
    private int chooseBucket(String key) {
        return key.hashCode() % capacity;
    }
}

It would mean that similar keys are stored in contiguous buckets, leading to a higher rate of collisions, degrading big-O lookup time from O(1) to be...who knows how bad...maybe worse than O(log n).

The types of answers I got to my first question were along the lines of 'avalanche effect isn't needed here', 'it's only for cryptography hash functions', and 'the hashCode implementation for strings is fast and works well for small hash maps'.

Which confuses me. All data structures are fast when they're small. Wouldn't Sun provide a default hashCode function that will work well for large data sets? That's when the performance of HashMap really matters anyway, isn't it?

Or, am I missing something? Please enlighten me.

like image 266
Kevin Avatar asked Mar 04 '11 02:03

Kevin


1 Answers

Storing keys in contiguous buckets does not cause performance degradation. Storing keys in the same bucket (e.g., chaining) does. When using chaining to resolve hash collisions:

  • Worst-case scenario: that every hash value is the same, so all elements end up in the same bucket, in which case you get O(n) performance (assuming the chains are linked lists)
  • Best-case scenario: every hash value is different, so each element ends up in a different bucket, so you get the expected O(1) performance.

Hash codes for use in hash tables (and the like) do not need an avalanche effect.

like image 73
Matt Ball Avatar answered Oct 19 '22 21:10

Matt Ball