Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java HashMap detect collision

Is there a way to detect collision in Java Hash-map ? Can any one point out some situation's where lot of collision's can take place. Of-course if you override the hashcode for an object and simply return a constant value collision is sure to occur.I'm not talking about that.I want to know in what all situations other that the previously mentioned do huge number of collisions occur without modifying the default hashcode implementation.

like image 615
Emil Avatar asked Aug 11 '10 05:08

Emil


People also ask

How does HashMap detect collision in Java?

So for a collision to occur in a HashMap, the necessary and sufficient condition is the following : scramble(k1. hashCode()) == scramble(k2. hashCode()) . This is always true if k1.

What happens if there is a collision in HashMap?

A collision, or more specifically, a hash code collision in a HashMap, is a situation where two or more key objects produce the same final hash value and hence point to the same bucket location or array index.

How collision is avoided in HashMap?

The only way to avoid (or rather minimize) collisions is to create a hash function that creates the best possible distribution of values throughout the HashMap. Depending on the density of your HashMap and the quality of your hash code , collisions are almost inevitable, hence the need to override the two methods.

What happens in a Java HashMap when 2 keys hash to the same location?

It's perfectly legal for two unequal objects to have the same hash code. It's used by HashMap as a "first pass filter" so that the map can quickly find possible entries with the specified key. The keys with the same hash code are then tested for equality with the specified key.


2 Answers

I have created a project to benchmark these sort of things: http://code.google.com/p/hashingbench/ (For hashtables with chaining, open-addressing and bloom filters).

Apart from the hashCode() of the key, you need to know the "smearing" (or "scrambling", as I call it in that project) function of the hashtable. From this list, HashMap's smearing function is the equivalent of:

public int scramble(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

So for a collision to occur in a HashMap, the necessary and sufficient condition is the following : scramble(k1.hashCode()) == scramble(k2.hashCode()). This is always true if k1.hashCode() == k2.hashCode() (otherwise, the smearing/scrambling function wouldn't be a function), so that's a sufficient, but not necessary condition for a collision to occur.

Edit: Actually, the above necessary and sufficient condition should have been compress(scramble(k1.hashCode())) == compress(scramble(k2.hashCode())) - the compress function takes an integer and maps it to {0, ..., N-1}, where N is the number of buckets, so it basically selects a bucket. Usually, this is simply implemented as hash % N, or when the hashtable size is a power of two (and that's actually a motivation for having power-of-two hashtable sizes), as hash & N (faster). ("compress" is the name Goodrich and Tamassia used to describe this step, in the Data Structures and Algorithms in Java). Thanks go to ILMTitan for spotting my sloppiness.

Other hashtable implementations (ConcurrentHashMap, IdentityHashMap, etc) have other needs and use another smearing/scrambling function, so you need to know which one you're talking about.

(For example, HashMap's smearing function was put into place because people were using HashMap with objects with the worst type of hashCode() for the old, power-of-two-table implementation of HashMap without smearing - objects that differ a little, or not at all, in their low-order bits which were used to select a bucket - e.g. new Integer(1 * 1024), new Integer(2 * 1024) *, etc. As you can see, the HashMap's smearing function tries its best to have all bits affect the low-order bits).

All of them, though, are meant to work well in common cases - a particular case is objects that inherit the system's hashCode().

PS: Actually, the absolutely ugly case which prompted the implementors to insert the smearing function is the hashCode() of Floats/Doubles, and the usage as keys of values: 1.0, 2.0, 3.0, 4.0 ..., all of them having the same (zero) low-order bits. This is the related old bug report: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4669519

like image 69
Dimitris Andreou Avatar answered Oct 05 '22 14:10

Dimitris Andreou


Simple example: hashing a Long. Obviously there are 64 bits of input and only 32 bits of output. The hash of Long is documented to be:

(int)(this.longValue()^(this.longValue()>>>32))

i.e. imagine it's two int values stuck next to each other, and XOR them.

So all of these will have a hashcode of 0:

0
1L | (1L << 32)
2L | (2L << 32)
3L | (3L << 32)

etc

I don't know whether that counts as a "huge number of collisions" but it's one example where collisions are easy to manufacture.

Obviously any hash where there are more than 232 possible values will have collisions, but in many cases they're harder to produce. For example, while I've certainly seen hash collisions on String using just ASCII values, they're slightly harder to produce than the above.

like image 39
Jon Skeet Avatar answered Oct 05 '22 14:10

Jon Skeet