Is there a way to detect collision in Java Hash-map ? Can any one point out some situation's where lot of collision's can take place. Of-course if you override the hashcode for an object and simply return a constant value collision is sure to occur.I'm not talking about that.I want to know in what all situations other that the previously mentioned do huge number of collisions occur without modifying the default hashcode implementation.
So for a collision to occur in a HashMap, the necessary and sufficient condition is the following : scramble(k1. hashCode()) == scramble(k2. hashCode()) . This is always true if k1.
A collision, or more specifically, a hash code collision in a HashMap, is a situation where two or more key objects produce the same final hash value and hence point to the same bucket location or array index.
The only way to avoid (or rather minimize) collisions is to create a hash function that creates the best possible distribution of values throughout the HashMap. Depending on the density of your HashMap and the quality of your hash code , collisions are almost inevitable, hence the need to override the two methods.
It's perfectly legal for two unequal objects to have the same hash code. It's used by HashMap as a "first pass filter" so that the map can quickly find possible entries with the specified key. The keys with the same hash code are then tested for equality with the specified key.
I have created a project to benchmark these sort of things: http://code.google.com/p/hashingbench/ (For hashtables with chaining, open-addressing and bloom filters).
Apart from the hashCode() of the key, you need to know the "smearing" (or "scrambling", as I call it in that project) function of the hashtable. From this list, HashMap's smearing function is the equivalent of:
public int scramble(int h) {
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
So for a collision to occur in a HashMap, the necessary and sufficient condition is the following : scramble(k1.hashCode()) == scramble(k2.hashCode())
. This is always true if k1.hashCode() == k2.hashCode()
(otherwise, the smearing/scrambling function wouldn't be a function), so that's a sufficient, but not necessary condition for a collision to occur.
Edit: Actually, the above necessary and sufficient condition should have been compress(scramble(k1.hashCode())) == compress(scramble(k2.hashCode()))
- the compress
function takes an integer and maps it to {0, ..., N-1}
, where N
is the number of buckets, so it basically selects a bucket. Usually, this is simply implemented as hash % N
, or when the hashtable size is a power of two (and that's actually a motivation for having power-of-two hashtable sizes), as hash & N
(faster). ("compress" is the name Goodrich and Tamassia used to describe this step, in the Data Structures and Algorithms in Java). Thanks go to ILMTitan for spotting my sloppiness.
Other hashtable implementations (ConcurrentHashMap, IdentityHashMap, etc) have other needs and use another smearing/scrambling function, so you need to know which one you're talking about.
(For example, HashMap's smearing function was put into place because people were using HashMap with objects with the worst type of hashCode() for the old, power-of-two-table implementation of HashMap without smearing - objects that differ a little, or not at all, in their low-order bits which were used to select a bucket - e.g. new Integer(1 * 1024)
, new Integer(2 * 1024)
*, etc. As you can see, the HashMap's smearing function tries its best to have all bits affect the low-order bits).
All of them, though, are meant to work well in common cases - a particular case is objects that inherit the system's hashCode().
PS: Actually, the absolutely ugly case which prompted the implementors to insert the smearing function is the hashCode() of Floats/Doubles, and the usage as keys of values: 1.0, 2.0, 3.0, 4.0 ..., all of them having the same (zero) low-order bits. This is the related old bug report: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4669519
Simple example: hashing a Long
. Obviously there are 64 bits of input and only 32 bits of output. The hash of Long
is documented to be:
(int)(this.longValue()^(this.longValue()>>>32))
i.e. imagine it's two int
values stuck next to each other, and XOR them.
So all of these will have a hashcode of 0:
0
1L | (1L << 32)
2L | (2L << 32)
3L | (3L << 32)
etc
I don't know whether that counts as a "huge number of collisions" but it's one example where collisions are easy to manufacture.
Obviously any hash where there are more than 232 possible values will have collisions, but in many cases they're harder to produce. For example, while I've certainly seen hash collisions on String
using just ASCII values, they're slightly harder to produce than the above.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With