Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What property of the bit pattern is it that causes collisions?

I was reading about Java's approach to randomize hash keys here
Apparently the idea is to make sure that the lower bits are "random" to help the distribution but I am trying to understand this more.
So if we have a table of size 10 then the numbers 0,10,20,30,40 etc all fall in bucket 0, the numbers 1,11,21,31 etc fall in bucket 1 etc (using modulo 10).
So changing the bit patterns you could make these go to different buckets instead of all going to bucket 0.
But what I am not clear about is what property is it that makes the low order bits affect this and we need to randomize them. So we have:

0000 0000 (0)  
0000 1010 (10)  
0001 0100 (20) 
0001 1110 (30)  
0010 1000 (40) 

What is the regularity in the low order bits that makes them placed to the same slot?
Perhaps I am confused on the following? My understanding is that it is some regularity in low order bits that cause collisions and we try to randomize bit to compensate

like image 429
Jim Avatar asked Jun 04 '15 18:06

Jim


People also ask

What causes a hash collision?

Definition: A collision occurs when more than one value to be hashed by a particular hash function hash to the same slot in the table or data structure (hash table) being generated by the hash function.

What causes a collision in Java?

A collision will occur when two different keys have the same hashCode, which can happen because two unequal objects in Java can have the same hashCode.

How does collision affect hashing?

In computer science, a hash collision or clash is when two pieces of data in a hash table share the same hash value. The hash value in this case is derived from a hash function which takes a data input and returns a fixed length of bits.

What is hash collision example?

Hash Collisions For example, assume a hash function h(text) sums of all character codes in a text. It will produce the same hash value (collision) for texts holding the same letters in different order, i.e. h('abc') == h('cab') == h('bca') .


1 Answers

Some hash functions do a really bad job of randomizing low order bits.

One classic case is the use of hardware addresses as a hash for object references ("pointers" in C), which would otherwise be a reasonable way of cheaply getting a unique number for an object-id. This would work fine if the hash table's bucket count were a prime number, but for hash implementations where the number of buckets is always a power of 2, the fact that all the hashes are divisible by 8 would mean that most buckets were empty.

That's an extreme case, but any time that the data to be hashed is not uniformly distributed and the hash function tends to preserve low-order bits, you'll find some bias in the bucket assignments.

like image 55
rici Avatar answered Sep 28 '22 01:09

rici