Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hash table: why size should be prime? [duplicate]

Possible Duplicate:
Why should hash functions use a prime number modulus?

Why is it necessary for a hash table's (the data structure) size to be a prime?

From what I understand, it assures a more even distribution but is there any other reason?

like image 690
Olivier Lalonde Avatar asked Oct 20 '10 16:10

Olivier Lalonde


1 Answers

The only reason is to avoid clustering of values into a small number of buckets (yes, distribution). A more even distributed hashtable will perform more consistently.

from http://srinvis.blogspot.com/2006/07/hash-table-lengths-and-prime-numbers.html

If suppose your hashCode function results in the following hashCodes among others {x , 2x, 3x, 4x, 5x, 6x...}, then all these are going to be clustered in just m number of buckets, where m = table_length/GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this). Now you can do one of the following to avoid clustering

  1. Make sure that you don't generate too many hashCodes that are multiples of another hashCode like in {x, 2x, 3x, 4x, 5x, 6x...}.But this may be kind of difficult if your hashTable is supposed to have millions of entries.

  2. Or simply make m equal to the table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e by making table_length coprime with x. And if x can be just about any number then make sure that table_length is a prime number.

Update: (from original answer author)

This answer is correct for a common implementation of a hash table, including the Java implementation of the original Hashtable as well as the current implementation of .NET's Dictionary.

Both the answer and the supposition that capacity should be prime are not accurate for Java's HashMap though. The implementation of a HashMap is very different and utilizes a table with a size of base 2 to store buckets and uses n-1 & hash to calculate which bucket to use as opposed to the more traditional hash % n formula.

Java's HashMap will force the actual used capacity to be the next largest base 2 number above the requested capacity.

Compare Hashtable:

int index = (hash & 0x7FFFFFFF) % tab.length 

https://github.com/openjdk/jdk/blob/jdk8-b120/jdk/src/share/classes/java/util/Hashtable.java#L364

To HashMap:

first = tab[(n - 1) & hash] 

https://github.com/openjdk/jdk/blob/jdk8-b120/jdk/src/share/classes/java/util/HashMap.java#L569

like image 66
Samuel Neff Avatar answered Nov 10 '22 04:11

Samuel Neff