Say I have a population of key-value pairs which I plan to store in a hash table. The population is fixed and will never change. What optimizations are available to me to make the hash table as fast as possible? Which optimizations should I concentrate on? This is assuming I have a lot of space. There will be a reasonable number of pairs (say no more than 100,000).
EDIT: I want to optimize look up. I don't care how long it takes to build.
The simplest and most obvious improvement would be to increase the number of buckets in the hash table to something like 1.2 million -- at least assuming your hash function can generate numbers in that range (which it typically will).
But a good general “rule of thumb” is: The hash table should be an array with length about 1.3 times the maximum number of keys that will actually be in the table, and. Size of hash table array should be a prime number.
In general, a hash function should depend on every single bit of the key, so that two keys that differ in only one bit or one group of bits (regardless of whether the group is at the beginning, end, or middle of the key or present throughout the key) hash into different values.
Hash tables often avoid this problem by making sure that the hash table size is a prime number. When you resize the table, double the size and then round up to the first prime number larger than that. Doing this avoids the clustering problems similar to what you describe.
I would make sure that your key's hash to unique values. This will ensure that every lookup will be constant time, and thus, as fast as possible.
Since you can never have more than 100,000 keys, it is entirely possible to have 100,000 hash values.
Also, make sure that you use the constructor that takes an int
to specify the initial capacity (Set it to 100,000), and a float to set the load factor. (Use 1
) Also, doing this requires that you have a perfect hash function for your keys. But, this will result in the fastest possible lookup, in the least amount of memory.
In general, to optimize a hash table, you want to minimize collisions in the determination of your hash, so your buckets won't contain more than one item and the hash-search will return immediately.
Most of the time, that means that you should measure the output of your hash function on the problem space. So i guess i'd recommend looking into that
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With