If I notice that a hash table (or any other data structure built on a hash table) is filling up, at what point should I build a new table with more buckets? And given n items in the table so far, how do I figure out how many buckets to use in the new one?
So let's say I have 100 buckets. Should I reorganize it when there are 50 items in it? 500? 5000? Or should I look for the most-full bucket and key on that? Then when I hit that point how big do I make the new hash table?
Related to this, if you know beforehand roughly how many items will go in, is there a way to compute the number of buckets to get a good average performance?
I know the real answer depends on a lot of other considerations, like how important speed is vs. size in a specific case, but I'm looking for general guidelines.
I also know that I shouldn't be optimizing this sort of thing unless good profiling has indicated that this is a bottleneck. I'm just thinking about a project that would use a lot of hash tables and wondered how to approach this.
The initial capacity is essentially the number of buckets in the HashMap, which by default is 2^4 = 16. A good HashMap implementation will distribute elements evenly across the buckets. Say we have 16 elements: then each bucket will hold one node, and any element can be found with a single lookup.
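For instance, Java's HashMap lets you choose the initial capacity (and the load-factor threshold) at construction time; the entry counts below are purely illustrative:

    import java.util.HashMap;
    import java.util.Map;

    public class InitialCapacityDemo {
        public static void main(String[] args) {
            // Default constructor: 16 buckets, load factor 0.75.
            Map<String, Integer> defaults = new HashMap<>();

            // If we expect ~1000 entries (an assumed figure), asking for
            // 1000 / 0.75 buckets up front avoids intermediate rehashes
            // while the map fills.
            int expectedEntries = 1000;
            int capacity = (int) Math.ceil(expectedEntries / 0.75);
            Map<String, Integer> presized = new HashMap<>(capacity);

            // Or pick both the capacity and the load-factor threshold yourself.
            Map<String, Integer> custom = new HashMap<>(256, 0.5f);
        }
    }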
One commonly cited combination is djb2, a fast hash function, paired with 256 buckets (at the cost of higher memory consumption).
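A minimal sketch of djb2 for strings, with the 256-bucket count kept only as an illustration (the class and method names are made up here):

    public final class Djb2 {
        static final int NUM_BUCKETS = 256;   // power of two, so we can mask

        static int hash(String key) {
            long h = 5381;                              // djb2's starting value
            for (int i = 0; i < key.length(); i++) {
                h = ((h << 5) + h) + key.charAt(i);     // h * 33 + c
            }
            return (int) (h & (NUM_BUCKETS - 1));       // map the hash to a bucket
        }

        public static void main(String[] args) {
            System.out.println(hash("apple"));    // bucket index in [0, 255]
            System.out.println(hash("orange"));
        }
    }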
But a good general rule of thumb is: the hash table should be an array whose length is about 1.3 times the maximum number of keys that will actually be in the table, and the size of that array should be a prime number.
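A small sketch of that rule: size the array at roughly 1.3 times the expected number of keys, then round up to the next prime. The 1.3 factor is the rule of thumb above; the expected key count is something you would estimate per project.

    public final class TableSizing {
        static boolean isPrime(int n) {
            if (n < 2) return false;
            for (int d = 2; (long) d * d <= n; d++) {
                if (n % d == 0) return false;
            }
            return true;
        }

        static int suggestedBuckets(int expectedKeys) {
            int candidate = (int) Math.ceil(expectedKeys * 1.3);
            while (!isPrime(candidate)) {
                candidate++;                 // walk up to the next prime
            }
            return candidate;
        }

        public static void main(String[] args) {
            System.out.println(suggestedBuckets(1000));   // prints 1301
        }
    }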
Hash buckets are used to apportion data items for sorting or lookup purposes. The aim is to keep the linked lists short so that any specific item can be found quickly. A hash table that uses buckets is really a combination of an array and linked lists.
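A bare-bones sketch of that "array of linked lists" layout (separate chaining); the class name and fixed bucket count are invented for illustration, not taken from any library:

    public class ChainedTable<K, V> {
        private static final class Node<K, V> {
            final K key;
            V value;
            Node<K, V> next;
            Node(K key, V value, Node<K, V> next) {
                this.key = key; this.value = value; this.next = next;
            }
        }

        private final Node<K, V>[] buckets;

        @SuppressWarnings("unchecked")
        public ChainedTable(int numBuckets) {
            buckets = (Node<K, V>[]) new Node[numBuckets];
        }

        private int indexFor(K key) {
            return Math.floorMod(key.hashCode(), buckets.length);  // hash -> bucket
        }

        public void put(K key, V value) {
            int i = indexFor(key);
            for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
                if (n.key.equals(key)) { n.value = value; return; } // overwrite existing key
            }
            buckets[i] = new Node<>(key, value, buckets[i]);        // prepend to the chain
        }

        public V get(K key) {
            int i = indexFor(key);
            for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
                if (n.key.equals(key)) return n.value;              // walk the chain
            }
            return null;
        }
    }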
A good rule of thumb (not always ideal; it is, after all, just a rule of thumb) is to re-hash when the hash table is filled up to 80%. That means if you have 100 buckets and 80 items inside, regardless of how many collisions you had before, it's time to increase capacity.
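A minimal sketch of that 80% rule, showing only the bookkeeping that decides when to grow (the table itself is elided and the names are made up):

    public class LoadFactorPolicy {
        static final double MAX_LOAD_FACTOR = 0.8;   // re-hash at 80% full

        private int capacity = 100;   // current number of buckets
        private int size = 0;         // items stored so far

        /** Call on every insert; doubles the capacity once the threshold is crossed. */
        void recordInsert() {
            size++;
            if ((double) size / capacity >= MAX_LOAD_FACTOR) {
                capacity *= 2;        // e.g. 100 buckets, 80th item -> grow to 200
            }
        }

        public static void main(String[] args) {
            LoadFactorPolicy policy = new LoadFactorPolicy();
            for (int i = 0; i < 80; i++) {
                policy.recordInsert();
            }
            System.out.println(policy.capacity);   // 200: crossed 80% at item 80
        }
    }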
How much should you increase it? Well, there is no perfect value here either. The simplest solution is to double the capacity on each increase, so it goes to 200, 400, 800, and so on. If you think this is too much (after all, it will jump from 8 MB to 16 MB of memory when the hash table gets really large, and you may never fill up the 16 MB), choose a smaller growth factor. Growing by at least one third (from 100 to 133) is what I'd recommend as a minimum; growing by 50% each time is a reasonable compromise.
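A quick comparison of those growth factors after a few resizes, starting from an assumed capacity of 100 (the helper name is invented):

    public class GrowthFactors {
        static int grow(int capacity, double factor) {
            return (int) Math.ceil(capacity * factor);
        }

        public static void main(String[] args) {
            int doubling = 100, oneThird = 100, half = 100;
            for (int step = 0; step < 4; step++) {
                doubling = grow(doubling, 2.0);      // 200, 400, 800, 1600
                oneThird = grow(oneThird, 4.0 / 3);  // 134, 179, 239, 319 (~+33% each)
                half     = grow(half, 1.5);          // 150, 225, 338, 507
            }
            System.out.printf("x2: %d, +1/3: %d, +50%%: %d%n", doubling, oneThird, half);
        }
    }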
Note that all of this also depends on how collisions are handled. A simple way to handle them (my personal favorite) is to store the colliding items in a linked list. If 3 items hash to the same bucket, there are still only up to 3 comparisons needed to find one of them. Since linked lists are very inefficient to search, you may want to increase capacity earlier, e.g. at 60% load, to keep the hash table fast.

On the other hand, you can do something more sophisticated and keep statistics about the number of collisions. As long as you hardly have any collisions (because you have a very good hash function), there is no need to re-hash at all, even if 99% of the capacity is in use. And if you handle collisions in a more sophisticated way (e.g. each bucket is itself a sorted table that you can binary-search), your lookups might still be fast enough even when the table is loaded to 200% (twice as many items as buckets). In that case you could track how big the largest of these sorted tables is, and when it grows beyond, say, 8 entries, decide that lookups are getting too slow and re-hash. A sketch of that idea follows below.
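A sketch of the "keep stats about collisions" idea, here tracking chain length with linked lists (rather than the sorted sub-tables mentioned above) and flagging a re-hash once any chain exceeds 8 entries; the class and method names are invented for illustration:

    import java.util.LinkedList;

    public class ChainStats<K> {
        static final int MAX_CHAIN_LENGTH = 8;   // beyond this, lookups feel slow

        private final LinkedList<K>[] buckets;

        @SuppressWarnings("unchecked")
        public ChainStats(int numBuckets) {
            buckets = new LinkedList[numBuckets];
            for (int i = 0; i < numBuckets; i++) {
                buckets[i] = new LinkedList<>();
            }
        }

        /** Inserts a key and reports whether the table should be re-hashed. */
        boolean addAndCheck(K key) {
            int i = Math.floorMod(key.hashCode(), buckets.length);
            buckets[i].addFirst(key);
            return buckets[i].size() > MAX_CHAIN_LENGTH;   // trigger on an over-long chain
        }
    }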
Re-hashing is very slow, so it should happen as rarely as possible. Thus, if you do need to re-hash, don't grow the capacity by too little, otherwise you'll have to re-hash again soon after adding more items. When you re-hash, make the new capacity significantly larger than the number of items currently in the table; anything less is too little capacity.
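A sketch of why re-hashing is expensive: every stored entry has to be moved into a freshly allocated (and noticeably larger) bucket array, so the whole operation is O(n). The chained layout and names mirror the earlier sketches and are assumptions, not a specific library's API.

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;

    public class Rehasher {
        /** Rebuilds the table with the given (larger) bucket count. */
        static <K> List<LinkedList<K>> rehash(List<LinkedList<K>> oldBuckets, int newBucketCount) {
            List<LinkedList<K>> newBuckets = new ArrayList<>(newBucketCount);
            for (int i = 0; i < newBucketCount; i++) {
                newBuckets.add(new LinkedList<>());
            }
            for (LinkedList<K> chain : oldBuckets) {
                for (K key : chain) {                            // touch every single item...
                    int idx = Math.floorMod(key.hashCode(), newBucketCount);
                    newBuckets.get(idx).addFirst(key);           // ...and re-insert it
                }
            }
            return newBuckets;
        }
    }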
Generally, you watch the load factor (which you already described informally), formally defined as α = n / N, i.e. the ratio of stored items n to total buckets N. For a hash table to function properly (or at least for its performance to be reasoned about in mathematical terms), it should satisfy α < 1.
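The same definition in code, with arbitrary sample numbers:

    public class LoadFactor {
        public static void main(String[] args) {
            int n = 80;    // items stored
            int N = 100;   // total buckets
            double alpha = (double) n / N;
            System.out.println(alpha);                         // 0.8
            System.out.println(alpha < 1.0 ? "ok" : "over-full");
        }
    }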
Everything else really comes down to empirical testing: if you see that your hash table stops performing well once α > 0.5, then be sure to stay under that value. The threshold also depends on your collision resolution technique; hashing with chaining may tolerate different load factors than hashing with open addressing. Yet another factor is cache locality: if the table gets too big, it will no longer fit into the caches (or eventually even main memory), and since your accesses into the array are random, cache misses may become the bottleneck.