
How many hash buckets


If I notice that a hash table (or any other data structure built on a hash table) is filling up, at what point should I build a new table with more buckets? And given n items in the table so far, how do I figure out how many buckets to use in the new one?

So let's say I have 100 buckets. Should I reorganize it when there are 50 items in it? 500? 5000? Or should I look for the most-full bucket and key on that? And when I hit that point, how big should I make the new hash table?

Related to this, if you know beforehand roughly how many items will go in, is there a way to compute the number of buckets to get good average performance?

I know the real answer depends on a lot of other considerations, like how important speed is vs. size in a specific case, but I'm looking for general guidelines.

I also know that I shouldn't be optimizing this sort of thing unless good profiling has indicated that this is a bottleneck. I'm just thinking about a project that would use a lot of hash tables and wondered how to approach this.

Matt asked Oct 22 '08


2 Answers

A good rule of thumb (not always ideal, but then, it's just a rule of thumb) is to re-hash when the hash table is 80% full. That means if you have 100 buckets and 80 items inside, regardless of how many collisions you have had so far, it's time to increase capacity.

How much should you increase it? Again, there is no perfect value. The simplest solution is to double the capacity on each increase, so it goes to 200, 400, 800, and so on. If you think this is too much (after all, it will jump from 8 MB to 16 MB of memory once the hash table gets really large, and you may never fill the 16 MB), choose a smaller growth factor. I'd recommend growing by at least 1/3 (from 100 to 133); letting it grow by 50% each time is a reasonable compromise.
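As a minimal sketch of that policy (my own illustration, not a definitive implementation), assuming the 80% threshold and a configurable growth factor from above:

```python
# Illustrative only: a resize policy based on the 80% threshold and a
# growth factor, as described above (both numbers are tunable, not universal).

RESIZE_THRESHOLD = 0.8   # re-hash once the table is 80% full
GROWTH_FACTOR = 2.0      # double the capacity; 1.5 is a gentler compromise

def needs_resize(item_count: int, bucket_count: int) -> bool:
    """True once the load factor reaches the threshold."""
    return item_count / bucket_count >= RESIZE_THRESHOLD

def next_capacity(bucket_count: int) -> int:
    """New bucket count after growing by GROWTH_FACTOR."""
    return int(bucket_count * GROWTH_FACTOR)

# 100 buckets with 80 items: time to grow to 200 (or 150 with a factor of 1.5).
assert needs_resize(80, 100)
assert next_capacity(100) == 200
```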

Note that all of this also depends on how collisions are handled. A simple way to handle them (my personal favorite) is to store colliding items in a linked list. If 3 items land on the same key, there are still only up to 3 comparisons to find one of them. Since linked lists are very inefficient to search, you may want to increase capacity earlier, e.g. once 60% of the capacity is used, to keep the hash table fast. On the other hand, you can do something more sophisticated and keep statistics about the number of collisions. As long as you hardly have any collisions (because you have a very good hash function), there is no need to re-hash at all, even if 99% of the capacity is in use. Also, if you handle collisions in a more sophisticated way (e.g. each bucket is itself a sorted table and you can binary-search within it), your lookups might still be fast enough even when the table is loaded to 200% (twice as many items as buckets). In that case you could keep track of how big the largest sorted table is, and when it grows beyond, say, 8 entries, decide that this is getting too slow and re-hash.
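To illustrate the chaining approach and the "largest bucket" trigger, here is a small sketch of my own in Python; the chain limit of 8 and the doubling on re-hash are just the example numbers from above, and Python's built-in hash stands in for whatever hash function you actually use:

```python
class ChainedHashTable:
    """Minimal separate-chaining hash table (illustrative sketch only).

    Re-hashes when the longest bucket chain exceeds MAX_CHAIN,
    rather than watching the overall load factor.
    """

    MAX_CHAIN = 8  # the "8 entries" threshold from the discussion above

    def __init__(self, buckets=16):
        self._buckets = [[] for _ in range(buckets)]

    def _index(self, key):
        return hash(key) % len(self._buckets)

    def put(self, key, value):
        chain = self._buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)   # update an existing key in place
                return
        chain.append((key, value))        # collision: just extend the chain
        if len(chain) > self.MAX_CHAIN:   # chain too long -> lookups too slow
            self._rehash(len(self._buckets) * 2)

    def get(self, key, default=None):
        for k, v in self._buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def _rehash(self, new_bucket_count):
        # O(n): every item is re-inserted into the new, larger table.
        old_buckets = self._buckets
        self._buckets = [[] for _ in range(new_bucket_count)]
        for chain in old_buckets:
            for key, value in chain:
                self._buckets[self._index(key)].append((key, value))
```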

Re-hashing is very slow, so it should happen as rarely as possible. Thus when you do re-hash, don't grow the capacity by too little, otherwise you will have to re-hash again soon as more items are added. When you re-hash, make the capacity significantly larger than the number of items currently in the table; anything less is too little capacity.
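A rough back-of-the-envelope illustration (my numbers are made up for the example): counting how many item re-insertions happen while a table grows from 100 buckets until it can hold a million items at an 80% load limit shows why a small growth factor is expensive:

```python
def total_rehash_work(target_items, start_buckets=100,
                      growth=2.0, max_load=0.8):
    """Count item re-insertions while the table grows to fit target_items."""
    buckets, items_moved = start_buckets, 0
    while buckets * max_load < target_items:
        items_moved += int(buckets * max_load)  # items present at re-hash time
        buckets = int(buckets * growth)
    return items_moved

# Doubling re-hashes rarely; a 10% growth factor re-hashes constantly.
print(total_rehash_work(1_000_000, growth=2.0))  # ~1.3 million re-insertions
print(total_rehash_work(1_000_000, growth=1.1))  # ~10 million re-insertions
```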

Mecki answered Sep 17 '22


Generally, you look at the load factor (which you already described informally), formally defined as α = n / N, i.e. the ratio of used to total buckets. For a hash table to function properly (or at least to reason about its performance in mathematical terms), it should satisfy α < 1.

Everything else is really up to empirical tests: if you see that your hash table doesn't perform well once α > 0.5, then be sure to stay under that value. The value also depends on your collision resolution technique: hashing with chaining may call for different load factors than hashing with open addressing. Yet another factor is cache locality: if your table gets too big, it won't fit into main memory, and since your accesses into the array are random, cache misses can become a bottleneck.
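To make that concrete, here is a small experiment of my own (not rigorous, just illustrative): fill a plain linear-probing table to various load factors and count the average probes per successful lookup. The cost climbs steeply as α approaches 1, which is one reason open addressing typically wants a lower load factor than chaining:

```python
import random

def avg_probes_linear_probing(load_factor, buckets=1 << 14):
    """Average probes per successful lookup in a linear-probing table."""
    table = [None] * buckets
    keys = random.sample(range(10 ** 9), int(load_factor * buckets))
    for key in keys:                       # insert with linear probing
        i = hash(key) % buckets
        while table[i] is not None:
            i = (i + 1) % buckets
        table[i] = key
    probes = 0
    for key in keys:                       # look every key up again
        i = hash(key) % buckets
        probes += 1
        while table[i] != key:
            i = (i + 1) % buckets
            probes += 1
    return probes / len(keys)

for alpha in (0.5, 0.8, 0.9, 0.99):
    print(f"alpha={alpha}: ~{avg_probes_linear_probing(alpha):.1f} probes")
```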

Konrad Rudolph answered Sep 17 '22