This was one of the Google interview questions:
What is the possible problem if a hash table grows to more than 30 GB
(ignoring problems like a bad hash function)?
I didn't know the answer. What would be a satisfactory one?
Thanks
The answer partly depends on whether they were talking about a classic hash table implementation (like Hashtable / HashMap in Java) or something more sophisticated. Either way, 30 GB of memory is still very large for a single machine/VM by today's standards.
So think about what's going on underneath:
- It has to read/write at an arbitrary position in some massive array.
- It has to grow when it fills up beyond some threshold; see 'load factor' in the Java implementation (a minimal sketch follows below).
- In a garbage-collected language/implementation, all the objects stored in the hash table need to be inspected by the garbage collector.
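To illustrate the second point, here is a minimal, hypothetical sketch of that growth behaviour. This is not the JDK's actual HashMap code, just the classic open-addressing pattern with a load factor; the class name and fields are made up. The resize step is what becomes painful at this scale:

```java
// Minimal sketch (hypothetical class, not the JDK source) of the classic growth
// behaviour: once the table passes capacity * loadFactor, a second backing array
// twice the size is allocated and every entry is rehashed into it.
class SimpleHashTable<K, V> {
    private static final double LOAD_FACTOR = 0.75; // same default as java.util.HashMap

    private Object[] keys = new Object[16];
    private Object[] values = new Object[16];
    private int size = 0;

    public void put(K key, V value) {
        if (size + 1 > keys.length * LOAD_FACTOR) {
            resize(); // with a ~15 GB backing array, this copy is the painful part
        }
        int i = Math.floorMod(key.hashCode(), keys.length);
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % keys.length; // linear probing lands at an arbitrary position
        }
        if (keys[i] == null) {
            size++;
        }
        keys[i] = key;
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    private void resize() {
        Object[] oldKeys = keys;
        Object[] oldValues = values;
        keys = new Object[oldKeys.length * 2];     // the old and new arrays coexist
        values = new Object[oldValues.length * 2]; // until the rehash completes
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != null) {
                put((K) oldKeys[i], (V) oldValues[i]);
            }
        }
    }
}
```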
Which leads to the following problems:
- It is not clear that even today's operating systems deal well with allocating single chunks of memory in the tens of gigabytes.
- For simplicity, say that half of that memory is used by the table itself (not the key and value objects), so there is a 15 GB array inside. That means every time the table grows, you need to allocate at least another 15 GB alongside the old array.
- Even if an array of tens of GB could be allocated, the OS would page out some of that memory. Since we are assuming a good hash function, accesses are spread across most of the array, which defeats page caching; there will be a lot of page faults.
- Let's say we don't use all the data evenly: some keys are accessed frequently, others rarely. To illustrate, say each key-value entry is tiny -- 128 bytes -- and, for simplicity, that we store everything in the hash table as values. So 30 GB / 128 bytes ≈ 250M entries. Now suppose only 25K of those keys are commonly used (25K / 250M = 0.01%). With a good hash function, those hot entries are scattered evenly across the massive array. Even with a small page size -- say 4 KB -- the 25K (entries) * 128 bytes (entry size) ≈ 3 MB of commonly used data costs us 25K (entries) * 4 KB (page size) ≈ 100 MB of memory that has to be kept paged in... at a whopping ~3% efficiency (see the small calculation after this list).
- In the Java world, practitioners don't recommend heap sizes larger than 4 to 8 GB. Sure, there's stuff like Azul, but that simply proves the point -- a typical garbage collector doesn't scale to these sizes very well.
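If it helps, here is the paging arithmetic from the bullet above as a throwaway program. All numbers are the illustrative assumptions from the text, not measurements:

```java
// Back-of-the-envelope arithmetic for the paging example above.
// All numbers are illustrative assumptions, not measurements.
public class PagingCost {
    public static void main(String[] args) {
        long tableBytes = 30L * 1024 * 1024 * 1024; // 30 GB table
        long entryBytes = 128;                      // tiny key-value entry
        long pageBytes = 4 * 1024;                  // 4 KB OS page
        long hotKeys = 25_000;                      // commonly used keys

        long totalEntries = tableBytes / entryBytes;             // ~250M entries
        double hotFraction = 100.0 * hotKeys / totalEntries;     // ~0.01%

        // A good hash function scatters the hot entries, so in the worst case each
        // hot entry sits on its own page and keeps that whole page resident.
        long usefulBytes = hotKeys * entryBytes;                 // ~3 MB of hot data
        long residentBytes = hotKeys * pageBytes;                // ~100 MB paged in
        double efficiency = 100.0 * usefulBytes / residentBytes; // ~3%

        System.out.printf("entries=%,d  hot=%.2f%%  useful=%d MB  resident=%d MB  efficiency=%.1f%%%n",
                totalEntries, hotFraction, usefulBytes >> 20, residentBytes >> 20, efficiency);
    }
}
```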
I agree with the other posters that Google is looking for a distributed solution. But I think that, at its heart, a simple hash table stops scaling beyond a point. Given the above:
- You would have to distribute if all the entries are accessed relatively evenly.
- If some keys are accessed a majority of the time, using two maps (one for the most frequently used keys) can buy you a lot (a toy sketch follows after this list).
- In the Java world, using specialized maps that store data off the heap can buy you performance too; see Peter Lawrey's work for example.
- Even simply striping the underlying array in a hash table (like Java's ConcurrentHashMap does) can buy you major improvements when you have to grow the table.
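As a toy illustration of the two-maps point above, a small hot-key cache in front of the big table might look something like this. The class name and sizes are invented for the example, and it is single-threaded for simplicity:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy, single-threaded sketch of the "two maps" idea: a small, bounded LRU map of
// hot keys sits in front of the big table. The name and sizes are made up.
class HotColdMap<K, V> {
    private static final int HOT_CAPACITY = 25_000;

    // Small access-ordered map for the frequently used entries (a few MB at most).
    private final Map<K, V> hot = new LinkedHashMap<K, V>(HOT_CAPACITY, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > HOT_CAPACITY;
        }
    };

    // The big table; its internal striping helps when it has to grow.
    private final Map<K, V> cold = new ConcurrentHashMap<>();

    public V get(K key) {
        V v = hot.get(key);      // hit path touches only a handful of pages
        if (v == null) {
            v = cold.get(key);   // miss path goes out to the big table
            if (v != null) {
                hot.put(key, v); // promote the key into the hot map
            }
        }
        return v;
    }

    public void put(K key, V value) {
        cold.put(key, value);
        hot.put(key, value);
    }
}
```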