
Google Interview Question [closed]

This was one of the Google interview questions.

What problems could arise if a hash table grows to more than 30 GB (ignoring problems like a bad hash function)?

I didn't know the answer. What would be a satisfactory answer?

Thanks

asked Sep 06 '11 by alan

1 Answer

The answer partly depends on whether they are talking about a classic hash table implementation (like Hashtable/HashMap in Java) or something more sophisticated. Either way, 30 GB of memory is still very large for a single machine/VM by today's standards.

So think about what's going on underneath:

  1. It has to read and write at arbitrary positions in some massive array.
  2. It has to grow once it fills up beyond some threshold; see 'load factor' in the Java implementation. (A minimal sketch of this resize behaviour follows this list.)
  3. In a garbage-collected language/implementation, all the objects stored in the hash table need to be inspected by the garbage collector.
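
To make point 2 concrete, here is a deliberately tiny sketch (my own illustration, not the real java.util.HashMap code) of an open-addressing table that doubles its backing arrays once it passes the load factor. The class name and the use of primitive long keys/values are just simplifications:

```java
// Illustrative only: a toy open-addressing hash table that grows by doubling.
public class TinyIntMap {
    private static final double LOAD_FACTOR = 0.75; // same default as java.util.HashMap

    private long[] keys = new long[16];
    private long[] values = new long[16];
    private boolean[] used = new boolean[16];
    private int size = 0;

    public void put(long key, long value) {
        if (size + 1 > keys.length * LOAD_FACTOR) {
            grow(); // needs a brand-new, larger array -- the expensive step at 30 GB
        }
        int i = indexFor(key, keys.length);
        while (used[i] && keys[i] != key) {   // linear probing: jump around the array
            i = (i + 1) % keys.length;
        }
        if (!used[i]) size++;
        used[i] = true;
        keys[i] = key;
        values[i] = value;
    }

    private void grow() {
        long[] oldKeys = keys, oldValues = values;
        boolean[] oldUsed = used;
        keys = new long[oldKeys.length * 2];     // the new arrays live alongside
        values = new long[oldValues.length * 2]; // the old ones while we copy
        used = new boolean[oldUsed.length * 2];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldUsed[i]) {
                put(oldKeys[i], oldValues[i]); // re-hash every existing entry
            }
        }
    }

    private static int indexFor(long key, int capacity) {
        return Math.floorMod(Long.hashCode(key), capacity);
    }
}
```

Notice that while grow() is copying, the old and the new arrays are both live at the same time, which is what makes point 2 of the next list so expensive at this scale.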

All of which leads to the following problems:

  1. It is unclear whether even today's operating systems deal well with allocating chunks of memory in the tens of GBs.
  2. For simplicity, say that half of the memory is actually used by the table itself (not the key and value objects). So there is a 15 GB array inside, and every time the table grows you need to allocate at least another 15 GB.
  3. Even if an array of tens of GBs could be allocated, the OS would page some of that memory out. Since we are assuming a good hash function, we will break page caching if we use most of the data in the array. There will be a lot of page faults.
  4. Let's say we don't use all the data. Some keys are used frequently and others are not. To illustrate, say that each key-value pair is tiny -- 128 bytes -- and, for simplicity, that we store everything in the hash table as values. So 30 GB / 128 bytes = ~250M entries. But say only 25K of those keys are commonly used (25K / 250M = 0.01%). With a good hash function, these would be evenly scattered across the massive array. Even with a small page size -- say 4 KB -- the 25K (entries) * 128 bytes (entry size) = ~3 MB worth of commonly used data costs us 25K (entries) * 4 KB (page size) = ~100 MB of memory that needs to be kept paged in... at a whopping 3% efficiency! (The arithmetic is spelled out in the snippet after this list.)
  5. In the Java world, practitioners don't recommend heap sizes larger than 4-8 GB. Sure, there's stuff like Azul, but that simply proves the point -- a typical garbage collector doesn't scale to these sizes very well.
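
Here is the point-4 arithmetic worked through as a throwaway calculation. The 30 GB table, 128-byte entries, 25K hot keys and 4 KB pages are the assumptions from above, plus the pessimistic (but likely, given a good hash function) assumption that every hot entry sits on its own page:

```java
// Back-of-the-envelope numbers from point 4; the inputs are assumptions, not measurements.
public class PagingMath {
    public static void main(String[] args) {
        long tableBytes = 30L * 1024 * 1024 * 1024; // 30 GB table
        long entryBytes = 128;                      // one tiny key + value
        long hotEntries = 25_000;                   // commonly used keys
        long pageBytes  = 4 * 1024;                 // 4 KB OS page

        long totalEntries  = tableBytes / entryBytes;              // ~250M entries
        double hotShare    = 100.0 * hotEntries / totalEntries;    // ~0.01% of entries
        long hotDataBytes  = hotEntries * entryBytes;              // ~3 MB of useful data
        long residentBytes = hotEntries * pageBytes;               // ~100 MB kept paged in
        double efficiency  = 100.0 * hotDataBytes / residentBytes; // useful / resident

        System.out.printf("entries=%,d hotShare=%.3f%% hotData=%,d B resident=%,d B efficiency=%.1f%%%n",
                totalEntries, hotShare, hotDataBytes, residentBytes, efficiency);
    }
}
```

Run it and you get roughly 250M entries, with ~3 MB of hot data pinning ~100 MB of resident pages, i.e. about 3% efficiency.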

I agree with other posters that Google is looking for a distributed solution. But I think that, at heart, a simple hash table stops scaling beyond a point. Given the above:

  1. You would have to distribute if all the entries are accessed relatively evenly.
  2. If some keys are accessed the majority of the time, using two maps (one for the most frequently used keys) can buy you a lot; see the sketch after this list.
  3. In the Java world, using specialized maps that store data off the heap can buy you performance too; see Peter Lawrey's work for example.
  4. Even simply striping the underlying array of a hash table (like Java's ConcurrentHashMap does) can buy you major improvements when you have to grow the hash table.
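
For point 2 in particular, here is a minimal sketch of the 'two maps' idea: a small, bounded LRU of hot keys in front of the big map. The class name and the LinkedHashMap-based LRU are my illustrative choices, and in practice the 'cold' side might be off-heap or distributed rather than a plain HashMap:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative two-tier map: a small hot LRU in front of a large cold map.
public class TwoTierMap<K, V> {
    private final Map<K, V> cold = new HashMap<>(); // the huge table (could be off-heap or remote)
    private final Map<K, V> hot;                    // small LRU of frequently used keys

    public TwoTierMap(final int hotCapacity) {
        // An access-ordered LinkedHashMap with removeEldestEntry is a simple bounded LRU.
        this.hot = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > hotCapacity;
            }
        };
    }

    public V get(K key) {
        V value = hot.get(key);
        if (value != null) {
            return value;            // served from the small, page-cache-friendly map
        }
        value = cold.get(key);
        if (value != null) {
            hot.put(key, value);     // promote on access; the LRU evicts the eldest entry
        }
        return value;
    }

    public void put(K key, V value) {
        cold.put(key, value);
        hot.put(key, value);
    }
}
```

With the numbers from earlier, a hot map of 25K tiny entries fits in a few MB and stays resident, while the cold map absorbs the other ~250M entries.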
answered Oct 13 '22 by Dilum Ranatunga