 

Is Google stupid or is it me? Hash sets with possible collisions vs classic safe algorithms

So, I was watching one of these Google videos on how they conduct interviews (https://www.youtube.com/watch?v=XKu_SEDAykw) and I found their solution quite weird. Since there are a lot of clever people working at Google, I am now wondering whether I have got something wrong or they have. Let me sum up the task and the solution in case you don't want to watch it:

The task is to give an efficient algorithm for the following problem:

Given an array A of integers and a separate integer a, find two indices i,j, such that A[i]+A[j] = a.

They start off with the array being sorted and produce a nice linear-time algorithm. But then the interviewer asks what happens if the array isn't sorted, and they propose the following linear-time algorithm (they say that first sorting the array and then running their linear-time algorithm is too slow, although it would run in n log n time):

They go through the array from first to last and use a hash set to store the numbers they have already seen. For each index of the array they then only need to check against the hash set (i.e. have I already seen the number I need to complete the sum?), and since that lookup is apparently possible in constant time, the whole algorithm runs in linear time (essentially the cost of one hash-set lookup times Array.length).
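For concreteness, here is a minimal sketch of that approach as I understand it (Python, names my own; I use a dict rather than a bare set so that the two indices can be returned):

    def find_pair_indices(A, a):
        # Maps each value we have already seen to the index where it occurred.
        seen = {}
        for j, value in enumerate(A):
            complement = a - value
            if complement in seen:   # expected O(1) hash lookup
                return seen[complement], j
            seen[value] = j          # remember this value for later elements
        return None

    # Example: find_pair_indices([8, 2, 11, 7], 9) returns (1, 3), since 2 + 7 == 9.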

Now to my criticism: I think there is a huge flaw in this solution, which lies in the possibility of collisions. Since they consider n log n to be too slow, we can assume that the hash set has fewer than log n buckets. But for any large input, the probability of a collision tends to 1 when hashing n numbers into at most log n buckets. So they trade a very modest speed increase for an almost certainly erroneous algorithm: they assume ten billion is a large size for the array, but the log (base 2) of that is still only about 33, and matching this speed with the hash-set algorithm would mean that roughly 300 million numbers get hashed to the same spot.

Either I am misunderstanding something about hashing, or this is an awful solution to the problem. Again, the safe n log n algorithm is not much slower than the one they give, unless the array gets so big that the hash algorithm is essentially guaranteed a collision.

I wouldn't be surprised if a constant-time algorithm that flips a coin for small arrays and always says yes for large arrays had the same average rate of success as their hash-set solution.

If I misunderstand something about hashing, please point it out, because I find it rather hard to believe that they would make such an error at a top-notch computer engineering company.

Asked Feb 05 '23 by Sebastian Mueller

1 Answer

To be clear, a "hash set" is a hash table where the key is the entire entry; there is no associated value, so the only interesting fact about a key is that it is present. This is a minor detail in the implementation of a hash table.

As has been noted, there is no basis for your statement that the size of the hash set needs to be less than log n in order to reduce lookup time. It's the other way around: the size of the hash set (the number of buckets) should be linear in the size of the dataset, so that the expected length of a hash chain is O(1). (For complexity analysis, it doesn't matter whether the expected length of a hash chain is 1 or 1,000: both are O(1).)
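To make the sizing point concrete, here is a toy separate-chaining hash set (my own sketch, not anything from the video). Because the bucket count is chosen proportional to the number of elements, the average chain length, and hence the expected lookup cost, stays constant no matter how large n gets:

    class ChainedHashSet:
        def __init__(self, n_expected):
            # Roughly one bucket per expected element keeps the load factor near 1.
            self.buckets = [[] for _ in range(max(1, n_expected))]

        def _bucket(self, x):
            return self.buckets[hash(x) % len(self.buckets)]

        def add(self, x):
            bucket = self._bucket(x)
            if x not in bucket:
                bucket.append(x)

        def __contains__(self, x):
            # Scans a single short chain, not the whole data set.
            return x in self._bucket(x)

Whether n is a thousand or ten billion, a lookup touches one expected-O(1) chain; collisions do happen, but they cost a couple of extra comparisons, not correctness.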

But even if the expected hash table lookup weren't O(1), there would still be a huge advantage to hashing over sorting: hashing is easily parallelizable. And that is very important to Google, since only parallel algorithms can cope with Google-sized data sets.

In practice, a googly solution to this problem (I think: I haven't watched the video) would be to use two different hashes. The first hash assigns each number to a server, so it has a very large bucket size since each server has a lot of data. Each server then uses its own hash function to map its own data to its own buckets.

Now I can scan the entire dataset in parallel (using other servers), and for each entry, ask the appropriate storage server (which I work out by using the primary hash) whether the value needed to complete the sum (the target minus that entry) is in its dataset. Since every entry can only be stored on one server (or set of servers, if the data is replicated for reliability), I don't actually have to disturb the irrelevant servers. (In practice, I would take a bunch of queries, bucket them by server, and then -- in parallel -- send each server a list of queries, because that reduces connection setup time. But the principle is the same.)
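A single-process toy of that two-level scheme (the server count, function names, and partitioning are all my own invention, purely to illustrate the idea; a real deployment would batch the queries and run them in parallel as described above):

    NUM_SERVERS = 4  # stand-in for a fleet of storage servers

    def primary_hash(x):
        # First-level hash: decides which "server" owns a number.
        return hash(x) % NUM_SERVERS

    def partition(A):
        # Each "server" keeps its own hash set of the numbers it owns;
        # Python's built-in set plays the role of the second-level hash.
        servers = [set() for _ in range(NUM_SERVERS)]
        for x in A:
            servers[primary_hash(x)].add(x)
        return servers

    def find_pair(A, a):
        servers = partition(A)
        for x in A:
            complement = a - x
            # Only the one server that could hold the complement is consulted.
            if complement in servers[primary_hash(complement)]:
                # Guard against pairing an element with itself when a == 2 * x.
                if complement != x or A.count(x) > 1:
                    return x, complement
        return None

The point of the two hashes is locality: the primary hash tells you exactly which server to ask, so a lookup never has to fan out to the whole fleet.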

That's a very easy and almost infinitely scalable approach to the problem, and I think that an interviewer would be happy to hear about it. Parallel sorting is much more difficult, and in this case the complexity is totally unnecessary.

Of course, you may well have a good argument in favour of your own preferred strategy, and a good interviewer will be happy to hear a good argument as well, even if it's not one they thought of before. Good engineers are always open to discussing good ideas. And that discussion cannot start with the assumption that one of two different ideas must be "stupid".

Answered Feb 23 '23 by rici