
Basic Hashtable algorithm - removing duplicates

I just had an interview this morning and I was given the question "Give an algorithm for removing duplicates from a list of integers". This is a fairly standard question, so I was pretty confident I could answer it.

I am paraphrasing, but I said something along the lines of "You could use a hashtable. Start with the first integer and insert it into the hashtable. Then for each successive integer do a hashtable lookup to check if the integer is already in the hashtable; if not, insert it, and if it's already there, throw it away because it is a duplicate. Iterate through the list in this way. If the hashtable is designed correctly, the lookups and inserts should be constant time on average."
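To make it concrete, this is roughly the approach I had in mind (a rough sketch in Python, using the built-in hash-based set; the function name remove_duplicates is just for illustration):

    def remove_duplicates(numbers):
        # 'seen' is a hash-based set: membership tests and inserts
        # are expected O(1) with a good hash function.
        seen = set()
        result = []
        for n in numbers:
            if n not in seen:      # hashtable lookup
                seen.add(n)        # hashtable insert
                result.append(n)
            # otherwise it is a duplicate and gets thrown away
        return result

    print(remove_duplicates([3, 1, 3, 2, 1]))  # [3, 1, 2]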

Then the interviewer responded (again I am paraphrasing) "But hashtable lookups are not constant time, they depend on how many elements are already in it. The algorithm you described would be O(n^2)"

Then I responded "Really? I thought that if you designed a good hash function, the lookups would be constant time, making the whole algorithm O(n) typically."

Then the interviewer responded "So you are saying that the lookup time would be the same for a hash table with many entries and a hashtable with few entries"

Then I said "Yes. If it is designed correctly."

Then the interviewer said "This is not true"

So I am very confused right now. If someone could point out where I am wrong, I would be very grateful.

asked May 16 '13 by user1893354
1 Answer

If someone could point out where I am wrong

You are not wrong at all: a properly designed hash table gives you an expected lookup efficiency of O(1) and inserts in amortized O(1), so your algorithm is O(N) overall. Lookup in a heavily loaded hash table is indeed a little slower because of collision resolution, but the expected lookup time remains O(1). This may not be good enough for real-time systems, where "amortized" and "expected" do not count, but in all practical situations it is enough.

Of course you could always use a balanced tree for the items that you've seen, for a worst-case O(N log N) algorithm. Or, if the numbers have reasonable bounds (say, between 0 and 100,000), you could use a boolean array to test membership in O(1) worst-case time, which is also a potential improvement over a hash table because of its smaller constant factors.
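For instance, here is a minimal sketch of the bounded-range variant (in Python, assuming the values are known to lie between 0 and 100,000; the names are illustrative only):

    def remove_duplicates_bounded(numbers, max_value=100_000):
        # One boolean flag per possible value gives O(1) worst-case
        # membership tests, at the cost of O(max_value) extra space.
        seen = [False] * (max_value + 1)
        result = []
        for n in numbers:
            if not seen[n]:
                seen[n] = True
                result.append(n)
        return result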

answered Oct 11 '22 by Sergey Kalinichenko