I just had an interview this morning and I was given the question "Give an algorithm for removing duplicates from a list of integers". This is a fairly standard question so I was pretty confident I could answer it.
I am paraphrasing, but I said something along the lines of: "You could use a hashtable. Start with the first integer and insert it into the hashtable. Then, for each successive integer, do a hashtable lookup to check whether the integer is already in the hashtable; if not, insert it, and if it's already there, throw it away because it is a duplicate. Iterate through the list in this way. If the hashtable is designed correctly, the lookups and inserts should be constant time on average."
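For concreteness, the approach described above can be sketched in Python, using a built-in set as the hashtable (the function name is just for illustration):

```python
def remove_duplicates(nums):
    """Remove duplicates from a list of integers, preserving first-seen order.

    Each membership test and insert on the set is expected O(1),
    so the whole pass over n integers is expected O(n).
    """
    seen = set()           # hash-based set plays the role of the hashtable
    result = []
    for x in nums:
        if x not in seen:  # expected O(1) lookup
            seen.add(x)    # expected O(1) insert
            result.append(x)
    return result

print(remove_duplicates([3, 1, 3, 2, 1]))  # → [3, 1, 2]
```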
Then the interviewer responded (again I am paraphrasing), "But hashtable lookups are not constant time; they depend on how many elements are already in it. The algorithm you described would be O(n^2)."
Then I responded, "Really? I thought that if you designed a good hash function, lookups would be constant time, making the algorithm typically O(n)."
Then the interviewer responded, "So you are saying that the lookup time would be the same for a hashtable with many entries and a hashtable with few entries?"
Then I said, "Yes, if it is designed correctly."
Then the interviewer said, "This is not true."
So I am very confused right now. If someone could point out where I am wrong, I would be very grateful.
"If someone could point out where I am wrong"
You are not wrong at all: a properly designed hash table gives you an expected lookup cost of O(1) and inserts in amortized O(1), so your algorithm is O(n). Lookups in a heavily loaded hash table are indeed a little slower because of collision resolution, but the expected lookup time remains O(1). This may not be good enough for real-time systems, where "amortized" does not count, but in all practical situations it is enough.
Of course, you could always use a balanced tree to hold the items you've seen, for a worst-case O(n log n) algorithm. Or, if the numbers have reasonable bounds (say, between 0 and 100,000), you could use a boolean array to test membership in worst-case O(1), which is also a potential improvement over a hash table because of its smaller constant factor.
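The boolean-array variant might look like this in Python (a sketch; the bound of 100,000 is taken from the example above and must be known in advance):

```python
def remove_duplicates_bounded(nums, max_value=100_000):
    """Dedup integers known to lie in [0, max_value] using a boolean array.

    Each membership test is a plain array index: worst-case O(1),
    with no hashing involved. Uses O(max_value) extra space.
    """
    seen = [False] * (max_value + 1)  # one flag per possible value
    result = []
    for x in nums:
        if not seen[x]:
            seen[x] = True
            result.append(x)
    return result

print(remove_duplicates_bounded([5, 5, 0, 100_000, 5]))  # → [5, 0, 100000]
```

The trade-off is memory: the array is sized by the value range, not by the number of inputs, so it only pays off when the bound is modest.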