Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is a hash table lookup only O(1) time when searching for a key is O(n)?

Technically speaking, based on posts I've read here, Hash table is indeed O(n) time lookup in the worst case. But I don't get how the internal mechanics guarantee it to be O(1) time on average.

My understanding is that given some n elements, the ideal case is there are n buckets, which result in O(1) space. This is where I'm stuck. Suppose I want to lookup whether a key is in the dictionary, this definitely takes me O(n) time. So why does it make a difference when I want to search whether an element is in the hash table by using its key's hash value? To put it concisely, searching using raw key values gives O(n) time but using hash values it's O(1) time. Why is that?

Wouldn't I still need to look up the hash values one by one to see which one matches? Why does the hashing let me know immediately which element to retrieve or whether such an element exists or not?

like image 804
oldselflearner1959 Avatar asked Feb 03 '18 23:02

oldselflearner1959


2 Answers

I think you are confusing terminology and also complicating matters by thinking about buckets.

Let's imagine a hash table that is implemented in as an array a of length n. Let's also imagine we have n possible keys and a perfect hash function H that maps each key k to a unique index i in a.

Let's initialize our hash table by setting every value in a to nil.

We can insert a key, value pair (k1, v1) into our hash table by placing the value at the appropriate position in the array:

a[H(k1)] = v1

Now let's say that later we forgot if k1 is in the hash table and we want to check if it's there. To do this we simply look up a[H(k1)] and see if any value is there, i.e. a[H(k1)] != nil. This is clearly a constant time lookup.

But what if we want to see if v1, or even some other v2 is anywhere in our hashtable? This is not so easy because we have no function that maps a vi to a position in our array. It could be associated with any key. So the only way to see if it exists in the table is to scan the entire array, checking every value:

for i in 0..n-1:
  if a[i] == v2:
    return true
return false

To make this a bit more concrete, imagine your keys are names, and your values are cities of residence. Now compare asking "Is Bob Jones in the hash table?" to "Is there anyone from New York in the hash table?". We can hash "Bob Jones" and see if there's anything in the corresponding array position (because that's how "Bob Jones" would have been inserted), but we have no similarly quick way to look up "New York".

I am assuming this is what you are asking, and you have confused the terminology a bit. Please comment if this is not what you wanted.

like image 140
Imran Avatar answered Oct 09 '22 07:10

Imran


Sounds like you are in search of more elaborated explanation!

I assume that you already understand that array element lookup takes O(1) i.e. if I already knew that I wanted to lookup 100th element in array then it would only take me O(1) because this is a simple memory address lookup (by adding 100 to the first element's address).

The method of hashing makes use of this memory address lookup to achieve O(1) average time. Now obviously this means that you need to be able to convert lookup key to memory address. Let me give you a very simplified example of how this works in a hashtable (just to be clear, dictionaries implement hashtable under the hood so when I mention hashtable , the exact same principles are applicable to dictionaries too).

The simplified example scenario; we need to lookup mailing addresses of customers by their first name. For simplicity assume that names are going to be unique and they've normal a to z letter. Let's say initially we are designing this for only 10 customers (i.e. their names and their addresses).

Now lets say that we must solve this by storing name-addresses pairs in hashtable and we must create our own hashfunction!!! A hashfunction which would take name as a parameter and convert it into a memory lookup!!.

Now take a moment and think how many arrays are required here? What would be their type and what would be their size?
We definitely need one array to store mailing addresses. What should be the size? Well we need to store 10 mailing addresses so size has to be 10! We would also need second array for storing element indexes of first array!! Or in other words we need a second array to store references to mailing addresses (from first array) for our customer names. What should be the size of this array? Definitely larger than 10! But it really comes down to the hashfunction we design. For simplicity let's create a hashfunction which simply takes first letter of the name parameter and converts it to index. i.e. If name starts from A then it hashvalue is 1, for b it is 2, for c it is 3... for z it is 26. So at the least our lookup array size has to be 26 (you must be thinking that that's the wastage of lot of space for storing 10 addresses!! but it may worth it since it is going to give us performance) Let’s try to understand this with an example. Let’s say our first customer name is Bob. To store address for Bob first step is to find the first empty element in mailing address array. This is the first name so whole mailing address array is empty. We can store Bob’s address at index zero in mailing address array. When we store this address we’ll also mark it as Bob’s address at index 0. (I’m using this ‘marking’ terminology to later explain lookup vs searching) Then we find out hashvalue for name Bob. In this case it would be 2! So in the lookup array at location 2 we store 0. (i.e. the index to mailing address for Bob). Now lets say our second customer is Hamish; we store mailing address for Hamish at index 1 (i.e. the second element) in mailing address array; mark it as Hamish’s address and then we find out hashvalue for Hamish. Since Hamish starts from ‘H’, value would be 8. So in our lookup array at location 8 we store value 1 (i.e. the index for Hamish’s address). We can repeat this procedure for all 10 customers and store their addresses. Now later when you want to lookup Bob’s address you could look it up very fast simply by following a simple two step procedure. Step 1- convert name Bob to hashvalue ; answer is 2; go ahead and check location 2 in mailing address array; if it is marked as Bob’s address then return location 2 !! Same for Hamish; H-> gives 8. Go ahead and look up address from location 8; if it is marked as Hamish’s address then return address from location 8. This mechanism is called ‘looking up’. If you hadn’t created the second array (lookup array) then you’d only have mailing address array and you’d have to go through each address one by one and check if it is marked with the customer name you are looking for or not!. Now, what if there are two customer names starting with same letter? That’s called hash collision and that can be dealt with different approaches. What if we need to store 10000 names? That means we are have to use a better hashfunction which would give us less hash collisions. I’m not covering those two terminology here since I believe that question only demanded explaining lookup vs searching.

like image 42
sbp Avatar answered Oct 09 '22 07:10

sbp