Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do Python dictionary hash lookups work?

Tags:

How do Python dictionary lookup algorithms work internally?

mydi['foo']  

If the dictionary has 1,000,000 terms, is a tree search executed? Would I expect performance in terms of the length of the key string, or the size of the dictionary? Maybe stuffing everything into a dictionary is just as good as writing a tree search index for strings of size 5 million?

like image 983
shigeta Avatar asked Jul 07 '11 02:07

shigeta


People also ask

How does Python dictionary lookup work?

Python gets a benefit from this as most name lookups involve dictionaries and a single copy of each variable or attribute name is stored internally, so every time you access an attribute x.y there is a dictionary lookup but not a call to a hash function.

How does Python dictionary hash work?

The dictionary uses each key's hash function to change some of the key's information into an integer known as a hash value. The hash value tells you which bucket that key-value pair should be placed in. This way every time you need to lookup or find that key-value pair, you know exactly which bucket to search.

Do Python dictionaries use hash?

Dictionaries in Python are implemented using hash tables. It is an array whose indexes are obtained using a hash function on the keys.

Is a Python dictionary a hash table or hash map?

Yes, it is a hash mapping or hash table. You can read a description of python's dict implementation, as written by Tim Peters, here. You can read more about hash tables or check how it has been implemented in python and why it is implemented that way.


1 Answers

Here's some pseudo-code closer to what actually happens. Imagine the dictionary has a data attribute containing the key,value pairs and a size which is the number of cells allocated.

def lookup(d, key):     perturb = j = hash(key)     while True:         cell = d.data[j % d.size]         if cell.key is EMPTY:             raise IndexError         if cell.key is not DELETED and (cell.key is key or cell.key == key):             return cell.value         j = (5 * j) + 1 + perturb         perturb >>= PERTURB 

The perturb value ensures that all bits of the hash code are eventually used when resolving hash clashes but once it has degraded to 0 the (5*j)+1 will eventually touch all cells in the table.

size is always much larger than the number of cells actually used so the hash is guaranteed to eventually hit an empty cell when the key doesn't exist (and should normally hit one pretty quickly). There's also a deleted value for a key to indicate a cell which shouldn't terminate the search but which isn't currently in use.

As for your question about the length of the key string, hashing a string will look at all of the characters in the string, but a string also has a field used to store the computed hash. So if you use different strings every time to do the lookup the string length may have a bearing, but if you have a fixed set of keys and re-use the same strings the hash won't be recalculated after the first time it is used. Python gets a benefit from this as most name lookups involve dictionaries and a single copy of each variable or attribute name is stored internally, so every time you access an attribute x.y there is a dictionary lookup but not a call to a hash function.

like image 75
Duncan Avatar answered Oct 19 '22 09:10

Duncan