I am trying to understand the Python hash
function under the hood. I created a custom class where all instances return the same hash value.
class C: def __hash__(self): return 42
I just assumed that only one instance of the above class can be in a dict
at any time, but in fact a dict
can have multiple elements with the same hash.
c, d = C(), C() x = {c: 'c', d: 'd'} print(x) # {<__main__.C object at 0x7f0824087b80>: 'c', <__main__.C object at 0x7f0823ae2d60>: 'd'} # note that the dict has 2 elements
I experimented a little more and found that if I override the __eq__
method such that all the instances of the class compare equal, then the dict
only allows one instance.
class D: def __hash__(self): return 42 def __eq__(self, other): return True p, q = D(), D() y = {p: 'p', q: 'q'} print(y) # {<__main__.D object at 0x7f0823a9af40>: 'q'} # note that the dict only has 1 element
So I am curious to know how a dict
can have multiple elements with the same hash.
Python dictionary multiple keys with same name In Python dictionary, if you want to display multiple keys with the same value then you have to use the concept of for loop and list comprehension method. Here we create a list and assign them a value 2 which means the keys will display the same name two times.
Python uses a method called Open Addressing for handling collisions. It also resizes the hash tables when it reaches a certain size, but we won't discuss that aspect. Open Addressing definition from Wikipedia: In another strategy, called open addressing, all entry records are stored in the bucket array itself.
In Python, the Dictionary data types represent the implementation of hash tables. The Keys in the dictionary satisfy the following requirements. The keys of the dictionary are hashable i.e. the are generated by hashing function which generates unique result for each unique value supplied to the hash function.
A dictionary is a data structure that maps keys to values. A hash table is a data structure that maps keys to values by taking the hash value of the key (by applying some hash function to it) and mapping that to a bucket where one or more values are stored.
Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know; but the answer is comprehensive). A shout out to Duncan for pointing out that Python dicts use slots and leading me down this rabbit hole.
O(1)
lookup by index). The figure below is a logical representation of a python hash table. In the figure below, 0, 1, ..., i, ... on the left are indices of the slots in the hash table (they are just for illustrative purposes and are not stored along with the table obviously!).
# Logical model of Python Hash table -+-----------------+ 0| <hash|key|value>| -+-----------------+ 1| ... | -+-----------------+ .| ... | -+-----------------+ i| ... | -+-----------------+ .| ... | -+-----------------+ n| ... | -+-----------------+
When a new dict is initialized it starts with 8 slots. (see dictobject.h:49)
i
that is based on the hash of the key. CPython uses initial i = hash(key) & mask
. Where mask = PyDictMINSIZE - 1
, but that's not really important). Just note that the initial slot, i, that is checked depends on the hash of the key.<hash|key|value>
). But what if that slot is occupied!? Most likely because another entry has the same hash (hash collision!)==
comparison not the is
comparison) of the entry in the slot against the key of the current entry to be inserted (dictobject.c:337,344-345). If both match, then it thinks the entry already exists, gives up and moves on to the next entry to be inserted. If either hash or the key don't match, it starts probing. There you go! The Python implementation of dict checks for both hash equality of two keys and the normal equality (==
) of the keys when inserting items. So in summary, if there are two keys, a
and b
and hash(a)==hash(b)
, but a!=b
, then both can exist harmoniously in a Python dict. But if hash(a)==hash(b)
and a==b
, then they cannot both be in the same dict.
Because we have to probe after every hash collision, one side effect of too many hash collisions is that the lookups and insertions will become very slow (as Duncan points out in the comments).
I guess the short answer to my question is, "Because that's how it's implemented in the source code ;)"
While this is good to know (for geek points?), I am not sure how it can be used in real life. Because unless you are trying to explicitly break something, why would two objects that are not equal, have same hash?
For a detailed description of how Python's hashing works see my answer to Why is early return slower than else?
Basically it uses the hash to pick a slot in the table. If there is a value in the slot and the hash matches, it compares the items to see if they are equal.
If the hash matches but the items aren't equal, then it tries another slot. There's a formula to pick this (which I describe in the referenced answer), and it gradually pulls in unused parts of the hash value; but once it has used them all up, it will eventually work its way through all slots in the hash table. That guarantees eventually we either find a matching item or an empty slot. When the search finds an empty slot, it inserts the value or gives up (depending whether we are adding or getting a value).
The important thing to note is that there are no lists or buckets: there is just a hash table with a particular number of slots, and each hash is used to generate a sequence of candidate slots.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With