
Why can a Python dict have multiple keys with the same hash?

Tags: python, dictionary, equality, hash, set

I am trying to understand the Python hash function under the hood. I created a custom class where all instances return the same hash value.

class C:
    def __hash__(self):
        return 42

I just assumed that only one instance of the above class can be in a dict at any time, but in fact a dict can have multiple elements with the same hash.

c, d = C(), C()
x = {c: 'c', d: 'd'}
print(x)
# {<__main__.C object at 0x7f0824087b80>: 'c', <__main__.C object at 0x7f0823ae2d60>: 'd'}
# note that the dict has 2 elements

I experimented a little more and found that if I override the __eq__ method such that all the instances of the class compare equal, then the dict only allows one instance.

class D:
    def __hash__(self):
        return 42
    def __eq__(self, other):
        return True

p, q = D(), D()
y = {p: 'p', q: 'q'}
print(y)
# {<__main__.D object at 0x7f0823a9af40>: 'q'}
# note that the dict only has 1 element

So I am curious to know how a dict can have multiple elements with the same hash.


asked Jan 25 '12 20:01

Praveen Gollakota

2 Answers

Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know; but the answer is comprehensive). A shout out to Duncan for pointing out that Python dicts use slots and leading me down this rabbit hole.

Python dictionaries are implemented as hash tables.
Hash tables must allow for hash collisions, i.e. even if two keys have the same hash value, the implementation of the table must have a strategy to insert and retrieve the key and value pairs unambiguously.
Python dict uses open addressing to resolve hash collisions (explained below) (see dictobject.c:296-297).
A Python hash table is just a contiguous block of memory (sort of like an array, so you can do O(1) lookup by index).
Each slot in the table can store one and only one entry. This is important.
Each entry in the table is actually a combination of three values: <hash, key, value>. This is implemented as a C struct (see dictobject.h:51-56).

The figure below is a logical representation of a python hash table. In the figure below, 0, 1, ..., i, ... on the left are indices of the slots in the hash table (they are just for illustrative purposes and are not stored along with the table obviously!).

# Logical model of Python Hash table
-+-----------------+
0| <hash|key|value>|
-+-----------------+
1|      ...        |
-+-----------------+
.|      ...        |
-+-----------------+
i|      ...        |
-+-----------------+
.|      ...        |
-+-----------------+
n|      ...        |
-+-----------------+

When a new dict is initialized, it starts with 8 slots (see dictobject.h:49).
When adding entries to the table, we start with some slot, i, that is based on the hash of the key. CPython initially uses i = hash(key) & mask, where mask = PyDict_MINSIZE - 1 for a new dict (in general, the table size minus 1), but that's not really important. Just note that the initial slot, i, that is checked depends on the hash of the key.
If that slot is empty, the entry is added to the slot (by entry, I mean <hash|key|value>). But what if that slot is occupied!? Most likely because another entry has the same hash, or a different hash that maps to the same slot (a collision!).
If the slot is occupied, CPython (and even PyPy) compares the hash AND the key of the entry in the slot against the hash and key of the entry to be inserted (dictobject.c:337,344-345). The key comparison is an == comparison, not just an is comparison, although CPython does try the cheap identity check first. If both match, then it concludes that the key already exists and replaces the stored value, keeping the original key object (which is why the dict of D instances in the question ends up with a single entry whose value is 'q'). If either the hash or the key doesn't match, it starts probing.
Probing just means it searches slot by slot to find an empty slot. Technically we could just go one by one, i+1, i+2, ... and use the first available one (that's linear probing). But for reasons explained beautifully in the comments (see dictobject.c:33-126), CPython uses pseudo-random probing: the next slot is picked in a pseudo-random order that is fully determined by the hash. Each occupied slot along the way is checked for a matching key as described above, and the entry is added to the first empty slot. For this discussion, the actual algorithm used to pick the next slot is not really important (see dictobject.c:33-126 for the probing algorithm). What is important is that the slots are probed until the first empty slot is found.
The same thing happens for lookups: it starts with the initial slot i (where i depends on the hash of the key). If the hash and the key don't both match the entry in the slot, it starts probing until it finds either a slot with a match or an empty slot. Reaching an empty slot means the key is not in the dict, and the lookup reports a failure (a KeyError for d[key]). Because the table is never allowed to fill up completely, an empty slot is always found eventually.
BTW, the dict will be resized if it is two-thirds full. This avoids slowing down lookups (see dictobject.h:64-65).
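
The insertion and lookup steps above can be sketched as a toy table in Python. This is only an illustrative model under stated assumptions, not CPython's code: the class name ToyTable and its methods are made up for this sketch, it probes linearly (i+1, i+2, ...) for brevity instead of using CPython's pseudo-random sequence, and it simply doubles in size once it is two-thirds full.

```python
class ToyTable:
    """Toy open-addressing hash table (hypothetical; not CPython's code)."""

    def __init__(self, size=8):
        # Each slot holds one (hash, key, value) entry, or None if empty.
        self.slots = [None] * size
        self.used = 0

    def _find(self, key, h):
        """Return (index, found) for key, probing from slot h & mask."""
        mask = len(self.slots) - 1
        i = h & mask
        while True:
            entry = self.slots[i]
            if entry is None:
                return i, False            # empty slot: key is absent
            # Identity first, then hash, then == on the keys.
            if entry[1] is key or (entry[0] == h and entry[1] == key):
                return i, True
            i = (i + 1) & mask             # ASSUMPTION: linear probing

    def __setitem__(self, key, value):
        h = hash(key)
        i, found = self._find(key, h)
        if found:
            # Existing key: keep the stored key object, replace the value.
            self.slots[i] = (h, self.slots[i][1], value)
            return
        self.slots[i] = (h, key, value)
        self.used += 1
        if self.used * 3 >= len(self.slots) * 2:   # two-thirds full
            self._grow()

    def __getitem__(self, key):
        i, found = self._find(key, hash(key))
        if not found:
            raise KeyError(key)
        return self.slots[i][2]

    def _grow(self):
        """Double the table and re-insert every entry."""
        entries = [e for e in self.slots if e is not None]
        self.slots = [None] * (len(self.slots) * 2)
        for h, k, v in entries:
            i, _ = self._find(k, h)
            self.slots[i] = (h, k, v)


table = ToyTable()
table[1] = "a"   # hash(1) == 1, so the initial slot is 1 & 7 == 1
table[9] = "b"   # 9 & 7 == 1 too: slot 1 is occupied, so probe onward
print(table[1], table[9])                     # a b
print(table.slots[1][1], table.slots[2][1])   # 1 9
```

Note that the keys 1 and 9 have different hashes yet still compete for the same initial slot, because only the low bits of the hash pick the slot.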

There you go! The Python implementation of dict checks for both hash equality of two keys and the normal equality (==) of the keys when inserting items. So in summary, if there are two keys, a and b, and hash(a)==hash(b), but a!=b, then both can exist harmoniously in a Python dict. But if hash(a)==hash(b) and a==b, then they are treated as the same key, so they cannot both be in the same dict.
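
A quick way to check this summary is a key class with a deliberately constant hash but value-based equality. The class name Name and its text attribute are made up for this sketch; the behaviour shown is what I would expect from CPython's dict.

```python
class Name:
    """Hypothetical key type: constant hash, equality based on text."""

    def __init__(self, text):
        self.text = text

    def __hash__(self):
        return 42                      # every instance collides

    def __eq__(self, other):
        return isinstance(other, Name) and self.text == other.text


a, b = Name("a"), Name("b")
d = {a: 1, b: 2}
print(hash(a) == hash(b), a == b, len(d))   # True False 2

a2 = Name("a")                         # equal to a, but a distinct object
d[a2] = 3                              # same hash and a2 == a: same key
print(len(d), d[a])                    # 2 3
print(any(k is a2 for k in d))         # False: the original key object is kept
```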

Because we have to probe after every hash collision, one side effect of too many hash collisions is that the lookups and insertions will become very slow (as Duncan points out in the comments).
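
The cost of collisions can be made visible without timing anything, by counting how often == is invoked while a dict is built. This is a rough sketch: the Counted class and the comparisons_for helper are invented for the illustration, and the exact counts are an implementation detail, so only the relative difference matters.

```python
class Counted:
    """Hypothetical key type that counts calls to __eq__."""

    eq_calls = 0

    def __init__(self, n, h):
        self.n = n
        self.h = h

    def __hash__(self):
        return self.h

    def __eq__(self, other):
        Counted.eq_calls += 1
        return self.n == other.n


def comparisons_for(keys):
    """Build a dict from keys and return how many == calls that took."""
    Counted.eq_calls = 0
    built = {k: None for k in keys}
    assert len(built) == len(keys)     # all keys are distinct
    return Counted.eq_calls


n = 200
colliding = comparisons_for([Counted(i, 42) for i in range(n)])  # one shared hash
spread = comparisons_for([Counted(i, i) for i in range(n)])      # distinct hashes
print(colliding, spread)   # the first number is far larger than the second
```

With a shared hash, every new key has to be compared against the keys already sitting along its probe path, so the work grows much faster than the number of keys.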

I guess the short answer to my question is, "Because that's how it's implemented in the source code ;)"

While this is good to know (for geek points?), I am not sure how it can be used in real life. Because unless you are trying to explicitly break something, why would two objects that are not equal have the same hash?

answered Oct 20 '22 15:10

Praveen Gollakota

For a detailed description of how Python's hashing works, see my answer to "Why is early return slower than else?"

Basically it uses the hash to pick a slot in the table. If there is a value in the slot and the hash matches, it compares the items to see if they are equal.

If the hash matches but the items aren't equal, then it tries another slot. There's a formula to pick this (which I describe in the referenced answer), and it gradually pulls in unused parts of the hash value; but once it has used them all up, it will eventually work its way through all slots in the hash table. That guarantees that we eventually find either a matching item or an empty slot. When the search finds an empty slot, it inserts the value or gives up (depending on whether we are adding or getting a value).

The important thing to note is that there are no lists or buckets: there is just a hash table with a particular number of slots, and each hash is used to generate a sequence of candidate slots.
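
The idea of a per-hash sequence of candidate slots can be sketched as a small generator. It is modelled on the recurrence described in the dictobject.c comments (multiply by 5, add 1, mix in higher bits of the hash five at a time). Treat it as an approximation: the function name probe_sequence is made up, the exact order of operations differs between CPython versions, and the hash is assumed to be non-negative.

```python
from itertools import islice

PERTURB_SHIFT = 5


def probe_sequence(h, table_size):
    """Yield candidate slots for a non-negative hash h (illustrative only).

    table_size is assumed to be a power of two, so mask selects low bits.
    """
    mask = table_size - 1
    perturb = h
    i = h & mask                          # initial slot: low bits of the hash
    while True:
        yield i
        perturb >>= PERTURB_SHIFT         # pull in the next bits of the hash
        i = (5 * i + perturb + 1) & mask  # once perturb is 0: i = 5*i + 1


# 2 and 42 share their low three bits, so they start in the same slot,
# but the higher bits of 42 send the two sequences different ways.
print(list(islice(probe_sequence(2, 8), 4)))
print(list(islice(probe_sequence(42, 8), 4)))
```

Once perturb has been shifted down to zero, the recurrence i = 5*i + 1 modulo a power of two cycles through every slot, which is what guarantees that the search eventually reaches a matching item or an empty slot.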

answered Oct 20 '22 16:10

Duncan
