How does Python store dict keys and values when a collision occurs in the hash table? What hash algorithm is used to compute the hash value here?


How Python dict stores key, value when collision occurs? [duplicate]

2 Answers

For "normal" Python (CPython), this great writeup by Praveen Gollakota explains it very well. Here are the important bits:

Python dictionaries are implemented as hash tables. Hash tables consist of slots, and keys are mapped to the slots via a hashing function.
Hash table implementations must allow for hash collisions, i.e. even if two keys have the same hash value, the table must have a strategy to insert and retrieve the key and value pairs unambiguously.
Python dict uses open addressing to resolve hash collisions (see dictobject.c:296-297).
In open addressing, hash collisions are resolved by probing (explained below).
The hash table is just a contiguous block of memory (like an array, so you can do O(1) lookup by index).
Each slot in the hash table can store one and only one entry. This is important.
Each entry in the table is actually a combination of three items: <hash, key, value>. This is implemented as a C struct (see dictobject.h:51-56).
When a new dict is initialized, it starts with 8 slots (see dictobject.h:49).
When adding entries to the table, we start with some slot i that is based on the hash of the key. CPython initially uses i = hash(key) & mask, where mask is the table size minus one (PyDict_MINSIZE - 1 for a new dict), but that's not really important. Just note that the initial slot i that is checked depends on the hash of the key.
If that slot is empty, the entry is added to the slot (by entry, I mean <hash, key, value>). But what if that slot is occupied? Most likely another entry maps to the same slot, either because it has the same hash or because its hash shares the same low bits (a collision!).
If the slot is occupied, CPython (and even PyPy) compares the hash and the key (by compare I mean == comparison, not the is comparison) of the entry in the slot against the hash and key of the entry to be inserted (dictobject.c:337,344-345). If both match, the key is already present, so the existing entry's value is replaced and no new slot is used. If either the hash or the key doesn't match, it starts probing.
Probing just means it searches slot by slot to find an empty slot. Technically we could just go one by one, i+1, i+2, ... and use the first available one (that's linear probing). But for reasons explained beautifully in the comments (see dictobject.c:33-126), CPython uses random probing. In random probing, the next slot is picked in a pseudo-random order that is computed from the hash. The entry is added to the first empty slot. For this discussion, the actual algorithm used to pick the next slot is not really important (see dictobject.c:33-126 for the probing algorithm). What is important is that the slots are probed until the first empty slot is found.
The same thing happens for lookups: it starts with the initial slot i (where i depends on the hash of the key). If either the hash or the key doesn't match the entry in the slot, it starts probing until it finds a slot with a match. If it reaches an empty slot first, the key is not in the table and the lookup fails (a KeyError for d[key]).
To avoid slowing down lookups, the dict will be resized when it is two-thirds full (see dictobject.h:64-65).
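
The steps above can be sketched in pure Python. This is only a toy model, not CPython's actual code: ToyDict and probe_sequence are hypothetical names, deletion is left out, and the probe recurrence is modeled on the one described in the dictobject.c comments (with PERTURB_SHIFT = 5). It assumes CPython's behaviour that small non-negative ints hash to themselves, so 1 and 9 both start at slot 1 in an 8-slot table.

```python
from itertools import islice

PERTURB_SHIFT = 5  # shift constant from CPython's probe recurrence


def probe_sequence(h, size):
    """Yield slot indices for hash h; size must be a power of two."""
    mask = size - 1
    perturb = h & 0xFFFFFFFFFFFFFFFF  # treat the hash as unsigned
    i = perturb & mask                # initial slot: hash & mask
    while True:
        yield i
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask


class ToyDict:
    """Toy open-addressing table (illustration only, no deletion)."""

    def __init__(self, size=8):
        self.size = size
        self.slots = [None] * size  # one (hash, key, value) entry per slot
        self.used = 0

    def _find(self, key):
        """Return (slot index, hash, found) for key."""
        h = hash(key)
        for i in probe_sequence(h, self.size):
            entry = self.slots[i]
            if entry is None:                    # empty slot ends the probe
                return i, h, False
            if entry[0] == h and entry[1] == key:  # hash and key both match
                return i, h, True

    def __setitem__(self, key, value):
        i, h, found = self._find(key)
        if not found:
            self.used += 1
        self.slots[i] = (h, key, value)  # new entry, or value replaced
        if self.used * 3 >= self.size * 2:  # two-thirds full: grow
            self._resize()

    def __getitem__(self, key):
        i, _, found = self._find(key)
        if not found:
            raise KeyError(key)
        return self.slots[i][2]

    def _resize(self):
        old = [e for e in self.slots if e is not None]
        self.size *= 2
        self.slots = [None] * self.size
        self.used = 0
        for _, k, v in old:  # re-insert every entry into the larger table
            self[k] = v


t = ToyDict()
t[1] = "a"
t[9] = "b"  # collides with key 1 at slot 1, so it is placed by probing
print(list(islice(probe_sequence(9, 8), 4)))  # [1, 6, 7, 4]
print(t[1], t[9])
```

Because the table is resized before it fills up, every probe sequence eventually reaches an empty slot, which is what lets both insertion and a failed lookup terminate.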

193

answered Sep 22 '22 03:09

Burhan Khalid

The short version: The Python spec doesn't specify a dictionary implementation, but CPython uses a hash map and handles collisions with open addressing.

See this answer to a similar question and also the Wikipedia page on hash tables.
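
To see the collision handling from the outside, here is a small sketch using a real dict. SameHash is a hypothetical key class, written only for this illustration, whose instances all return the same hash value, so every key collides; the dict still keeps the keys apart by comparing them with ==.

```python
class SameHash:
    """Hypothetical key type: every instance has the same hash."""

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return 42  # force a collision for every key

    def __eq__(self, other):
        return isinstance(other, SameHash) and self.name == other.name


a, b = SameHash("a"), SameHash("b")
d = {a: 1, b: 2}

print(hash(a) == hash(b))   # True: the hashes collide
print(len(d))               # 2: both keys are stored separately
print(d[SameHash("a")])     # 1: found via == on the key

d[SameHash("a")] = 10       # equal key: the value is replaced
print(len(d), d[a])         # 2 10
```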

answered Sep 19 '22 03:09

Andrew Gorcester



Tags:

python

dictionary

hashtable

ShanmugavelSubramani
