Does anyone know how the built-in dictionary type for Python is implemented? My understanding is that it is some sort of hash table, but I haven't been able to find any sort of definitive answer.


How are Python's Built In Dictionaries Implemented?

1 Answer

Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know; but the answer is comprehensive).

Python dictionaries are implemented as hash tables.
Hash tables must allow for hash collisions, i.e. even if two distinct keys have the same hash value, the table's implementation must have a strategy to insert and retrieve the key-value pairs unambiguously.
Python's dict uses open addressing to resolve hash collisions (explained below) (see dictobject.c:296-297).
A Python hash table is just a contiguous block of memory (sort of like an array, so you can do an O(1) lookup by index).
Each slot in the table can store one and only one entry. This is important.
Each entry in the table is actually a combination of three values: <hash, key, value>. This is implemented as a C struct (see dictobject.h:51-56).
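To make the collision point concrete, here is a small sketch. SameHash is a hypothetical class I made up for illustration; it forces every instance to report the same hash value, yet the dict still stores and retrieves the entries unambiguously because it also compares the keys.

```python
class SameHash:
    """Hypothetical key type whose instances all share one hash value."""

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return 42  # deliberately force a collision for every instance

    def __eq__(self, other):
        return isinstance(other, SameHash) and self.name == other.name


a, b = SameHash("a"), SameHash("b")
d = {a: 1, b: 2}

print(hash(a) == hash(b))  # True: the two keys collide
print(len(d))              # 2: both entries are still stored separately
print(d[a], d[b])          # 1 2
print(d[SameHash("a")])    # 1: an equal key finds the same entry
```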

The figure below is a logical representation of a Python hash table. The numbers 0, 1, ..., i, ... on the left are the indices of the slots in the hash table (they are for illustration only and are not stored with the table).

  # Logical model of Python Hash table
  -+-----------------+
  0| <hash|key|value>|
  -+-----------------+
  1|      ...        |
  -+-----------------+
  .|      ...        |
  -+-----------------+
  i|      ...        |
  -+-----------------+
  .|      ...        |
  -+-----------------+
  n|      ...        |
  -+-----------------+
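The same logical model can be sketched in plain Python. This is only an illustration of the figure, not how CPython lays out memory; Entry, hash_value, NUM_SLOTS and table are names I picked for the sketch, and the slot index used here is arbitrary.

```python
from typing import Any, NamedTuple


class Entry(NamedTuple):
    """One occupied slot: the cached hash plus the key and the value."""
    hash_value: int
    key: Any
    value: Any


NUM_SLOTS = 8  # a new dict starts with 8 slots (see below)

# None marks an empty slot; the list index plays the role of 0, 1, ..., n.
table = [None] * NUM_SLOTS

key = "spam"
# Slot 3 is chosen arbitrarily here; how the real slot is picked is covered below.
table[3] = Entry(hash(key), key, "eggs")

for index, slot in enumerate(table):
    print(index, slot)
```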

When a new dict is initialized, it starts with 8 slots (see dictobject.h:49).
When adding entries to the table, we start with some slot, i, that is based on the hash of the key. CPython initially uses i = hash(key) & mask (where mask = PyDict_MINSIZE - 1 for a new table, but that's not really important). Just note that the initial slot, i, that is checked depends on the hash of the key.
If that slot is empty, the entry is added to the slot (by entry, I mean <hash|key|value>). But what if that slot is occupied? Most likely another entry has the same hash, or a hash that maps to the same slot (a hash collision!).
If the slot is occupied, CPython (and even PyPy) compares the hash AND the key (by compare I mean an == comparison, not an is comparison) of the entry in the slot against the hash and key of the entry being inserted (dictobject.c:337,344-345). If both match, the entry already exists: its value is replaced and no new slot is used. If either the hash or the key doesn't match, it starts probing.
Probing just means it searches slot by slot to find an empty slot. Technically we could just go one by one, i+1, i+2, ... and use the first available one (that's linear probing). But for reasons explained beautifully in the comments (see dictobject.c:33-126), CPython uses random probing. In random probing, the next slot is picked in a pseudo-random order. The entry is added to the first empty slot. For this discussion, the actual algorithm used to pick the next slot is not really important (see dictobject.c:33-126 for the probing algorithm). What is important is that the slots are probed until the first empty slot is found.
The same thing happens for lookups: the search starts at the initial slot i (where i depends on the hash of the key). If the hash or the key doesn't match the entry in the slot, it starts probing until it finds a slot with a match. If it reaches an empty slot first, the key is not in the table and the lookup fails.
BTW, the dict is resized when it becomes two-thirds full. This avoids slowing down lookups (see dictobject.h:64-65).
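The insert, probe, lookup and resize steps above can be sketched as a toy table in pure Python. This is not CPython's code. ToyDict and its method names are hypothetical, the probe recurrence and PERTURB_SHIFT = 5 are modeled on my reading of the dictobject.c comments and should be treated as assumptions, and doubling the table on resize is my own simplification.

```python
PERTURB_SHIFT = 5  # assumed constant, modeled on the dictobject.c comments


class ToyDict:
    """Illustrative open-addressing hash table; not CPython's actual code."""

    def __init__(self, size=8):
        self.slots = [None] * size  # each slot is None or (hash, key, value)
        self.used = 0

    def _probe(self, h):
        """Yield the initial slot, then pseudo-randomly ordered slots."""
        mask = len(self.slots) - 1           # size is always a power of two
        perturb = h & 0xFFFFFFFFFFFFFFFF     # treat the hash as unsigned
        i = perturb & mask                   # initial slot: hash(key) & mask
        while True:
            yield i
            perturb >>= PERTURB_SHIFT
            i = (5 * i + perturb + 1) & mask

    def _find(self, key, h):
        """Return the index of the matching slot, or of the first empty one."""
        for i in self._probe(h):
            entry = self.slots[i]
            if entry is None:
                return i                     # empty slot: key is not present
            if entry[0] == h and (entry[1] is key or entry[1] == key):
                return i                     # hash AND key both match

    def __setitem__(self, key, value):
        h = hash(key)
        i = self._find(key, h)
        if self.slots[i] is None:
            self.used += 1
        self.slots[i] = (h, key, value)      # insert, or replace the value
        if self.used * 3 >= len(self.slots) * 2:  # two-thirds full
            self._resize()

    def __getitem__(self, key):
        entry = self.slots[self._find(key, hash(key))]
        if entry is None:
            raise KeyError(key)
        return entry[2]

    def _resize(self):
        """Double the table (an assumption) and re-insert every entry."""
        old = [e for e in self.slots if e is not None]
        self.slots = [None] * (len(self.slots) * 2)
        for h, key, value in old:
            self.slots[self._find(key, h)] = (h, key, value)


d = ToyDict()
for n in range(20):
    d[n * 8] = n      # multiples of 8 all start at slot 0 in an 8-slot table
d[0] = "replaced"     # existing key: the value is overwritten in place

print(d[0], d[16], d[152])   # replaced 2 19
print(d.used, len(d.slots))  # 20 32
```

Because the table is resized before it fills up, there is always at least one empty slot, so a failed lookup stops at an empty slot rather than scanning forever.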

NOTE: I did the research on the Python dict implementation in response to my own question about how multiple entries in a dict can have the same hash value. I posted a slightly edited version of that response here because the research is very relevant to this question as well.


answered Sep 26 '22 17:09

Praveen Gollakota


Tags:

python

dictionary

data-structures

ricree
