Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dictionary size reduces upon increasing one element

I ran this:

import sys

diii = {'key1':1,'key2':2,'key3':1,'key4':2,'key5':1,'key6':2,'key7':1}
print sys.getsizeof(diii)
# output: 1048

diii = {'key1':1,'key2':2,'key3':1,'key4':2,'key5':1,'key6':2,'key7':1,'key8':2}
print sys.getsizeof(diii)
# output: 664  

Before asking here, I restarted my python shell and tried it online too and got the same result.
I thought a dictionary with one more element will either give same bytes as output or more, than the one containing one less element.

Any idea what am I doing wrong?

like image 442
Sir Nutcase Avatar asked May 26 '19 11:05

Sir Nutcase


1 Answers

Previous answers have already mentioned that you needn't worry, so I will dive into some more technical details. It's long, but please bear with me.

TLDR: this has to do with arithmetic of resizing. Each resize allocates 2**i memory, where 2**i > requested_size; 2**i >= 8, but then each insert resizes the underlying table further if 2/3 of slots are filled, but this time the new_size = old_size * 4. This way, your first dictionary ends up with 32 cells allocated while the second one with as little as 16 (as it got a bigger initial size upfront).

Answer: As @snakecharmerb noted in the comments this depends on the way that the dictionary is created. For the sake of brevity, let me refer you to this, excellent blog post which explains the differences between the dict() constructor and the dict literal {} on both Python bytecode and CPython implementation levels.

Let's start with the magic number of 8 keys. It turns out to be a constant, predefined for Python's 2.7 implementation in dictobject.h headers file - the minimal size of the Python dictionary:

/* PyDict_MINSIZE is the minimum size of a dictionary.  This many slots are
 * allocated directly in the dict object (in the ma_smalltable member).
 * It must be a power of 2, and at least 4.  8 allows dicts with no more
 * than 5 active entries to live in ma_smalltable (and so avoid an
 * additional malloc); instrumentation suggested this suffices for the
 * majority of dicts (consisting mostly of usually-small instance dicts and
 * usually-small dicts created to pass keyword arguments).
 */
#define PyDict_MINSIZE 8

As such, it may differ between the specific Python implementations, but let's assume that we all use the same CPython version. However, the dict of size 8 is expected to neatly contain only 5 elements; don't worry about this, as this specific optimization is not as important for us as it seems.

Now, when you create the dictionary using the dict literal {}, CPython takes a shortcut (as compared to the explicit creation when calling dict constructor). Simplifying a bit the bytecode operation BUILD_MAP gets resolved and it results in calling the _PyDict_NewPresized function which will construct a dictionary for which we already know the size upfront:

/* Create a new dictionary pre-sized to hold an estimated number of elements.
   Underestimates are okay because the dictionary will resize as necessary.
   Overestimates just mean the dictionary will be more sparse than usual.
*/

PyObject *
_PyDict_NewPresized(Py_ssize_t minused)
{
    PyObject *op = PyDict_New();

    if (minused>5 && op != NULL && dictresize((PyDictObject *)op, minused) == -1) {
        Py_DECREF(op);
        return NULL;
    }
    return op;
}

This function calls the normal dict constructor (PyDict_New) and requests a resize of the newly created dict - but only if it is expected to hold more than 5 elements. This is due to an optimization which allows Python to speed up some things by holding the data in the pre-allocated "smalltable", without invoking expensive memory allocation and de-allocation functions.

Then, the dictresize will try to determine the minimal size of the new dictionary. It will also use the magic number 8 - as the starting point and iteratively multiply by 2 until it finds the minimal size larger than the requested size. For the first dictionary, this is simply 8, however, for the second one (and all dictionaries created by dict literal with less than 15 keys) it is 16.

Now, in the dictresize function there is a special case for the former, smaller new_size == 8, which is meant to bring forward the aforementioned optimization (using the "small table" to reduce memory manipulation operations). However, because there is no need to resize the newly created dict (e.g. no elements were removed so far thus the table is "clean") nothing really happens.

On the contrary, when the new_size != 8, a usual procedure of reallocating the hash table follows. This ends up with a new table being allocated to store the "big" dictionary. While this is intuitive (the bigger dict got a bigger table), this does not seem to move us forward to the observed behavior yet - but, please bear with me one more moment.

Once we have the pre-allocated dict, STORE_MAP optcodes tell the interpreter to insert consecutive key-value pairs. This is implemented with dict_set_item_by_hash_or_entry function, which - importantly - resizes the dictionary after each increase in size (i.e. successful insertion) if more than 2/3 of the slots are already used up. The size will increase x4 (in our case, for large dicts only by x2).

So here is what happens when you create the dict with 7 elements:

# note 2/3 = 0.(6)
BUILD_MAP   # initial_size = 8, filled = 0
STORE_MAP   # 'key_1' ratio_filled = 1/8 = 0.125, not resizing
STORE_MAP   # 'key_2' ratio_filled = 2/8 = 0.250, not resizing
STORE_MAP   # 'key_3' ratio_filled = 3/8 = 0.375, not resizing
STORE_MAP   # 'key_4' ratio_filled = 4/8 = 0.500, not resizing
STORE_MAP   # 'key_5' ratio_filled = 5/8 = 0.625, not resizing
STORE_MAP   # 'key_6' ratio_filled = 6/8 = 0.750, RESIZING! new_size = 8*4 = 32
STORE_MAP   # 'key_7' ratio_filled = 7/32 = 0.21875

And you end up with a dict having a total size of 32 elements in the hash table.

However, when adding eight elements the initial size will be twice bigger (16), thus we will never resize as the condition ratio_filled > 2/3 will never be satisfied!

And that's why you end up with a smaller table in the second case.

like image 168
krassowski Avatar answered Oct 23 '22 15:10

krassowski