Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Non-monotonic memory consumption in Python2 dictionaries

Can somebody explain this non-monotonic memory usage of a dictionary in CPython 2.7?

>>> import sys
>>> sys.getsizeof({})
280
>>> sys.getsizeof({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5})
280
>>> sys.getsizeof({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6})
1048
>>> sys.getsizeof({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7})
1048
>>> sys.getsizeof({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'e
ight': 8})
664
>>> sys.getsizeof({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'e
ight': 8, 'nine': 9})
664

Python3 is reasonable here, it prints the size of {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7} as 480.

I tried this on Ubuntu 15.10 and OS X 10.11.

like image 435
Dmytro Sirenko Avatar asked Apr 30 '16 10:04

Dmytro Sirenko


1 Answers

TLDR: The 6- and 7-entry dict literals presize the hash table badly and then quadruple the size on resize.


When CPython 2.7 evaluates a dict literal, before it starts filling in entries, the opcode it uses to create the dict is BUILD_MAP. This takes one argument, a hint for how many entries the dict will contain, which it uses to presize the dict:

    TARGET(BUILD_MAP)
    {
        x = _PyDict_NewPresized((Py_ssize_t)oparg);
        PUSH(x);
        if (x != NULL) DISPATCH();
        break;
    }

This is intended to minimize the number of times the dict is resized during creation, but since they didn't account for the load factor, it doesn't quite eliminate resizes.

As the source code comments indicate, _PyDict_NewPresized is intended to "Create a new dictionary pre-sized to hold an estimated number of elements". The exact size of the hash table in the created dict is influenced by a number of implementation details, such as the minimum size (#define PyDict_MINSIZE 8) and the requirement that the size be a power of 2 (to avoid needing division in the implementation).

For dict literals up to 7 entries, _PyDict_NewPresized initializes an 8-entry hash table; for 8 entries, it initializes a 16-entry hash table, since the resize routine it uses always picks a capacity bigger than the argument.


Dicts resize on insertion when they become at least 2/3 full. For the 6- and 7-entry dict literals, the dict starts off with 8 entries, so a resize occurs on the 6th insertion. The dict is small enough that the resize quadruples the size of the hash table:

return dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);

mp->ma_used is the number of used entries in the hash table, 6 at this point. 6 is less than 50000, so we call dictresize(mp, 4 * 6), which resizes the hash table to 32 entries, the smallest power of 2 greater than 24.

In contrast, for the 8-entry dict literal, the hash table started off with 16 entries. The dict doesn't become 2/3 full during creation, so the initial 16-entry hash table survives the dict creation, and the resulting dict is smaller than with the 6- and 7-entry dict literals.


Python 3 uses a different growth policy, among other dict implementation changes, which is why you saw different results in Python 3.

like image 97
user2357112 supports Monica Avatar answered Sep 19 '22 08:09

user2357112 supports Monica