The amount of memory a Python set spends increases in steps

Tags:

python

numpy

I ran an experiment to measure how much memory each of Python's array-like types uses: list, tuple, set, dict, and np.array. I got the following result.

[Plot: the amount of memory each Python array type uses; x axis is the length of the array, y axis is the memory size.]

I found that the amount of memory a Python set uses increases in steps (a dict does too), while the others grow linearly as I expected. I wonder what makes them different.

I used the following get_size() function (reference):

import sys

def get_size(obj, seen=None):
    """Recursively estimate the memory footprint of obj in bytes."""
    size = sys.getsizeof(obj)
    if seen is None:
        seen = set()
    obj_id = id(obj)
    # Skip objects that have already been counted (handles shared references).
    if obj_id in seen:
        return 0
    seen.add(obj_id)
    if isinstance(obj, dict):
        # Recurse into both the values and the keys of a dict.
        size += sum(get_size(v, seen) for v in obj.values())
        size += sum(get_size(k, seen) for k in obj.keys())
    elif hasattr(obj, '__dict__'):
        # Include the attribute dict of a custom object.
        size += get_size(obj.__dict__, seen)
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        # Include the elements of other iterables (list, tuple, set, ndarray, ...).
        size += sum(get_size(i, seen) for i in obj)
    return size

I measured the memory usage for lengths from 0 to 10,000 in steps of 100.

My code: https://repl.it/repls/WanEsteemedLines
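
For reference, a minimal sketch of the kind of measurement loop described above (the actual repl.it code may differ; the container contents and step size are assumptions based on the description, and it relies on the get_size() function defined earlier):

import numpy as np

# For each length n, build each container from the same integers and record
# get_size(container). Assumes get_size() from above is in scope.
lengths = range(0, 10001, 100)
results = {"list": [], "tuple": [], "set": [], "dict": [], "np.array": []}

for n in lengths:
    data = list(range(n))
    results["list"].append(get_size(data))
    results["tuple"].append(get_size(tuple(data)))
    results["set"].append(get_size(set(data)))
    results["dict"].append(get_size({i: i for i in data}))
    results["np.array"].append(get_size(np.array(data)))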

Asked by Ellisein on Oct 16 '18
1 Answer

CPython sets and dicts always use a power-of-two-sized internal hash table. list, tuple, and numpy.ndarray all have more flexibility regarding the size of their underlying memory buffer, but set and dict are hardcoded to use power-of-two table sizes. The implementation cannot function with a non-power-of-two table size. See Objects/dictobject.c and Objects/setobject.c.

The jumps in your graph are when the table size jumps to a new power of two.
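
You can see the resizing directly with sys.getsizeof, without the recursive helper (a quick CPython-specific check; the exact byte counts and jump points depend on the interpreter version):

import sys

# The reported size of a growing set stays flat while elements are added and
# jumps each time the internal hash table grows to a larger power-of-two size.
s = set()
last = sys.getsizeof(s)
for i in range(10000):
    s.add(i)
    size = sys.getsizeof(s)
    if size != last:
        print(f"len={len(s):6d}: {last} -> {size} bytes")
        last = size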

Incidentally, your get_size doesn't work very well. For example, it has two bugs affecting the numpy.ndarray case that almost cancel out (but don't quite). It tries to add the sizes of the elements of a NumPy array to the size of the whole array, but for a NumPy array, the sizes of the elements are already accounted for by getsizeof. Also, it's determining object identity with id, but the objects produced by iterating over a NumPy array are created on the fly and die immediately, so their id values aren't unique. In practice, this probably overcounts the size by one or two times the size of the objects representing array elements.
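
Both effects are easy to observe (a small illustration, not part of the original answer; exact numbers depend on the platform and NumPy version):

import sys
import numpy as np

a = np.arange(1000)

# getsizeof on an ndarray already includes its data buffer, so adding the
# sizes of individual elements on top of it double-counts the data.
print(sys.getsizeof(a))      # roughly the array header plus 1000 * 8 bytes
print(sys.getsizeof(a[0]))   # a temporary NumPy scalar, counted again by get_size

# Iterating over the array creates scalar objects on the fly; each one dies
# right away, so CPython can reuse the same memory and id() values repeat.
print(len({id(x) for x in a}))   # typically far smaller than len(a)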

Answered by user2357112 supports Monica