
Memory usage: creating one big set vs merging many small sets

I used the %memit magic function to measure memory usage:

In [1]: %memit n = pow(10, 7); range(n)
peak memory: 568 MiB, increment: 272 MiB

In [2]: %memit n = pow(10, 7); set(xrange(n))
peak memory: 824 MiB, increment: 447 MiB

OK so there seems to be an intermediate step where xrange(n) is instantiated into a full list. But what if I cut my list into 10 sublists, and union them one by one? This would be more memory efficient, right?

In [3]: %memit n = pow(10, 7); reduce(set.union, (set(xrange(p, n, 10)) for p in range(10)))
peak memory: 1260 MiB, increment: 897 MiB

Well, that didn't go as expected. Why does the reduce approach consume more memory than set(xrange(n))?

asked Sep 05 '15 by usual me


2 Answers

Per the docs, reduce is roughly equivalent to:

def reduce(function, iterable, initializer=None):
    it = iter(iterable)
    if initializer is None:
        try:
            # With no explicit initializer, the first item seeds the accumulator.
            initializer = next(it)
        except StopIteration:
            raise TypeError('reduce() of empty sequence with no initial value')
    accum_value = initializer
    for x in it:
        # Each step rebinds accum_value to a brand-new object returned by
        # function; the old accumulator only becomes garbage at this point.
        accum_value = function(accum_value, x)
    return accum_value

Iterating over the iterable, (set(xrange(p, n, 10)) for p in range(10)), requires about 447 MiB. You might think that because this iterable is a generator expression you will save memory, but integer objects are kept on an internal free list:

“For speed”, Python maintains an internal free list for integer objects. Unfortunately, that free list is both immortal and unbounded in size.

So once each set has been instantiated, most of the memory it consumes is never freed.
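A minimal sketch of that effect (Python 2; the exact numbers depend on platform and build):

nums = list(xrange(10 ** 7))  # allocates ~10^7 int objects
del nums                      # the list itself is freed, but the ints
                              # stay on the internal free list, so the
                              # process footprint reported by %memit
                              # barely drops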

The returned value accum_value also requires about 447 MiB. Together, the call to reduce thus requires roughly 447 + 447 = 894 MiB, which agrees well with the measured increment of 897 MiB.
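To watch the accumulator grow, you can wrap set.union in a small tracing helper (traced_union is a hypothetical name used only for illustration, and n is scaled down so the trace runs quickly):

def traced_union(accum, s):
    # Build the new accumulator and report the element counts involved.
    result = set.union(accum, s)
    print len(accum), '+', len(s), '->', len(result)
    return result

n = pow(10, 6)
reduce(traced_union, (set(xrange(p, n, 10)) for p in range(10)))

On the last step the accumulator already holds 90 % of the elements, yet the union must allocate room for all of them anew.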

answered Oct 20 '22 by unutbu


There are several misconceptions here to clear up. First of all, an xrange is not turned into a list before it is made into a set in set(xrange(n))!
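You can convince yourself of this with sys.getsizeof (Python 2; the figures below assume a 64-bit build and are approximate):

import sys

r = xrange(pow(10, 7))
print sys.getsizeof(r)        # tiny (~40 bytes): just start, stop, step
print sys.getsizeof(list(r))  # ~80 MB for the list's pointer array alone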

The peak memory for the reduce approach comes from the fact that set.union returns a new set that is the union of its two operands, while reduce internally keeps hold of an extra value, accum_value. So on the last iteration of the for loop:

accum_value = function(accum_value, x)

the accum_value that goes into the function contains 90 % of the 10⁷ elements, x contains the remaining 10 %, and the return value needs space for all 10⁷ of them. Space for all three must be reserved simultaneously; only after the function has returned is the memory for the old accum_value and x released.
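At that moment all three sets are alive at once, which is roughly what the peak reflects. A sketch (Python 2; peak_union is a hypothetical wrapper, and sys.getsizeof counts only the hash tables, not the integer objects they reference):

import sys

def peak_union(accum, s):
    # accum and s are still alive while the new union is being built.
    result = set.union(accum, s)
    tables = sys.getsizeof(accum) + sys.getsizeof(s) + sys.getsizeof(result)
    print 'tables reserved at once:', tables, 'bytes'
    return result  # only now can accum and s be collected

n = pow(10, 6)
reduce(peak_union, (set(xrange(p, n, 10)) for p in range(10)))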


However, this is not the whole story. Peak memory tells you how much memory was needed while the operation was running, but not how much remains in use after it has completed, while the resulting set is still alive.

Thus a different benchmark is needed:

In [1]: %memit n = pow(10, 7); result = set(xrange(n));
peak memory: 522.22 MiB, increment: 498.93 MiB
In [2]: %memit 42
peak memory: 513.83 MiB, increment: 0.12 MiB
In [3]: import sys
In [4]: sys.getsizeof(result)
Out[4]: 268435688

and

In [1]: %memit n = pow(10, 7); result = reduce(set.union, (set(xrange(p, n, 10)) for p in range(10)));
peak memory: 1063.20 MiB, increment: 1039.71 MiB
In [2]: %memit 42
peak memory: 779.77 MiB, increment: 0.00 MiB
In [3]: import sys    
In [4]: sys.getsizeof(result)
Out[4]: 536871144
In [5]: 536871144.0 / 268435688
Out[5]: 1.9999991357333977

So the memory usage of the reduce approach drops to about 780 MiB after execution, yet it is still higher than in the simpler set-construction case. And as you can see, set(xrange(n)) needs little scratch memory of its own: usage drops by a mere 9 MiB once the set has been constructed.

Most notably, not only is the reduce approach slower; the resulting set also consumes twice as much memory. This is because a set is backed by a hash table, which is grown to a larger size whenever it is deemed too full. You have hit pathological behaviour: two sets of exactly the same values, one of which takes twice as much memory as the other.
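One way to see this directly is to build the same set both ways and compare the table sizes (Python 2; whether the two-fold difference shows up depends on n and on how the tables happen to be grown):

import sys

n = pow(10, 7)
a = set(xrange(n))
b = reduce(set.union, (set(xrange(p, n, 10)) for p in range(10)))

assert a == b                 # identical contents...
print sys.getsizeof(a)        # ...but b's backing hash table can be
print sys.getsizeof(b)        # twice the size of a's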