Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does union consume more memory if the argument is a set?

I'm puzzled by this behaviour of memory allocation of sets:

>>> set(range(1000)).__sizeof__()
32968
>>> set(range(1000)).union(range(1000)).__sizeof__()       # expected, set doesn't change
32968
>>> set(range(1000)).union(list(range(1000))).__sizeof__() #expected, set doesn't change
32968
>>> set(range(1000)).union(set(range(1000))).__sizeof__()  # not expected
65736

Why using a set as argument doubles the amount of memory used by the resulting set? The result in both cases is identical to the original set:

>>> set(range(1000)) == set(range(1000)).union(range(1000)) == set(range(1000)).union(set(range(1000)))
True

Note that the same happens using a normal iterator:

>>> set(range(1000)).union(iter(list(range(1000)))).__sizeof__()
32968

And with the update method:

>>> a.update(range(1000))
>>> a.__sizeof__()
32968
>>> a.update(set(range(1000)))
>>> a.__sizeof__()
65736

At first I thought that it was because when union is called, it sees that the size of the other set is 1000 and thus decides to allocate enough memory to fit all the elements of both sets, but afterwards it uses only part of that memory, while in the iterator case it simply iterators over it and adds the elements one by one(which doesn't consume more memory since all the elements are already in the set).

But range is also a sequence, and so is the list in the first example.

>>> len(range(1000))
1000
>>> range(1000)[100]
100

So why this doesn't happen with range and list but only with set? Is there any design decision behind this or is it a bug?


Tested on python 2.7.3 and python 3.2.3 on linux 64 bit.

like image 419
Bakuriu Avatar asked Mar 04 '13 09:03

Bakuriu


1 Answers

In Python 2.7.3, set.union() delegates to a C function called set_update_internal(). The latter uses several different implementations depending on the Python type of its argument. This multiplicity of implementations is what explains the difference in behaviour between the tests you've conducted.

The implementation that's used when the argument is a set makes the following assumption documented in the code:

/* Do one big resize at the start, rather than
 * incrementally resizing as we insert new keys.  Expect
 * that there will be no (or few) overlapping keys.
 */

Clearly, the assumption of no (or few) overlapping keys is incorrect in your particular case. This is what results in the final set overallocating memory.

I am not sure I would call this a bug though. The implementer of set chose what to me looks like a reasonable tradeoff, and you've simply found yourself on the wrong side of that tradeoff.

The upside of the tradeoff is that in many cases the pre-allocation results in better performance:

In [20]: rhs = list(range(1000))

In [21]: %timeit set().union(rhs)
10000 loops, best of 3: 30 us per loop

In [22]: rhs = set(range(1000))

In [23]: %timeit set().union(rhs)
100000 loops, best of 3: 14 us per loop

Here, the set version is twice as fast, presumably because it doesn't repeatedly reallocate memory as it's adding elements from rhs.

If the overallocation is a deal-breaker, there's a number of ways to work around it, some of which you've already discovered.

like image 102
NPE Avatar answered Sep 28 '22 19:09

NPE