I'm puzzled by this behaviour of the memory allocation of sets:
>>> set(range(1000)).__sizeof__()
32968
>>> set(range(1000)).union(range(1000)).__sizeof__() # expected, set doesn't change
32968
>>> set(range(1000)).union(list(range(1000))).__sizeof__() # expected, set doesn't change
32968
>>> set(range(1000)).union(set(range(1000))).__sizeof__() # not expected
65736
Why does using a set as the argument double the amount of memory used by the resulting set?
The result in both cases is identical to the original set:
>>> set(range(1000)) == set(range(1000)).union(range(1000)) == set(range(1000)).union(set(range(1000)))
True
Note that the same happens using a normal iterator:
>>> set(range(1000)).union(iter(list(range(1000)))).__sizeof__()
32968
And with the update method:
>>> a = set(range(1000))
>>> a.update(range(1000))
>>> a.__sizeof__()
32968
>>> a.update(set(range(1000)))
>>> a.__sizeof__()
65736
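The same doubling appears to happen with a frozenset argument (output from the same build):
>>> set(range(1000)).union(frozenset(range(1000))).__sizeof__()
65736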
At first I thought that when union is called, it sees that the size of the other set is 1000 and decides to allocate enough memory to fit all the elements of both sets, but then only uses part of that memory; in the iterator case, by contrast, it simply iterates over the argument and adds the elements one by one (which doesn't consume more memory, since all the elements are already in the set).
But range is also a sequence, and so is the list in the first example.
>>> len(range(1000))
1000
>>> range(1000)[100]
100
So why does this not happen with range and list, but only with set?
Is there any design decision behind this or is it a bug?
Tested on Python 2.7.3 and Python 3.2.3 on 64-bit Linux.
In Python 2.7.3, set.union() delegates to a C function called set_update_internal(). The latter uses several different implementations depending on the Python type of its argument. This multiplicity of implementations is what explains the difference in behaviour between the tests you've conducted.
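For illustration, here is a rough pure-Python sketch of that dispatch. The function name and structure are hypothetical; the real logic is C code in CPython's Objects/setobject.c:
def update_internal(so, other):
    # Hypothetical pure-Python sketch of the dispatch inside
    # set_update_internal(); the real implementation is in C.
    if isinstance(other, (set, frozenset)):
        # Set path: the C code first does one big resize, sized for
        # len(so) + len(other) entries, then merges entry by entry.
        for item in other:
            so.add(item)
    elif isinstance(other, dict):
        # Dict path: iterate over the keys directly.
        for key in other:
            so.add(key)
    else:
        # Generic iterable path: add elements one at a time; the
        # table grows incrementally, only when it actually fills up.
        for item in other:
            so.add(item)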
The implementation that's used when the argument is a set makes the following assumption, documented in the code:
/* Do one big resize at the start, rather than
* incrementally resizing as we insert new keys. Expect
* that there will be no (or few) overlapping keys.
*/
Clearly, the assumption of no (or few) overlapping keys is incorrect in your particular case. This is what results in the final set overallocating memory.
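The reported sizes are consistent with this explanation. Assuming a 64-bit build where each hash-table slot holds an 8-byte hash and an 8-byte key pointer, plus roughly 200 bytes of fixed per-object overhead (my figures, not taken from your output), the two results correspond to a 2048-slot and a 4096-slot table:
>>> 200 + 2048 * 16   # 2048-slot table, grown incrementally
32968
>>> 200 + 4096 * 16   # 4096-slot table after the one big resize
65736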
I am not sure I would call this a bug, though. The implementer of set chose what looks to me like a reasonable tradeoff, and you've simply found yourself on the wrong side of that tradeoff.
The upside of the tradeoff is that in many cases the pre-allocation results in better performance:
In [20]: rhs = list(range(1000))
In [21]: %timeit set().union(rhs)
10000 loops, best of 3: 30 us per loop
In [22]: rhs = set(range(1000))
In [23]: %timeit set().union(rhs)
100000 loops, best of 3: 14 us per loop
Here, the set version is twice as fast, presumably because it doesn't repeatedly reallocate memory as it's adding elements from rhs.
If the overallocation is a deal-breaker, there are a number of ways to work around it, some of which you've already discovered.
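For instance, wrapping the argument in iter() forces the element-by-element code path, and rebuilding a set from the overallocated result right-sizes the table (a sketch; the sizes assume the same build as your tests):
>>> a = set(range(1000))
>>> a.update(iter(set(range(1000))))   # an iterator, not a set: no big resize
>>> a.__sizeof__()
32968
>>> b = set(range(1000)).union(set(range(1000)))
>>> b.__sizeof__()
65736
>>> set(b).__sizeof__()                # rebuilding compacts the table
32968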