I'm working on a project that involves accessing data from a large list that's kept in memory. Because the list is quite voluminous (millions of lines) I keep an eye on how much memory is being used. I use OS X so I keep Activity Monitor open as I create these lists.
I've noticed that the amount of memory used by a list can vary wildly depending on how it is constructed but I can't seem to figure out why.
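(Side note for anyone reproducing this without Activity Monitor: you can get a rough cross-check from inside the interpreter with the standard resource module. This is just a sketch; ru_maxrss reports the peak resident size so far rather than the current one, and on OS X it is in bytes while on Linux it is in kilobytes.)

import resource

def peak_rss_mb():
    # Peak resident set size of this process so far.
    # On OS X ru_maxrss is in bytes; on Linux it is in kilobytes (divide by 1024 instead).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024.0 * 1024.0)

print(peak_rss_mb())  # compare before and after building the lists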
Now for some example code:
(I am using Python 2.7.4 on OSX 10.8.3)
The first function below creates a list and fills every element with the same tuple of three random numbers.
The second function below creates a list and fills each element with a different tuple of three random numbers.
import random
import sys
def make_table1(size):
    list1 = size * [(float(), float(), float())]  # initialize the list
    line = (random.random(),
            random.random(),
            random.random())
    for count in xrange(0, size):  # Now fill it
        list1[count] = line
    return list1
def make_table2(size):
    list1 = size * [(float(), float(), float())]  # initialize the list
    for count in xrange(0, size):  # Now fill it
        list1[count] = (random.random(),
                        random.random(),
                        random.random())
    return list1
(First let me say that I realize the code above could have been written much more efficiently. It's written this way to keep the two examples as similar as possible.)
Now I create some lists using these functions:
In [2]: thing1 = make_table1(6000000)
In [3]: sys.getsizeof(thing1)
Out[3]: 48000072
At this point my memory used jumps by about 46 MB, which is what I would expect from the information given above.
Now for the next function:
In [4]: thing2 = make_table2(6000000)
In [5]: sys.getsizeof(thing2)
Out[5]: 48000072
As you can see, the memory taken up by the two lists is the same. They are exactly the same length so that's to be expected. What I didn't expect is that my memory used as given by Activity Monitor jumps to over 1 GB!
I understand there is going to be some overhead but 20x as much? 1 GB for a 46MB list?
Seriously?
Okay, on to diagnostics...
The first thing I tried is to collect any garbage:
In [5]: import gc
In [6]: gc.collect()
Out[6]: 0
It made zero difference to the amount of memory used.
Next I used guppy to see where the memory is going:
In [7]: from guppy import hpy
In [8]: hpy().heap()
Out[8]:
Partition of a set of 24217689 objects. Total size = 1039012560 bytes.
 Index     Count   %       Size   %  Cumulative   % Kind (class / dict of class)
     0   6054789  25  484821768  47   484821768  47 tuple
     1  18008261  74  432198264  42   917020032  88 float
     2      2267   0   96847576   9  1013867608  98 list
     3     99032   0   11392880   1  1025260488  99 str
     4       585   0    1963224   0  1027223712  99 dict of module
     5      1712   0    1799552   0  1029023264  99 dict (no owner)
     6     13606   0    1741568   0  1030764832  99 types.CodeType
     7     13355   0    1602600   0  1032367432  99 function
     8      1494   0    1348088   0  1033715520  99 type
     9      1494   0    1300752   0  1035016272 100 dict of type
<691 more rows. Type e.g. '_.more' to view.>
Okay, my memory is taken up by:
462 MB of tuple (huh?)
412 MB of float (what?)
92 MB of list (Okay, this one makes sense. 2*46MB = 92)
My lists are preallocated so I don't think that there is over-allocation going on.
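(For what it's worth, here is a small sketch that backs that up: a list built by repetition reports exactly one pointer slot per element, whereas a list grown by appending usually reports extra over-allocated slots. The exact numbers depend on the Python build.)

import sys

n = 1000000
preallocated = n * [None]          # built by repetition, like the tables above
grown = []
for i in xrange(n):                # built by appending
    grown.append(None)

print(sys.getsizeof(preallocated))  # header + 8*n on a 64-bit build: no spare slots
print(sys.getsizeof(grown))         # usually larger: append over-allocates as it grows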
Questions:
Why is the amount of memory used by these two very similar lists so different?
Is there a different way to populate a list that doesn't have so much overhead?
Is there a way to free up all that memory?
Note: Please don't suggest storing on the disk or using array.array or numpy or pandas data structures. Those are all great options but this question isn't about them. This question is about plain old lists.
I have tried similar code with Python 3.3 and the result is the same.
Here is someone with a similar problem. It contains some hints but it's not the same question.
Thank you all!
Both functions make a list of 6000000 references.
sizeof(thelist) ≅ sizeof(reference_to_a_python_object) * 6000000
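You can check that against the getsizeof figure above (a sketch, assuming a 64-bit build where each reference is an 8-byte pointer; the list-header size is taken from sys.getsizeof([]) at runtime):

import sys

n = 6000000
header = sys.getsizeof([])      # empty-list overhead; 72 bytes on a 64-bit CPython 2.7 build
print(header + 8 * n)           # 48000072 on such a build, matching sys.getsizeof(thing1) above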
The first list contains 6000000 references to one and the same tuple of three floats.
The second list contains references to 6000000 different tuples, which hold 18000000 different floats in total.
As you can see from the heap dump above, a float takes 24 bytes and a triple of floats takes 80 bytes (on your build of Python). No, there's no way around that except numpy.
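Putting those per-object sizes together reproduces the roughly 1 GB that Activity Monitor and guppy report (a back-of-the-envelope sketch using only numbers already shown above):

n_tuples = 6000000
n_floats = 3 * n_tuples              # 18000000 distinct floats in thing2
tuple_bytes = 80 * n_tuples          # ~480 MB of 3-tuples
float_bytes = 24 * n_floats          # ~432 MB of floats
list_bytes  = 48000072               # the list of references itself

print(tuple_bytes + float_bytes + list_bytes)  # 960000072 bytes, i.e. ~0.96 GB

The remaining ~80 MB in the guppy total is roughly thing1's own 48 MB list plus the interpreter's own objects (the str, dict, and code rows in the table).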
To turn the lists into collectible garbage, you need to get rid of any references to them:
del thing1
del thing2
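If you want to confirm the objects really are released, you can re-run the diagnostics from the question in the same session after dropping the references (a sketch using only the tools already shown; note that the process footprint in Activity Monitor may not fall all the way back, because the allocator is free to keep released memory around for reuse):

import gc
from guppy import hpy

del thing1            # drop the only references to the two big lists
del thing2
gc.collect()          # not strictly needed here: there are no reference cycles, so
                      # refcounting frees the tuples and floats as soon as the lists go
print(hpy().heap())   # the tuple and float rows should be back to interpreter-level counts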