
Python list anomalous memory usage

I'm working on a project that involves accessing data from a large list that's kept in memory. Because the list is quite voluminous (millions of lines) I keep an eye on how much memory is being used. I use OS X so I keep Activity Monitor open as I create these lists.

I've noticed that the amount of memory used by a list can vary wildly depending on how it is constructed but I can't seem to figure out why.

Now for some example code:

(I am using Python 2.7.4 on OS X 10.8.3.)

The first function below creates a list and fills every position with the same tuple of three random numbers.

The second function below creates a list and fills every position with a different tuple of three random numbers.

import random
import sys


def make_table1(size):
    list1 = size *[(float(),float(),float())] # initialize the list
    line = (random.random(), 
            random.random(), 
            random.random())
    for count in xrange(0, size): # Now fill it
        list1[count] = line
    return list1

def make_table2(size):
    list1 = size *[(float(),float(),float())] # initialize the list
    for count in xrange(0, size): # Now fill it
        list1[count] = (random.random(), 
                        random.random(), 
                        random.random())
    return list1

(First let me say that I realize the code above could have been written much more efficiently. It's written this way to keep the two examples as similar as possible.)

Now I create some lists using these functions:

In [2]: thing1 = make_table1(6000000)

In [3]: sys.getsizeof(thing1)
Out[3]: 48000072

At this point the memory use reported by Activity Monitor jumps by about 46 MB, which is what I would expect given the getsizeof value above.
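
Here's a rough sanity check of that number, assuming 8-byte references on a 64-bit build (sys.getsizeof only counts the list's array of references, not the objects those references point to):

import sys

size = 6000000
# Each list slot holds an 8-byte reference on a 64-bit build; the tuples
# and floats being referenced are not included in getsizeof(the_list).
print(size * 8)           # 48000000
print(sys.getsizeof([]))  # the list header itself (72 bytes here, hence 48000072)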

Now for the next function:

In [4]: thing2 = make_table2(6000000)

In [5]: sys.getsizeof(thing2)
Out[5]: 48000072

As you can see, the memory taken up by the two lists is the same. They are exactly the same length, so that's to be expected. What I didn't expect is that the memory use reported by Activity Monitor jumps to over 1 GB!

I understand there is going to be some overhead, but 20x as much? 1 GB for a 46 MB list?

Seriously?

Okay, on to diagnostics...

The first thing I tried was to collect any garbage:

In [5]: import gc

In [6]: gc.collect()
Out[6]: 0

It made zero difference to the amount of memory used.

Next I used guppy to see where the memory is going:

In [7]: from guppy import hpy

In [8]: hpy().heap()

Out[8]: 
Partition of a set of 24217689 objects. Total size = 1039012560 bytes.
 Index     Count   %       Size   % Cumulative   % Kind (class / dict of class)
     0   6054789  25  484821768  47  484821768  47 tuple
     1  18008261  74  432198264  42  917020032  88 float
     2      2267   0   96847576   9 1013867608  98 list
     3     99032   0   11392880   1 1025260488  99 str
     4       585   0    1963224   0 1027223712  99 dict of module
     5      1712   0    1799552   0 1029023264  99 dict (no owner)
     6     13606   0    1741568   0 1030764832  99 types.CodeType
     7     13355   0    1602600   0 1032367432  99 function
     8      1494   0    1348088   0 1033715520  99 type
     9      1494   0    1300752   0 1035016272 100 dict of type
<691 more rows. Type e.g. '_.more' to view.>

Okay, so my memory is taken up by:

462 MB of tuples (huh?)

412 MB of floats (what?)

92 MB of lists (okay, this one makes sense: 2 * 46 MB = 92 MB)

My lists are preallocated so I don't think that there is over-allocation going on.
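
As a rough check that preallocation really does avoid list over-allocation, here is a minimal sketch (it uses xrange as in the code above; exact sizes vary by build):

import sys

n = 1000000
pre = n * [None]         # preallocated in one go: exact capacity, no spare slots
grown = []
for _ in xrange(n):      # grown by append: resized repeatedly, so spare capacity is kept
    grown.append(None)

print(sys.getsizeof(pre))    # header + n references, nothing extra
print(sys.getsizeof(grown))  # noticeably larger because of the over-allocation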

Questions:

Why is the amount of memory used by these two very similar lists so different?

Is there a different way to populate a list that doesn't have so much overhead?

Is there a way to free up all that memory?

Note: Please don't suggest storing on the disk or using array.array or numpy or pandas data structures. Those are all great options but this question isn't about them. This question is about plain old lists.

I have tried similar code with Python 3.3 and the result is the same.

Here is someone with a similar problem. It contains some hints but it's not the same question.

Thank you all!

asked May 11 '13 by jmorris0x0



1 Answer

Both functions make a list of 6000000 references.

sizeof(thelist) ≅ sizeof(reference_to_a_python_object) * 6000000

The first list contains 6000000 references to a single tuple of three floats.

The second list contains references to 6000000 different tuples, containing 18000000 different floats in total.
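
A small sketch that makes the difference visible, using id() to check object identity (same shape as your two functions, just five rows, and xrange as in your code):

import random

line = (random.random(), random.random(), random.random())
a = 5 * [line]                                   # like make_table1: every slot -> the same tuple
b = [(random.random(), random.random(), random.random())
     for _ in xrange(5)]                         # like make_table2: a fresh tuple per slot

print(len({id(row) for row in a}))   # 1 -- one tuple object, referenced five times
print(len({id(row) for row in b}))   # 5 -- five distinct tuples (and 15 distinct floats)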

[screenshot: an interpreter session showing sys.getsizeof() for a single float and for a tuple of three floats]

As you can see, a float takes 24 bytes and a tuple of three floats takes 80 bytes (on your build of Python). No, there's no way around that, short of numpy.
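
Plugging those per-object sizes into the object counts gives roughly what you measured (back-of-the-envelope only; heapy's total also includes the interpreter's own objects):

# Rough accounting for thing2, using the sizes above:
tuples = 6000000 * 80          # 480000000 bytes for the 3-tuples
floats = 18000000 * 24         # 432000000 bytes for the distinct floats
refs   = 48000072              # the list's own reference array (your getsizeof result)
print(tuples + floats + refs)  # 960000072 -- in line with the ~1 GB Activity Monitor shows

# For thing1 the same sum is 48000072 + 80 + 3 * 24: essentially just the list itself.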

To turn the lists into collectible garbage, you need to get rid of any references to them:

del thing1 
del thing2
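
For completeness, a minimal sketch of the cleanup in an interpreter session (the gc.collect() call is optional belt-and-braces; reference counting frees the objects as soon as the last references go away):

import gc

del thing1, thing2   # drop the last references; the tuples and floats are freed
gc.collect()         # only matters for reference cycles; plain lists of tuples have none

Whether Activity Monitor shows the full drop straight away is a separate question: CPython may keep freed memory around for reuse inside the process rather than returning it to the OS immediately.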

answered by Pavel Anossov