I'm working on a project that involves accessing data from a large list that's kept in memory. Because the list is quite voluminous (millions of lines) I keep an eye on how much memory is being used. I use OS X so I keep Activity Monitor open as I create these lists.
I've noticed that the amount of memory used by a list can vary wildly depending on how it is constructed but I can't seem to figure out why.
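(Side note for anyone reproducing this without Activity Monitor: you can get a rough cross-check from inside the interpreter with the standard resource module. This is just a sketch; ru_maxrss reports the peak resident size so far rather than the current one, and on OS X it is in bytes while on Linux it is in kilobytes.)

import resource

def peak_rss_mb():
    # Peak resident set size of this process so far.
    # On OS X ru_maxrss is in bytes; on Linux it is in kilobytes (divide by 1024 instead).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024.0 * 1024.0)

print(peak_rss_mb())  # compare before and after building the lists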
Now for some example code:
(I am using Python 2.7.4 on OSX 10.8.3)
The first function below creates a list and fills every element with the same tuple of three random numbers.
The second function below creates a list and fills each element with a different tuple of three random numbers.
import random
import sys
def make_table1(size):
    list1 = size * [(float(), float(), float())]  # initialize the list
    line = (random.random(),
            random.random(),
            random.random())
    for count in xrange(0, size):  # Now fill it
        list1[count] = line
    return list1
def make_table2(size):
    list1 = size * [(float(), float(), float())]  # initialize the list
    for count in xrange(0, size):  # Now fill it
        list1[count] = (random.random(),
                        random.random(),
                        random.random())
    return list1
(First let me say that I realize the code above could have been written much more efficiently. It's written this way to keep the two examples as similar as possible.)
Now I create some lists using these functions:
In [2]: thing1 = make_table1(6000000)
In [3]: sys.getsizeof(thing1)
Out[3]: 48000072
At this point my memory used jumps by about 46 MB, which is what I would expect from the information given above.
Now for the next function:
In [4]: thing2 = make_table2(6000000)
In [5]: sys.getsizeof(thing2)
Out[5]: 48000072
As you can see, the memory taken up by the two lists is the same. They are exactly the same length so that's to be expected. What I didn't expect is that my memory used as given by Activity Monitor jumps to over 1 GB!
I understand there is going to be some overhead but 20x as much? 1 GB for a 46MB list?
Seriously?
Okay, on to diagnostics...
The first thing I tried is to collect any garbage:
In [5]: import gc
In [6]: gc.collect()
Out[6]: 0
It made zero difference to the amount of memory used.
Next I used guppy to see where the memory is going:
In [7]: from guppy import hpy
In [8]: hpy().heap()
Out[8]:
Partition of a set of 24217689 objects. Total size = 1039012560 bytes.
 Index     Count   %       Size   %  Cumulative   % Kind (class / dict of class)
     0   6054789  25  484821768  47   484821768  47 tuple
     1  18008261  74  432198264  42   917020032  88 float
     2      2267   0   96847576   9  1013867608  98 list
     3     99032   0   11392880   1  1025260488  99 str
     4       585   0    1963224   0  1027223712  99 dict of module
     5      1712   0    1799552   0  1029023264  99 dict (no owner)
     6     13606   0    1741568   0  1030764832  99 types.CodeType
     7     13355   0    1602600   0  1032367432  99 function
     8      1494   0    1348088   0  1033715520  99 type
     9      1494   0    1300752   0  1035016272 100 dict of type
<691 more rows. Type e.g. '_.more' to view.>
Okay, my memory is taken up by:
462 MB of tuple (huh?)
412 MB of float (what?)
92 MB of list (Okay, this one makes sense. 2*46MB = 92)
My lists are preallocated so I don't think that there is over-allocation going on.
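(For what it's worth, here is a small sketch that backs that up: a list built by repetition reports exactly one pointer slot per element, whereas a list grown by appending usually reports extra over-allocated slots. The exact numbers depend on the Python build.)

import sys

n = 1000000
preallocated = n * [None]          # built by repetition, like the tables above
grown = []
for i in xrange(n):                # built by appending
    grown.append(None)

print(sys.getsizeof(preallocated))  # header + 8*n on a 64-bit build: no spare slots
print(sys.getsizeof(grown))         # usually larger: append over-allocates as it grows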
Questions:
Why is the amount of memory used by these two very similar lists so different?
Is there a different way to populate a list that doesn't have so much overhead?
Is there a way to free up all that memory?
Note: Please don't suggest storing on the disk or using array.array or numpy or pandas data structures. Those are all great options but this question isn't about them. This question is about plain old lists.
I have tried similar code with Python 3.3 and the result is the same.
Here is someone with a similar problem. It contains some hints but it's not the same question.
Thank you all!
Both functions make a list of 6000000 references.
sizeof(thelist) ≅ sizeof(reference_to_a_python_object) * 6000000
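You can check that against the getsizeof figure above (a sketch, assuming a 64-bit build where each reference is an 8-byte pointer; the list-header size is taken from sys.getsizeof([]) at runtime):

import sys

n = 6000000
header = sys.getsizeof([])      # empty-list overhead; 72 bytes on a 64-bit CPython 2.7 build
print(header + 8 * n)           # 48000072 on such a build, matching sys.getsizeof(thing1) above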
The first list contains 6000000 references to one and the same tuple of three floats.
The second list contains references to 6000000 different tuples, which hold 18000000 different floats in total.
As you can see from the heap dump above, a float takes 24 bytes and a triple of floats takes 80 bytes (on your build of Python). No, there's no way around that except numpy.
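Putting those per-object sizes together reproduces the roughly 1 GB that Activity Monitor and guppy report (a back-of-the-envelope sketch using only numbers already shown above):

n_tuples = 6000000
n_floats = 3 * n_tuples              # 18000000 distinct floats in thing2
tuple_bytes = 80 * n_tuples          # ~480 MB of 3-tuples
float_bytes = 24 * n_floats          # ~432 MB of floats
list_bytes  = 48000072               # the list of references itself

print(tuple_bytes + float_bytes + list_bytes)  # 960000072 bytes, i.e. ~0.96 GB

The remaining ~80 MB in the guppy total is roughly thing1's own 48 MB list plus the interpreter's own objects (the str, dict, and code rows in the table).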
To turn the lists into collectible garbage, you need to get rid of any references to them:
del thing1
del thing2
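If you want to confirm the objects really are released, you can re-run the diagnostics from the question in the same session after dropping the references (a sketch using only the tools already shown; note that the process footprint in Activity Monitor may not fall all the way back, because the allocator is free to keep released memory around for reuse):

import gc
from guppy import hpy

del thing1            # drop the only references to the two big lists
del thing2
gc.collect()          # not strictly needed here: there are no reference cycles, so
                      # refcounting frees the tuples and floats as soon as the lists go
print(hpy().heap())   # the tuple and float rows should be back to interpreter-level counts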