Pandas: where's the memory leak here?

Tags: python, pandas

I'm facing what looks like a memory leak when using the pandas library in Python. I create pandas.DataFrame objects in my class, and I have a method that changes the DataFrame's size according to my conditions. After changing the size and creating a new pandas object, I overwrite the original pandas.DataFrame in my class. But memory usage stays very high even after significantly reducing the initial table. Here is a short example (I didn't write a process-memory monitor, I just watched the Task Manager):

import time, string, pandas, numpy, gc

class temp_class(object):

    def __init__(self, nrow=1000000, ncol=4, timetest=5):
        self.nrow = nrow
        self.ncol = ncol
        self.timetest = timetest

    def createDataFrame(self):
        print('Check memory before DataFrame creation')
        time.sleep(self.timetest)
        # nrow x ncol table of random floats with a random float index
        # (string.letters is Python 2; string.ascii_letters on Python 3)
        self.df = pandas.DataFrame(numpy.random.randn(self.nrow, self.ncol),
            index=numpy.random.randn(self.nrow),
            columns=list(string.letters[0:self.ncol]))
        print('Check memory after DataFrame creation')
        time.sleep(self.timetest)

    def changeSize(self, from_=0, to_=100):
        # take a small, independent copy of the first rows
        df_new = self.df[from_:to_].copy()
        print('Check memory after changing size')
        time.sleep(self.timetest)

        print('Check memory after deleting initial pandas object')
        del self.df
        time.sleep(self.timetest)

        print('Check memory after deleting copy of reduced pandas object')
        del df_new
        gc.collect()
        time.sleep(self.timetest)

if __name__ == '__main__':
    a = temp_class()
    a.createDataFrame()
    a.changeSize()

  • Before DataFrame creation I have approx. 15 MB of memory usage

  • After creation: 67 MB

  • After changing size: 67 MB

  • After deleting the original DataFrame: 35 MB

  • After deleting the reduced copy: 31 MB

So 16 MB more than the starting point is still being held. Where did it go?

I'm using Python 2.7.2 (x32) on a Windows 7 (x64) machine; pandas.__version__ is 0.7.3 and numpy.__version__ is 1.6.1.
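
Instead of eyeballing the Task Manager at each sleep, the checkpoints can be read programmatically. A minimal sketch, assuming the third-party psutil package is installed (it is not part of the original snippet, and the memory_info() call assumes a reasonably recent psutil):

import os
import psutil

def memory_mb():
    # resident set size (RSS) of the current process, in megabytes
    return psutil.Process(os.getpid()).memory_info().rss / (1024.0 * 1024.0)

# e.g. call this at each checkpoint instead of watching the Task Manager:
print('Check memory: %.1f MB' % memory_mb())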

asked May 15 '12 by iron.arty




1 Answer

A couple things to point out:

  1. In "Check memory after changing size", you haven't deleted the original DataFrame yet, so this will be using strictly more memory

  2. The Python interpreter is a bit greedy about holding onto OS memory.
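
On point 1, the .copy() in the example is doing real work: a plain slice can be a view that shares the parent's underlying NumPy buffer, in which case deleting the parent frees nothing while the slice is alive. A minimal sketch with plain NumPy arrays (pandas stores its data in NumPy blocks, so this is an illustration of the mechanism, not of pandas internals):

import numpy

arr = numpy.random.randn(1000000, 4)   # ~30 MB buffer

view = arr[0:100]          # a view: shares arr's buffer
copy = arr[0:100].copy()   # an independent ~3 KB buffer

print(view.base is arr)    # True  -- view keeps arr's buffer reachable
print(copy.base is None)   # True  -- copy owns its own memory

del arr
# The full ~30 MB buffer is still reachable through `view`, so it cannot
# be freed until `view` is deleted as well.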

I looked into this and can assure you that pandas is not leaking memory. I'm using the memory_profiler (http://pypi.python.org/pypi/memory_profiler) package:

import time, string, pandas, numpy, gc
from memory_profiler import LineProfiler, show_results
import memory_profiler as mprof

prof = LineProfiler()

@prof
def test(nrow=1000000, ncol = 4, timetest = 5):
    from_ = nrow // 10
    to_ = 9 * nrow // 10
    df = pandas.DataFrame(numpy.random.randn(nrow, ncol),
                          index = numpy.random.randn(nrow),
                          columns = list(string.letters[0:ncol]))
    df_new = df[from_:to_].copy()
    del df
    del df_new
    gc.collect()

test()
# for _ in xrange(10):
#     print mprof.memory_usage()

show_results(prof)

And here's the output:

10:15 ~/tmp $ python profmem.py 
Line #    Mem usage  Increment   Line Contents
==============================================
     7                           @prof
     8     28.77 MB    0.00 MB   def test(nrow=1000000, ncol = 4, timetest = 5):
     9     28.77 MB    0.00 MB       from_ = nrow // 10
    10     28.77 MB    0.00 MB       to_ = 9 * nrow // 10
    11     59.19 MB   30.42 MB       df = pandas.DataFrame(numpy.random.randn(nrow, ncol),
    12     66.77 MB    7.58 MB                             index = numpy.random.randn(nrow),
    13     90.46 MB   23.70 MB                             columns = list(string.letters[0:ncol]))
    14    114.96 MB   24.49 MB       df_new = df[from_:to_].copy()
    15    114.96 MB    0.00 MB       del df
    16     90.54 MB  -24.42 MB       del df_new
    17     52.39 MB  -38.15 MB       gc.collect()

So indeed, there is more memory in use than when we started. But is it leaking?

for _ in xrange(20):
    test()
    print mprof.memory_usage()

And output:

10:19 ~/tmp $ python profmem.py 
[52.3984375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59765625]
[122.59765625]
[122.59765625]

So what's actually going on is that the Python process holds on to a pool of memory, sized by what it has been using, so that it doesn't have to keep requesting memory from (and then returning it to) the host OS. I don't know all the technical details behind this, but that is at least what is happening.
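
One way to see this interpreter-level pooling without pandas at all: allocate millions of small Python objects, free them, and watch the process's resident size. A minimal sketch, again assuming the third-party psutil package (not used in the original answer); on CPython 2.x in particular, freed small objects such as ints go to internal free lists that are never handed back to the OS:

import gc
import os
import psutil

proc = psutil.Process(os.getpid())

def rss_mb():
    # resident set size of this process, in megabytes
    return proc.memory_info().rss / (1024.0 * 1024.0)

print('start: %.1f MB' % rss_mb())
data = [x * 2 for x in range(5000000)]   # millions of small ints
print('after allocation: %.1f MB' % rss_mb())
del data
gc.collect()
print('after del + gc: %.1f MB' % rss_mb())
# RSS typically does not fall back to the starting figure: the interpreter
# keeps the freed memory pooled for reuse by later allocations.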

answered by Wes McKinney