How to find memory leak with pandas

I have a program which repeatedly loops over a pandas DataFrame like below:

months = [some months]

for month in months:
  df = original_df[original_df.month == month].copy()
  result = some_function(df)
  print(result)

However, the memory required per iteration keeps increasing:

                                           types |   # objects |   total size
================================================ | =========== | ============
             <class 'pandas.core.frame.DataFrame |          22 |      6.54 GB
               <class 'pandas.core.series.Series |        1198 |      4.72 GB
                           <class 'numpy.ndarray |        1707 |    648.19 MB
     <class 'pandas.core.categorical.Categorical |         238 |    368.90 MB
          <class 'pandas.core.indexes.base.Index |         256 |    312.03 MB

================================================ | =========== | ============
             <class 'pandas.core.frame.DataFrame |          30 |      9.04 GB
               <class 'pandas.core.series.Series |        2262 |      7.29 GB
                           <class 'numpy.ndarray |        2958 |    834.49 MB
     <class 'pandas.core.categorical.Categorical |         356 |    569.39 MB
          <class 'pandas.core.indexes.base.Index |         380 |    481.21 MB

Do you have any suggestions on how to find the memory leak?

edit

Note: manually calling gc.collect() on each iteration does not help.
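
For reference, this is roughly what that looks like (a sketch; months, original_df and some_function are the placeholders from above):

import gc

for month in months:
    df = original_df[original_df.month == month].copy()
    result = some_function(df)
    print(result)
    # force a full garbage collection pass at the end of each iteration
    gc.collect()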

edit 2

A minimal sample is here:

import pandas as pd
from numpy.random import randn
from pympler import muppy, summary

df = pd.DataFrame(randn(10000, 3), columns=list('ABC'))
for i in range(10):
    print(i)
    something = df.copy()
    print('#########################')
    print('trying to limit memory pressure')
    # summarize all objects currently tracked by the interpreter
    all_objects = muppy.get_objects()
    sum1 = summary.summarize(all_objects)
    summary.print_(sum1)
    print('#########################')

As you can see, this logs an increase in memory consumption: starting at about 9 MB, after 10 iterations it is already using 30 MB.
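
(As a general diagnostic, the standard library's tracemalloc module can show which source lines the growing allocations come from; a minimal sketch:)

import tracemalloc
import pandas as pd
from numpy.random import randn

tracemalloc.start()

df = pd.DataFrame(randn(10000, 3), columns=list('ABC'))
baseline = tracemalloc.take_snapshot()

for i in range(10):
    something = df.copy()
    snapshot = tracemalloc.take_snapshot()
    # print the source lines whose allocations grew the most since the baseline
    for stat in snapshot.compare_to(baseline, 'lineno')[:5]:
        print(stat)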

edit 3

Actually, the comment from @Steven might have a point:

for i in range(10):
    something = df.copy()
    foo_thing = summary.summarize(muppy.get_objects())
    summary.print_(foo_thing)

shows the problem, whereas

for i in range(10):
    something = df.copy()
    summary.print_(summary.summarize(muppy.get_objects()))

works fine. How can I find all of the variables that cause such problems? I think this is especially important, as in my real code some of them are fairly large pandas.DataFrames.
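
(One general way to inspect who is still holding on to a suspiciously large object is the standard library's gc.get_referrers; a minimal sketch, where leaked stands in for one of the large frames reported by pympler:)

import gc
import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(10000, 3), columns=list('ABC'))
leaked = df.copy()  # stands in for a frame that should have been freed

# list the types of everything that still holds a reference to `leaked`;
# this will at least include the module's global namespace dict
for referrer in gc.get_referrers(leaked):
    print(type(referrer))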

edit 4

When manually adding a line foo_thing = None, the other script works fine as well. The question remains: how can I efficiently find all such cases? Shouldn't Python identify the no-longer-used variables automatically?

edit 5

When introducing a function like:

def do_some_stuff():
    foo_thing = summary.summarize(muppy.get_objects())
    summary.print_(foo_thing)

for i in range(10):
    something = df.copy()
    do_some_stuff()

the memory leak also seems to be fixed.

edit 6

Actually, the memory leak is not fixed. The good news is that the summary no longer reports drastically increasing memory consumption. The bad news is that the task manager/activity monitor tells me otherwise, and the Python program crashes at some point.
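
(To cross-check the pympler summary against what the OS sees, one option is to log the process's resident set size, e.g. with the third-party psutil package; a sketch, with df and do_some_stuff as defined above:)

import psutil

process = psutil.Process()  # the current Python process

for i in range(10):
    something = df.copy()
    do_some_stuff()
    # resident set size in MB, roughly what the task manager reports
    print(process.memory_info().rss / 1024 ** 2)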

asked Jul 24 '17 by Georg Heiler


1 Answer

The problem is with scoping. When you create a new object in a loop, it remains accessible after the loop ends; a Python loop does not introduce a new scope. This is why (I assume) the garbage collector doesn't mark the objects created via copy for collection. When you create new objects inside a function, those objects are limited to the function's scope and are NOT available outside the function. That is why they are collected.
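
A minimal way to observe this difference (a sketch using the standard library's weakref module; Blob is just a hypothetical stand-in for a large object such as a DataFrame):

import weakref

class Blob:
    """Stand-in for a large object such as a DataFrame."""

def make_and_drop():
    local = Blob()
    return weakref.ref(local)  # a weak reference does not keep `local` alive

ref = make_and_drop()
print(ref() is None)       # True: the function local was collected on return

for i in range(3):
    loop_obj = Blob()
loop_ref = weakref.ref(loop_obj)
print(loop_ref() is None)  # False: the last loop binding is still alive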

You mentioned that assigning foo_thing = None solves the problem. It does so because by pointing foo_thing to another object (None), there is no longer a name that refers to the old object, so it can be freed. I use a similar approach, but instead of foo_thing = None I do del foo_thing. After all, explicit is better than implicit.
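
Applied to the problematic loop from edit 3, that would look like this (a sketch):

for i in range(10):
    something = df.copy()
    foo_thing = summary.summarize(muppy.get_objects())
    summary.print_(foo_thing)
    # drop the name before the next iteration so the previous
    # summary cannot be kept alive across get_objects() calls
    del foo_thing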

answered Oct 21 '22 by Boris Gorelik