How to find memory leak with pandas

I have a program which repeatedly loops over a pandas DataFrame like below:

months = [some months]

for month in months:
  df = original_df[original_df.month == month].copy()
  result = some_function(df)
  print(result)

However, the memory required per iteration keeps increasing:

                                           types |   # objects |   total size
================================================ | =========== | ============
             <class 'pandas.core.frame.DataFrame |          22 |      6.54 GB
               <class 'pandas.core.series.Series |        1198 |      4.72 GB
                           <class 'numpy.ndarray |        1707 |    648.19 MB
     <class 'pandas.core.categorical.Categorical |         238 |    368.90 MB
          <class 'pandas.core.indexes.base.Index |         256 |    312.03 MB

================================================ | =========== | ============
             <class 'pandas.core.frame.DataFrame |          30 |      9.04 GB
               <class 'pandas.core.series.Series |        2262 |      7.29 GB
                           <class 'numpy.ndarray |        2958 |    834.49 MB
     <class 'pandas.core.categorical.Categorical |         356 |    569.39 MB
          <class 'pandas.core.indexes.base.Index |         380 |    481.21 MB

Do you have any suggestions on how to find the memory leak?

edit

Note: manually calling gc.collect() on each iteration does not help.
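
For reference, this is roughly what that looks like (a sketch; months, original_df and some_function are the placeholders from above):

import gc

for month in months:
    df = original_df[original_df.month == month].copy()
    result = some_function(df)
    print(result)
    # force a full garbage collection pass at the end of each iteration
    gc.collect()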

edit 2

A minimal sample is here:

import pandas as pd
from numpy.random import randn
from pympler import muppy, summary

df = pd.DataFrame(randn(10000, 3), columns=list('ABC'))
for i in range(10):
    print(i)
    something = df.copy()
    print('#########################')
    print('trying to limit memory pressure')
    # summarize all objects currently tracked by the interpreter
    all_objects = muppy.get_objects()
    sum1 = summary.summarize(all_objects)
    summary.print_(sum1)
    print('#########################')

As you can see, this logs an increase in memory consumption: starting at about 9 MB, after 10 iterations it is already using 30 MB.
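
(As a general diagnostic, the standard library's tracemalloc module can show which source lines the growing allocations come from; a minimal sketch:)

import tracemalloc
import pandas as pd
from numpy.random import randn

tracemalloc.start()

df = pd.DataFrame(randn(10000, 3), columns=list('ABC'))
baseline = tracemalloc.take_snapshot()

for i in range(10):
    something = df.copy()
    snapshot = tracemalloc.take_snapshot()
    # print the source lines whose allocations grew the most since the baseline
    for stat in snapshot.compare_to(baseline, 'lineno')[:5]:
        print(stat)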

edit 3

Actually, the comment from @Steven might have a point:

for i in range(10):
    something = df.copy()
    foo_thing = summary.summarize(muppy.get_objects())
    summary.print_(foo_thing)

shows the problem, whereas

for i in range(10):
    something = df.copy()
    summary.print_(summary.summarize(muppy.get_objects()))

works fine. How can I find all of the variables that cause such problems? I think this is especially important, as in my real code some of them are fairly large pandas.DataFrames.
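
(One general way to inspect who is still holding on to a suspiciously large object is the standard library's gc.get_referrers; a minimal sketch, where leaked stands in for one of the large frames reported by pympler:)

import gc
import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(10000, 3), columns=list('ABC'))
leaked = df.copy()  # stands in for a frame that should have been freed

# list the types of everything that still holds a reference to `leaked`;
# this will at least include the module's global namespace dict
for referrer in gc.get_referrers(leaked):
    print(type(referrer))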

edit 4

When manually adding a line foo_thing = None, the other script works fine as well. The question remains: how can I efficiently find all such cases? Shouldn't Python identify the no-longer-used variables automatically?

edit 5

When introducing a function like:

def do_some_stuff():
    foo_thing = summary.summarize(muppy.get_objects())
    summary.print_(foo_thing)

for i in range(10):
    something = df.copy()
    do_some_stuff()

the memory leak also seems to be fixed.

edit 6

Actually, the memory leak is not fixed. The good news is that the summary no longer reports drastically increasing memory consumption. The bad news is that the task manager/activity monitor tells me otherwise, and the Python program crashes at some point.
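
(To cross-check the pympler summary against what the OS sees, one option is to log the process's resident set size, e.g. with the third-party psutil package; a sketch, with df and do_some_stuff as defined above:)

import psutil

process = psutil.Process()  # the current Python process

for i in range(10):
    something = df.copy()
    do_some_stuff()
    # resident set size in MB, roughly what the task manager reports
    print(process.memory_info().rss / 1024 ** 2)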

asked Jul 24 '17 by Georg Heiler


1 Answer

The problem is with scoping. When you create a new object in a loop, it remains accessible after the loop ends; a Python loop does not introduce a new scope. This is why (I assume) the garbage collector doesn't mark the objects created via copy for collection. When you create new objects inside a function, those objects are limited to the function's scope and are NOT available outside the function. That is why they are collected.
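
A minimal way to observe this difference (a sketch using the standard library's weakref module; Blob is just a hypothetical stand-in for a large object such as a DataFrame):

import weakref

class Blob:
    """Stand-in for a large object such as a DataFrame."""

def make_and_drop():
    local = Blob()
    return weakref.ref(local)  # a weak reference does not keep `local` alive

ref = make_and_drop()
print(ref() is None)       # True: the function local was collected on return

for i in range(3):
    loop_obj = Blob()
loop_ref = weakref.ref(loop_obj)
print(loop_ref() is None)  # False: the last loop binding is still alive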

You mentioned that assigning foo_thing = None solves the problem. It does so because by pointing foo_thing to another object (None), there is no longer a name that refers to the old object, so it can be freed. I use a similar approach, but instead of foo_thing = None I do del foo_thing. After all, explicit is better than implicit.
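
Applied to the problematic loop from edit 3, that would look like this (a sketch):

for i in range(10):
    something = df.copy()
    foo_thing = summary.summarize(muppy.get_objects())
    summary.print_(foo_thing)
    # drop the name before the next iteration so the previous
    # summary cannot be kept alive across get_objects() calls
    del foo_thing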

answered Oct 21 '22 by Boris Gorelik