I have a program which repeatedly loops over a pandas data frame like below:
months = [some months]
for month in months:
    df = original_df[original_df.month == month].copy()
    result = some_function(df)
    print(result)
However, the memory required per iteration keeps increasing:
types                                        |   # objects |   total size
============================================ | =========== | ============
<class 'pandas.core.frame.DataFrame         |          22 |      6.54 GB
<class 'pandas.core.series.Series           |        1198 |      4.72 GB
<class 'numpy.ndarray                       |        1707 |    648.19 MB
<class 'pandas.core.categorical.Categorical |         238 |    368.90 MB
<class 'pandas.core.indexes.base.Index      |         256 |    312.03 MB
============================================ | =========== | ============
<class 'pandas.core.frame.DataFrame         |          30 |      9.04 GB
<class 'pandas.core.series.Series           |        2262 |      7.29 GB
<class 'numpy.ndarray                       |        2958 |    834.49 MB
<class 'pandas.core.categorical.Categorical |         356 |    569.39 MB
<class 'pandas.core.indexes.base.Index      |         380 |    481.21 MB
Do you have any suggestions on how to find the memory leak? Note that manually calling gc.collect() on each iteration does not help.
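Roughly, that attempt looked like this (gc.collect() placed at the end of each iteration):

import gc

for month in months:
    df = original_df[original_df.month == month].copy()
    result = some_function(df)
    print(result)
    gc.collect()  # force a collection at the end of every iteration; memory still grows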
A minimal sample is here:
import pandas as pd
from numpy.random import randn
from pympler import muppy, summary

df = pd.DataFrame(randn(10000, 3), columns=list('ABC'))
for i in range(10):
    print(i)
    something = df.copy()
    print('#########################')
    print('trying to limit memory pressure')
    all_objects = muppy.get_objects()
    sum1 = summary.summarize(all_objects)
    summary.print_(sum1)
    print('#########################')
As you can see, this logs an increase in memory consumption: starting at 9 MB, after 10 iterations it is already using 30 MB.
Actually, the comment from @Steven might have a point:
for i in range(10):
    something = df.copy()
    foo_thing = summary.summarize(muppy.get_objects())
    summary.print_(foo_thing)
shows the problem, whereas
for i in range(10):
    something = df.copy()
    summary.print_(summary.summarize(muppy.get_objects()))
works fine. How can I find all the variables that cause such problems? I think this is especially important because in my real code some of them are fairly large pandas.DataFrames.
When I manually add a line foo_thing = None, the other script works fine as well. The question remains: how can I efficiently find all such cases?
Shouldn't Python identify the no-longer-used variable automatically?
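For completeness: pympler also ships a tracker module that prints only the difference between consecutive snapshots, which might make it easier to see what accumulates per iteration. A minimal sketch (I have not verified this against my real data):

import pandas as pd
from numpy.random import randn
from pympler import tracker

df = pd.DataFrame(randn(10000, 3), columns=list('ABC'))

tr = tracker.SummaryTracker()
for i in range(10):
    something = df.copy()
    # print only the objects created or grown since the previous snapshot
    tr.print_diff()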
When introducing a function like:
def do_some_stuff():
    foo_thing = summary.summarize(muppy.get_objects())
    summary.print_(foo_thing)

for i in range(10):
    something = df.copy()
    do_some_stuff()
the memory leak also seems to be fixed.
Actually, the memory leak is not fixed. The good thing is that the summary no longer reports drastically increasing memory consumption. The bad thing is that the task manager/activity monitor tells me otherwise, and the Python program crashes at some point.
To find a memory leak, look at how much RAM the system is using. The Resource Monitor in Windows can be used for this. In Windows 8.1 and Windows 10: press Windows+R to open the Run dialog, type "resmon", and click OK.
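If you prefer to watch it from inside the script instead, one option (an assumption on my part; it requires the third-party psutil package) is to print the process's resident set size on every iteration:

import os

import pandas as pd
import psutil
from numpy.random import randn

df = pd.DataFrame(randn(10000, 3), columns=list('ABC'))
proc = psutil.Process(os.getpid())

for i in range(10):
    something = df.copy()
    # resident set size: memory the OS has actually allocated to this process
    print(f"iteration {i}: RSS = {proc.memory_info().rss / 1024 ** 2:.1f} MB")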
The problem is with scoping. When you create a new object in the loop, it is supposed to be accessible after the loop ends. This is why (I assume) the garbage collector doesn't mark the objects created with copy() for garbage collection. When you create new objects inside a function, those objects are limited to the function scope and are NOT supposed to be available outside the function. That is why they are collected.
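A small illustration of that difference (hypothetical example, not from the question):

import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(10000, 3), columns=list('ABC'))

# Loop bodies do not create a new scope: 'something' survives the loop,
# so the last copy stays referenced and cannot be collected.
for i in range(10):
    something = df.copy()
print(something.shape)   # still accessible after the loop

# A function body is its own scope: 'local_copy' disappears when the call
# returns, so the copy becomes unreachable and can be freed.
def work(frame):
    local_copy = frame.copy()
    return local_copy.shape

print(work(df))          # 'local_copy' is gone here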
You mentioned that assigning foo_thing = None solves the problem. It does so because by pointing foo_thing to another object (None), there is no longer a variable that refers to the data frame. I use a similar approach, but instead of foo_thing = None I do del foo_thing. After all, explicit is better than implicit.
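Applied to the loop from the question, that looks something like this (a sketch; I also drop the copied frame with del something, which the question did not do):

import pandas as pd
from numpy.random import randn
from pympler import muppy, summary

df = pd.DataFrame(randn(10000, 3), columns=list('ABC'))

for i in range(10):
    something = df.copy()
    foo_thing = summary.summarize(muppy.get_objects())
    summary.print_(foo_thing)
    del foo_thing   # drop the reference so the summary can be collected
    del something   # likewise for the copied frame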