I am running into what looks like a memory leak when using the pandas library in Python. I create pandas.DataFrame objects in my class, and I have a method that changes the DataFrame's size according to my conditions. After changing the size and creating a new pandas object, I overwrite the original pandas.DataFrame in my class. But memory usage stays very high, even after significantly reducing the initial table. Some code for a short example (I didn't script the memory measurement; watch Task Manager during the sleeps):
import time, string, pandas, numpy, gc
class temp_class():

    def __init__(self, nrow=1000000, ncol=4, timetest=5):
        self.nrow = nrow
        self.ncol = ncol
        self.timetest = timetest

    def createDataFrame(self):
        print('Check memory before dataframe creating')
        time.sleep(self.timetest)
        self.df = pandas.DataFrame(numpy.random.randn(self.nrow, self.ncol),
                                   index=numpy.random.randn(self.nrow),
                                   columns=list(string.letters[0:self.ncol]))
        print('Check memory after dataFrame creating')
        time.sleep(self.timetest)

    def changeSize(self, from_=0, to_=100):
        df_new = self.df[from_:to_].copy()
        print('Check memory after changing size')
        time.sleep(self.timetest)

        print('Check memory after deleting initial pandas object')
        del self.df
        time.sleep(self.timetest)

        print('Check memory after deleting copy of reduced pandas object')
        del df_new
        gc.collect()
        time.sleep(self.timetest)

if __name__ == '__main__':
    a = temp_class()
    a.createDataFrame()
    a.changeSize()
Before creating the DataFrame I have approx. 15 MB of memory usage
After creating it: 67 MB
After changing the size: 67 MB
After deleting the original DataFrame: 35 MB
After deleting the reduced table: 31 MB
That still leaves 16 MB more than I started with. Where did it go?
I use Python 2.7.2 (x32) on a Windows 7 (x64) machine; pandas version is 0.7.3, numpy version is 1.6.1.
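For what it's worth, instead of watching Task Manager during each sleep, the checkpoints could print the process's resident set size directly. Here is a minimal sketch assuming the psutil package (not part of my original setup; print_rss is a helper name I made up):

import os, psutil

def print_rss(label):
    # Resident set size of the current process, reported in MB.
    # Assumes a recent psutil where Process.memory_info() is available.
    rss = psutil.Process(os.getpid()).memory_info().rss
    print('%s: %.1f MB' % (label, rss / 1024.0 / 1024.0))

# e.g. instead of print + time.sleep at each checkpoint:
# print_rss('Check memory after changing size')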
A couple things to point out:
In "Check memory after changing size", you haven't deleted the original DataFrame yet, so this will be using strictly more memory
The Python interpreter is a bit greedy about holding onto OS memory.
I looked into this and can assure you that pandas is not leaking memory. I'm using the memory_profiler (http://pypi.python.org/pypi/memory_profiler) package:
import time, string, pandas, numpy, gc
from memory_profiler import LineProfiler, show_results
import memory_profiler as mprof
prof = LineProfiler()
@prof
def test(nrow=1000000, ncol=4, timetest=5):
    from_ = nrow // 10
    to_ = 9 * nrow // 10
    df = pandas.DataFrame(numpy.random.randn(nrow, ncol),
                          index=numpy.random.randn(nrow),
                          columns=list(string.letters[0:ncol]))
    df_new = df[from_:to_].copy()
    del df
    del df_new
    gc.collect()

test()
# for _ in xrange(10):
#     print mprof.memory_usage()
show_results(prof)
And here's the output
10:15 ~/tmp $ python profmem.py
Line # Mem usage Increment Line Contents
==============================================
7 @prof
8 28.77 MB 0.00 MB def test(nrow=1000000, ncol = 4, timetest = 5):
9 28.77 MB 0.00 MB from_ = nrow // 10
10 28.77 MB 0.00 MB to_ = 9 * nrow // 10
11 59.19 MB 30.42 MB df = pandas.DataFrame(numpy.random.randn(nrow, ncol),
12 66.77 MB 7.58 MB index = numpy.random.randn(nrow),
13 90.46 MB 23.70 MB columns = list(string.letters[0:ncol]))
14 114.96 MB 24.49 MB df_new = df[from_:to_].copy()
15 114.96 MB 0.00 MB del df
16 90.54 MB -24.42 MB del df_new
17 52.39 MB -38.15 MB gc.collect()
So indeed, there is more memory in use than when we started. But is it leaking?
for _ in xrange(20):
    test()
    print mprof.memory_usage()
And the output:
10:19 ~/tmp $ python profmem.py
[52.3984375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59765625]
[122.59765625]
[122.59765625]
So what's actually going on is that the Python process holds on to a pool of memory, sized by what it has been using, to avoid having to keep requesting more memory from (and then returning it to) the host OS. I don't know all the technical details behind this, but that is at least what is going on.
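If you really need the memory handed back to the OS promptly, one common workaround (a sketch of mine, not something pandas itself provides; heavy_work is a hypothetical stand-in for the DataFrame code above) is to do the allocation-heavy work in a child process, since the OS reclaims everything when that process exits:

import multiprocessing

def heavy_work():
    # Hypothetical stand-in for the DataFrame resize above.
    import pandas, numpy
    df = pandas.DataFrame(numpy.random.randn(1000000, 4))
    df_new = df[:100].copy()
    del df, df_new

if __name__ == '__main__':
    # When the child exits, all of its memory goes back to the OS,
    # bypassing the parent interpreter's internal free pool entirely.
    p = multiprocessing.Process(target=heavy_work)
    p.start()
    p.join()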