I'm interested in finding out how much the total size of Python's heap grows when a large object is loaded. heapy seems to be what I need, but I don't understand the results.
I have a 350 MB pickle file containing a pandas DataFrame with about 2.5 million entries. When I load the file and then inspect the heap with heapy, it reports that only roughly 8 MB of objects have been added to the heap.
import guppy
import pickle

h = guppy.hpy()
h.setrelheap()  # use the current heap as the baseline
df = pickle.load(open('test-df.pickle'))
h.heap()        # report only objects added since the baseline
This gives the following output:
Partition of a set of 95278 objects. Total size = 8694448 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  44700  47  4445944  51    4445944  51 str
     1  25595  27  1056560  12    5502504  63 tuple
     2   6935   7   499320   6    6001824  69 types.CodeType
...
What confuses me is the Total size of 8694448 bytes. That's just 8 MB. Why doesn't Total size reflect the size of the whole DataFrame df?
(Using Python 2.7.3, heapy 0.1.10, Linux 3.2.0-48-generic-pae (Ubuntu), i686)
You could try pympler, which worked for me the last time I checked. If you are just interested in the total memory increase rather than in a specific class, you could use an OS-specific call to get the total memory used. E.g., on a Unix-based OS, you could do something like the following before and after loading the object and take the difference.
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
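For example, a minimal sketch of that before/after diff might look like this (the file name test-df.pickle is taken from the question; note that ru_maxrss is reported in kilobytes on Linux, but in bytes on macOS):

import pickle
import resource

def peak_rss_kb():
    # Peak resident set size of the current process (kilobytes on Linux)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kb()
df = pickle.load(open('test-df.pickle'))  # the 350 MB pickle from the question
after = peak_rss_kb()

print('Process memory grew by roughly %d MB' % ((after - before) / 1024))

Since ru_maxrss is a peak value it never decreases when memory is freed, but for measuring the cost of a single large load that is usually good enough. Pympler's asizeof.asizeof(df) is another option if you want a per-object figure, though its results for NumPy-backed containers can vary.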
I had a similar problem when I was trying to find out why my 500 MB CSV files were taking up to 5 GB in memory. Pandas is basically built on top of NumPy and therefore uses C malloc to allocate space. This is why it doesn't show up in heapy, which only profiles pure Python objects. One solution might be to look into Valgrind to track down your memory leaks.
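Not part of the original answer, but a quick way to confirm that the data lives in NumPy buffers heapy cannot see is to ask NumPy for the size of the underlying arrays directly. A small sketch with a hypothetical numeric DataFrame standing in for df:

import numpy as np
import pandas as pd

# Hypothetical frame standing in for the 2.5-million-entry DataFrame in the question
df = pd.DataFrame(np.random.randn(1000000, 5))

# nbytes reports the size of the raw NumPy buffer, which heapy does not traverse
print('NumPy buffer size: %.1f MB' % (df.values.nbytes / 1024.0 / 1024.0))

Newer pandas versions also offer df.memory_usage() for a per-column breakdown.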