I'm using numpy (1.13.1) and pandas (0.20.3) on Ubuntu 16.10 with Python 2.7 or 3.5 (the same issue appears on either).
I was investigating pandas memory handling (specifically when it does or does not copy data) and came across a significant memory issue I don't understand. While I've seen many other questions about pandas memory performance, I haven't found any that directly address this issue.
Specifically, pandas allocates a lot more memory than I ask it to. I noticed some pretty odd behavior when just trying to allocate a DataFrame with a column of a specific size:
import pandas as pd, numpy as np
GB = 1024**3
df = pd.DataFrame()
# This column should hold exactly 1 GiB of float64 data (2**27 elements * 8 bytes)
df['MyCol'] = np.ones(int(1*GB/8), dtype='float64')
When I execute this, my Python process actually allocates 6GB of memory (12GB if I ask for 2GB, 21GB if I ask for 3GB, and my machine falls over if I ask for 4GB :-/), as opposed to the 1GB I expected. At first I thought Python might be doing some aggressive preallocation, but if I construct only the numpy array itself I get exactly the amount of memory I ask for every time, whether it's 1GB, 10GB, or 25GB.
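For reference, this is roughly how I'm measuring the footprint (a minimal sketch assuming psutil is installed; the helper name rss_gb is just for illustration):

import psutil
import pandas as pd, numpy as np

GB = 1024**3

def rss_gb():
    # Resident set size of the current process, in GiB
    return psutil.Process().memory_info().rss / float(GB)

before = rss_gb()
df = pd.DataFrame()
df['MyCol'] = np.ones(int(1*GB/8), dtype='float64')  # should be ~1 GiB of data
after = rss_gb()
print('RSS grew by roughly %.1f GiB' % (after - before))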
Further, what's even more interesting is that if I change the code slightly to this:
df['MyCol'] = np.ones(int(1*GB), dtype='uint8')
It allocates so much memory that it crashes my system (running the numpy call alone correctly allocates 1GB of memory). (Edit 2017/8/17: out of curiosity I tried this again today with updated versions of pandas (0.20.3) and numpy (1.13.1), and with RAM upgraded to 64GB. Running this command is still broken, allocating all 64(ish)GB of available RAM.)
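To probe the uint8 case without taking the machine down, the same assignment can be repeated with a much smaller array to check whether the dtype survives and what pandas itself reports for memory (a diagnostic sketch, not an explanation):

import pandas as pd, numpy as np

small = np.ones(10**6, dtype='uint8')   # 1 MB instead of 1 GB
df = pd.DataFrame()
df['MyCol'] = small
print(df['MyCol'].dtype)                # is the uint8 dtype preserved?
print(df.memory_usage(deep=True))       # bytes reported per column and for the index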
I could understand a doubling or even tripling of the memory asked for if pandas makes a copy and perhaps allocates another column to store indices, but I can't explain what it's actually doing. It isn't exactly clear from a cursory glance at the code either.
I've tried constructing the DataFrame in a few different ways, all with the same results. Given that others use this package successfully for large-scale data analysis, I have to presume I'm doing something horribly wrong, although as far as I can tell from the documentation this should be correct (see the sketch below for the kinds of variations I mean).
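Illustrative only (the exact variants I tried may have differed slightly), the constructions were along these lines:

import pandas as pd, numpy as np

GB = 1024**3
data = np.ones(int(1*GB/8), dtype='float64')

# Variation 1: column assignment on an empty frame (as above)
df1 = pd.DataFrame()
df1['MyCol'] = data

# Variation 2: construct from a dict
df2 = pd.DataFrame({'MyCol': data})

# Variation 3: construct from the array directly and name the column
df3 = pd.DataFrame(data, columns=['MyCol'])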
Thoughts?
Some additional notes:
So I make an 8000-byte array:
In [248]: x=np.ones(1000)
In [249]: df=pd.DataFrame({'MyCol': x}, dtype=float)
In [250]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 1 columns):
MyCol 1000 non-null float64
dtypes: float64(1)
memory usage: 15.6 KB
So that's 8k for the data and 8k for the index.
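The same split can be read directly from memory_usage(), which reports the bytes held by each column and by the index (a quick check in the same session; the output shown is what I'd expect for an Int64Index of 1000 entries plus one float64 column):

df.memory_usage()
# Index    8000
# MyCol    8000
# dtype: int64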
If I add a column, usage increases by the size of x:
In [251]: df['col2']=x
In [252]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 2 columns):
MyCol 1000 non-null float64
col2 1000 non-null float64
dtypes: float64(2)
memory usage: 23.4 KB
In [253]: x.nbytes
Out[253]: 8000
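A related check is whether the assigned column still references x or whether pandas copied it into its own internal block; assuming a numpy recent enough to have np.shares_memory (it exists in the version used here), something like this will tell you:

np.shares_memory(x, df['col2'].values)
# False here would mean pandas copied x rather than sharing its memory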