Pandas uses substantially more memory for storage than asked for

I'm using numpy (1.13.1) and pandas (0.20.3) on Ubuntu 16.10 with Python 2.7 or 3.5 (same issue on either).

I was investigating pandas memory handling (specifically when it copies or does not copy data) and came across a significant memory issue I don't understand. While I've seen (many) other questions people have regarding its memory performance, I haven't found any that directly addresses this issue.

Specifically, pandas allocates a lot more memory than I ask it to. I noticed some pretty odd behavior when just trying to allocate a DataFrame with a column of a specific size:

import pandas as pd, numpy as np
GB = 1024**3
df = pd.DataFrame()
df['MyCol'] = np.ones(int(1*GB/8), dtype='float64')

When I execute this I see my Python process actually allocate 6GB of memory (12GB if I ask for 2GB, 21GB if I ask for 3GB, and my computer breaks if I ask for 4GB :-/), as opposed to the 1GB I expected. I thought at first that Python was doing some aggressive preallocation; however, if I construct only the numpy array itself I get exactly as much memory as I ask for every time, whether it's 1GB, 10GB, or 25GB.
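
One way to see what is actually being committed (as opposed to what pandas reports) is to watch the process RSS around each step. The sketch below is my own addition, not from the original post; it assumes psutil is installed, and the exact numbers will depend on platform and library versions.

import gc
import numpy as np
import pandas as pd
import psutil  # assumption: psutil is available for reading process memory

GB = 1024**3
proc = psutil.Process()

def rss_gb():
    # Resident set size of this process, in GB
    return proc.memory_info().rss / GB

base = rss_gb()

arr = np.ones(int(1 * GB / 8), dtype='float64')  # ~1GB of float64
print('after numpy allocation:     +%.2f GB' % (rss_gb() - base))

df = pd.DataFrame()
df['MyCol'] = arr                                # the assignment in question
print('after DataFrame assignment: +%.2f GB' % (rss_gb() - base))

del df, arr
gc.collect()
print('after cleanup:              +%.2f GB' % (rss_gb() - base))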

Further, what's even more interesting is if I change the code slightly to this:

df['MyCol'] = np.ones(int(1*GB), dtype='uint8')

It allocates so much memory it crashes my system (running the numpy call alone correctly allocates 1GB of memory). (Edit 2017/8/17: Out of curiosity I tried this today with updated versions of pandas (0.20.3) and numpy (1.13.1), along with RAM upgraded to 64GB. And running this command is still broken, allocating all 64(ish)GB of available RAM.)

I could understand a doubling or even tripling of the memory asked for if pandas makes a copy and perhaps allocates another column to store indices, but I can't explain what it's actually doing. It isn't exactly clear from a cursory glance at the code either.
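
Whether a copy is responsible is at least easy to check. This snippet is my own addition rather than something from the question; np.shares_memory compares the underlying buffers, so False means the column's data was copied somewhere along the way.

import numpy as np
import pandas as pd

arr = np.ones(1000, dtype='float64')

# Path 1: assign into an empty frame, as in the snippet above
df = pd.DataFrame()
df['MyCol'] = arr
print(np.shares_memory(arr, df['MyCol'].values))

# Path 2: hand the array to the constructor instead
df2 = pd.DataFrame({'MyCol': arr})
print(np.shares_memory(arr, df2['MyCol'].values))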

I've tried constructing the data frame in a few different ways, all with the same results. Given that others use this package successfully for large data analysis I have to presume I'm doing something horribly wrong, although from what I can tell given the documentation this should be correct.

Thoughts?

Some additional notes:

  1. Even when the actual memory usage is large, pandas still reports only the expected data size when memory_usage() is called (i.e. if I allocate a 1GB array it reports 1GB, even if 6-10GB has actually been allocated). A minimal check of this accounting is sketched after this list.
  2. In all cases the index has been tiny (as reported by memory_usage(), which may be inaccurate).
  3. Deallocating the DataFrame (df = None followed by gc.collect()) does not actually free all the memory, so there must be a leak somewhere along this path.
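
For reference, here is how pandas' own accounting and the cleanup attempt from note 3 look in code. This is a sketch I added for illustration; memory_usage(deep=True) and gc.collect() are standard calls, but how much memory the OS actually reclaims afterwards depends on the allocator.

import gc
import numpy as np
import pandas as pd

GB = 1024**3
df = pd.DataFrame()
df['MyCol'] = np.ones(int(1 * GB / 8), dtype='float64')

# What pandas believes it is holding (per column, plus the index), in bytes
print(df.memory_usage(index=True, deep=True))
print('%.2f GB reported' % (df.memory_usage(deep=True).sum() / GB))

# Attempt to release everything; RSS may stay elevated even after this,
# which is the leak-like behaviour described in note 3.
df = None
gc.collect()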

Asked by Anthony on Jan 09 '17


1 Answer

So I make an 8000-byte array:

In [248]: x=np.ones(1000)

In [249]: df=pd.DataFrame({'MyCol': x}, dtype=float)
In [250]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 1 columns):
MyCol    1000 non-null float64
dtypes: float64(1)
memory usage: 15.6 KB

So that's 8k for the data and 8k for the index.
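
The same breakdown can be read directly from memory_usage(). The lines below are an illustrative addition, not part of the original answer; the exact index figure depends on the index type (the Int64Index above costs 8 bytes per row, whereas a RangeIndex reports only a small constant).

# Per-column accounting, including the index
print(df.memory_usage(index=True))
# With the Int64Index above this reports roughly 8000 bytes for 'Index'
# and 8000 bytes for 'MyCol', i.e. 8 bytes per row each.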

When I add a column, usage increases by the size of x:

In [251]: df['col2']=x
In [252]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 2 columns):
MyCol    1000 non-null float64
col2     1000 non-null float64
dtypes: float64(2)
memory usage: 23.4 KB

In [253]: x.nbytes
Out[253]: 8000

Answered by hpaulj on Oct 13 '22