
Memory Error while pickling a data frame to disk


I have a 51K x 8.5K data frame with only binary (1 or 0) values.

I wrote the following code:

Pickling the data frame to disk:

import pickle

outfile = open("df_preference.p", "wb")
pickle.dump(df_preference, outfile)
outfile.close()

It throws a MemoryError, as shown below:

MemoryError                               Traceback (most recent call last)
<ipython-input-48-de66e880aacb> in <module>()
      2 
      3 outfile=open("df_preference.p", "wb")
----> 4 pickle.dump(df_preference,outfile)
      5 outfile.close()

I assume this means the data is too large to be pickled? But it only holds binary values.

Before this, I created this data frame from another one that held normal counts and a lot of zeros, using the following code:

df_preference=df_recommender.applymap(lambda x: np.where(x >0, 1, 0))

Creating df_preference this way itself took some time, even though the matrix is the same size.

My concern is that if it i) takes this long to create a data frame using applymap and ii) the data frame can't even be pickled due to a memory error, then the matrix factorization of df_preference via SVD and Alternating Least Squares that I plan to do next will be even slower. How can I tackle the slow run and solve the memory error?
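A quick way to see how much memory such a frame actually occupies per dtype is `memory_usage(deep=True)`; the small shape below is just a stand-in for the real 51K x 8.5K frame:

```python
import numpy as np
import pandas as pd

# Small stand-in for the 51K x 8.5K binary frame
df = pd.DataFrame(np.random.randint(0, 2, size=(1000, 100)))

# memory_usage(deep=True) reports bytes per column; sum() gives the total
bytes_int64 = df.astype(np.int64).memory_usage(deep=True).sum()
bytes_int8 = df.astype(np.int8).memory_usage(deep=True).sum()
print(bytes_int64, bytes_int8)  # int8 stores one byte per value instead of eight
```

At 51K x 8.5K, the difference between 8 bytes and 1 byte per value is the difference between ~3.5 GB and ~430 MB.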

Thanks

asked May 27 '16 by Baktaawar

1 Answer

UPDATE:

For 1 and 0 values you can use the int8 (1-byte) dtype, which will cut your memory usage at least 4-fold (compared to int32; 8-fold compared to int64):

(df_recommender > 0).astype(np.int8).to_pickle('/path/to/file.pickle')
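A quick sanity check of that one-liner (the data and file name below are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for df_recommender
df_recommender = pd.DataFrame(np.random.randint(0, 10, size=(5, 3)),
                              columns=list('abc'))

# Binarize, downcast to int8, and pickle in one step
(df_recommender > 0).astype(np.int8).to_pickle('df_preference.pickle')

# Round-trip: the int8 dtype survives pickling
df_back = pd.read_pickle('df_preference.pickle')
print(df_back.dtypes.unique())
```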

Here is an example with a 51K x 9K data frame:

In [1]: df = pd.DataFrame(np.random.randint(0, 10, size=(51000, 9000)))

In [2]: df.shape
Out[2]: (51000, 9000)

In [3]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int32(9000)
memory usage: 1.7 GB

the source DF needs 1.7 GB in memory

In [6]: df_preference = (df>0).astype(int)

In [7]: df_preference.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int32(9000)
memory usage: 1.7 GB

the resulting DF again needs 1.7 GB in memory - plain int maps to a fixed-width NumPy integer (int32 on this build), so nothing is saved

In [4]: df_preference = (df>0).astype(np.int8)

In [5]: df_preference.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int8(9000)
memory usage: 437.7 MB

with int8 dtype it takes only 438 MB
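Since the end goal is SVD/ALS on a mostly-zero matrix, a sparse representation can shrink memory further and feeds directly into a truncated SVD; a sketch assuming SciPy is available (the shape and density below are illustrative):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Dense 0/1 matrix standing in for df_preference.values (~5% nonzero)
dense = (np.random.rand(1000, 500) > 0.95).astype(np.int8)

# CSR format stores only the nonzero entries plus index arrays
sp = sparse.csr_matrix(dense)
sparse_bytes = sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes
print(dense.nbytes, sparse_bytes)

# Truncated SVD computed directly on the sparse matrix, no dense intermediate
u, s, vt = svds(sp.astype(np.float64), k=10)
print(u.shape, s.shape, vt.shape)
```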

now let's save it as a Pickle file:

In [10]: df_preference.to_pickle('d:/temp/df_pref.pickle')

file size:

{ temp }  » ls -lh df_pref.pickle
-rw-r--r-- 1 Max None 438M May 28 09:20 df_pref.pickle

OLD answer:

try this instead:

(df_recommender > 0).astype(int).to_pickle('/path/to/file.pickle')

Explanation:

In [200]: df
Out[200]:
   a  b  c
0  4  3  3
1  1  2  1
2  2  1  0
3  2  0  1
4  2  0  4

In [201]: (df>0).astype(int)
Out[201]:
   a  b  c
0  1  1  1
1  1  1  1
2  1  1  0
3  1  0  1
4  1  0  1

PS: you may also want to save your DF as an HDF5 file instead of Pickle - see this comparison for details

answered Oct 11 '22 by MaxU - stop WAR against UA