
Memory Error while pickling a data frame to disk


I have a 51K x 8.5K data frame with only binary (1 or 0) values.

I wrote the following code:

Pickling the data frame to disk:

import pickle

outfile = open("df_preference.p", "wb")
pickle.dump(df_preference, outfile)
outfile.close()

It throws a MemoryError, as shown below:

MemoryError                               Traceback (most recent call last)
<ipython-input-48-de66e880aacb> in <module>()
      2 
      3 outfile=open("df_preference.p", "wb")
----> 4 pickle.dump(df_preference,outfile)
      5 outfile.close()

I assume this means the data is too large to be pickled? But it only holds binary values.

Before this, I created this data frame from another one that held normal counts and a lot of zeros, using the following code:

df_preference=df_recommender.applymap(lambda x: np.where(x >0, 1, 0))

Creating df_preference this way itself took some time, even though the matrix is the same size.

My concern is that if it i) takes this long to create a data frame using applymap and ii) the data frame can't even be pickled due to a memory error, then the matrix factorization of df_preference via SVD and Alternating Least Squares that I plan to do next will be even slower. How can I tackle the slow run and solve the memory error?
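A quick way to see how much memory such a frame actually occupies per dtype is `memory_usage(deep=True)`; the small shape below is just a stand-in for the real 51K x 8.5K frame:

```python
import numpy as np
import pandas as pd

# Small stand-in for the 51K x 8.5K binary frame
df = pd.DataFrame(np.random.randint(0, 2, size=(1000, 100)))

# memory_usage(deep=True) reports bytes per column; sum() gives the total
bytes_int64 = df.astype(np.int64).memory_usage(deep=True).sum()
bytes_int8 = df.astype(np.int8).memory_usage(deep=True).sum()
print(bytes_int64, bytes_int8)  # int8 stores one byte per value instead of eight
```

At 51K x 8.5K, the difference between 8 bytes and 1 byte per value is the difference between ~3.5 GB and ~430 MB.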

Thanks

asked May 27 '16 by Baktaawar

1 Answer

UPDATE:

For 1 and 0 values you can use the int8 (1-byte) dtype, which will cut your memory usage at least 4-fold (compared to int32; 8-fold compared to int64):

(df_recommender > 0).astype(np.int8).to_pickle('/path/to/file.pickle')
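A quick sanity check of that one-liner (the data and file name below are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for df_recommender
df_recommender = pd.DataFrame(np.random.randint(0, 10, size=(5, 3)),
                              columns=list('abc'))

# Binarize, downcast to int8, and pickle in one step
(df_recommender > 0).astype(np.int8).to_pickle('df_preference.pickle')

# Round-trip: the int8 dtype survives pickling
df_back = pd.read_pickle('df_preference.pickle')
print(df_back.dtypes.unique())
```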

Here is an example with a 51K x 9K data frame:

In [1]: df = pd.DataFrame(np.random.randint(0, 10, size=(51000, 9000)))

In [2]: df.shape
Out[2]: (51000, 9000)

In [3]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int32(9000)
memory usage: 1.7 GB

the source DF needs 1.7 GB in memory

In [6]: df_preference = (df>0).astype(int)

In [7]: df_preference.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int32(9000)
memory usage: 1.7 GB

the resulting DF again needs 1.7 GB in memory - plain int maps to a fixed-width NumPy integer (int32 on this build), so nothing is saved

In [4]: df_preference = (df>0).astype(np.int8)

In [5]: df_preference.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int8(9000)
memory usage: 437.7 MB

with int8 dtype it takes only 438 MB
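Since the end goal is SVD/ALS on a mostly-zero matrix, a sparse representation can shrink memory further and feeds directly into a truncated SVD; a sketch assuming SciPy is available (the shape and density below are illustrative):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Dense 0/1 matrix standing in for df_preference.values (~5% nonzero)
dense = (np.random.rand(1000, 500) > 0.95).astype(np.int8)

# CSR format stores only the nonzero entries plus index arrays
sp = sparse.csr_matrix(dense)
sparse_bytes = sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes
print(dense.nbytes, sparse_bytes)

# Truncated SVD computed directly on the sparse matrix, no dense intermediate
u, s, vt = svds(sp.astype(np.float64), k=10)
print(u.shape, s.shape, vt.shape)
```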

now let's save it as a Pickle file:

In [10]: df_preference.to_pickle('d:/temp/df_pref.pickle')

file size:

{ temp }  » ls -lh df_pref.pickle
-rw-r--r-- 1 Max None 438M May 28 09:20 df_pref.pickle

OLD answer:

try this instead:

(df_recommender > 0).astype(int).to_pickle('/path/to/file.pickle')

Explanation:

In [200]: df
Out[200]:
   a  b  c
0  4  3  3
1  1  2  1
2  2  1  0
3  2  0  1
4  2  0  4

In [201]: (df>0).astype(int)
Out[201]:
   a  b  c
0  1  1  1
1  1  1  1
2  1  1  0
3  1  0  1
4  1  0  1

PS: you may also want to save your DF as an HDF5 file instead of Pickle - see this comparison for details

answered Oct 11 '22 by MaxU - stop WAR against UA