I have a 51K X 8.5K data frame with just binary (1 or 0) values.
I wrote the following code:
import pickle

outfile = open("df_preference.p", "wb")
pickle.dump(df_preference, outfile)
outfile.close()
It throws a MemoryError, as shown below:
MemoryError Traceback (most recent call last)
<ipython-input-48-de66e880aacb> in <module>()
2
3 outfile=open("df_preference.p", "wb")
----> 4 pickle.dump(df_preference,outfile)
5 outfile.close()
I assume this means the data is too large to pickle? But it only contains binary values.
Before this, I created this dataset from another data frame which had normal counts and a lot of zeros, using the following code:
df_preference=df_recommender.applymap(lambda x: np.where(x >0, 1, 0))
Creating df_preference this way itself took some time, even though the matrix is the same size.
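For reproduction, here is a minimal, self-contained sketch of this step with a small hypothetical df_recommender (the real one is 51K x 8.5K); a plain lambda is used in place of np.where, which behaves identically here:

```python
import pandas as pd

# Small stand-in for the real 51K x 8.5K df_recommender (hypothetical data)
df_recommender = pd.DataFrame({"a": [4, 1, 0], "b": [0, 2, 3]})

# Element-wise conversion of counts to a 1/0 preference matrix,
# as in the question; applymap is slow for large frames
df_preference = df_recommender.applymap(lambda x: 1 if x > 0 else 0)
print(df_preference)
```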
My concern is that if i) it takes this long to create a data frame using applymap, and ii) the data frame can't even be pickled due to a memory error, then the next steps will be worse: I need to do matrix factorization of df_preference using SVD and Alternating Least Squares. How can I tackle the slow run and solve the memory error?
Thanks
UPDATE:
For 1 and 0 values you can use the int8 (1-byte) dtype, which will reduce your memory usage by at least 4 times:
(df_recommender > 0).astype(np.int8).to_pickle('/path/to/file.pickle')
Here is an example with a 51K x 9K data frame:
In [1]: df = pd.DataFrame(np.random.randint(0, 10, size=(51000, 9000)))
In [2]: df.shape
Out[2]: (51000, 9000)
In [3]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int32(9000)
memory usage: 1.7 GB
the source DF needs 1.7 GB in memory
In [6]: df_preference = (df>0).astype(int)
In [7]: df_preference.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int32(9000)
memory usage: 1.7 GB
resulting DF again needs 1.7 GB in memory
In [4]: df_preference = (df>0).astype(np.int8)
In [5]: df_preference.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int8(9000)
memory usage: 437.7 MB
With the int8 dtype it takes only 438 MB.
Now let's save it as a Pickle file:
In [10]: df_preference.to_pickle('d:/temp/df_pref.pickle')
file size:
{ temp } » ls -lh df_pref.pickle
-rw-r--r-- 1 Max None 438M May 28 09:20 df_pref.pickle
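As a sanity check, the pickled frame can be read back with pd.read_pickle and the int8 dtype survives the round trip. A sketch on a small stand-in frame, assuming write access to a temporary directory:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small stand-in frame (the real one is 51K x 9K)
df = pd.DataFrame(np.random.randint(0, 10, size=(100, 50)))
df_preference = (df > 0).astype(np.int8)

# Round-trip through a pickle file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df_pref.pickle")
    df_preference.to_pickle(path)
    restored = pd.read_pickle(path)

# dtypes and values are preserved
print((restored.dtypes == np.int8).all())   # True
print(restored.equals(df_preference))       # True
```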
OLD answer:
try this instead:
(df_recommender > 0).astype(int).to_pickle('/path/to/file.pickle')
Explanation:
In [200]: df
Out[200]:
a b c
0 4 3 3
1 1 2 1
2 2 1 0
3 2 0 1
4 2 0 4
In [201]: (df>0).astype(int)
Out[201]:
a b c
0 1 1 1
1 1 1 1
2 1 1 0
3 1 0 1
4 1 0 1
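The session above can be reproduced as a self-contained script (same toy data):

```python
import pandas as pd

# The toy frame from the session above
df = pd.DataFrame({"a": [4, 1, 2, 2, 2],
                   "b": [3, 2, 1, 0, 0],
                   "c": [3, 1, 0, 1, 4]})

# Boolean mask of non-zero entries, cast back to integers
df_preference = (df > 0).astype(int)
print(df_preference)
```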
PS: you may also want to save your DF as an HDF5 file instead of Pickle - see this comparison for details.