Python Pandas to_pickle cannot pickle large dataframes

I have a dataframe "DF" with 500,000 rows. Here are the data types per column:

ID      int64
time    datetime64[ns]
data    object

Each entry in the "data" column is an array of shape [5, 500].
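
For reference, a scaled-down dataframe with the same schema can be built like this (the row count and random numbers are just stand-ins for the real data):

import numpy as np
import pandas as pd

# Toy stand-in for DF: same columns and dtypes, far fewer rows,
# random numbers in place of the real [5, 500] arrays.
n_rows = 1000
DF = pd.DataFrame({
    'ID': np.arange(n_rows, dtype='int64'),
    'time': pd.date_range('2015-01-01', periods=n_rows),
    'data': [np.random.rand(5, 500) for _ in range(n_rows)],
})
print(DF.dtypes)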

When I try to save this dataframe using

DF.to_pickle("my_filename.pkl")

it returns the following error:

     12     """
     13     with open(path, 'wb') as f:
---> 14         pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL) 

OSError: [Errno 22] Invalid argument

I also tried this method, but I get the same error:

import pickle


with open('my_filename.pkl', 'wb') as f:
    pickle.dump(DF, f)

I tried saving just 10 rows of this dataframe:

DF.head(10).to_pickle('test_save.pkl')

and there is no error at all. So it can save a small DF, but not the large one.

I am using Python 3 with IPython Notebook 3 on a Mac.

Please help me solve this problem. I really need to save this DF to a pickle file, and I cannot find a solution on the internet.

asked Apr 09 '15 by Joseph Roxas

2 Answers

Until there is a fix somewhere on the pickle/pandas side of things, I'd say a better option is to use an alternative IO backend. HDF is suitable for large datasets (GBs), so you don't need to add any extra split/combine logic.

df.to_hdf('my_filename.hdf','mydata',mode='w')

df = pd.read_hdf('my_filename.hdf','mydata')
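
To spell that out a bit more, a full round-trip might look like the sketch below. It assumes PyTables (the tables package) is installed; the file name and key are placeholders, and the keyword form key='mydata' is equivalent to the positional call above.

import pandas as pd

# Write DF to an HDF5 file under the key 'mydata'
# (mode='w' overwrites any existing file).
DF.to_hdf('my_filename.hdf', key='mydata', mode='w')

# Read it back later.
DF_restored = pd.read_hdf('my_filename.hdf', key='mydata')
assert len(DF_restored) == len(DF)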
answered Sep 18 '22 by volodymyr


Probably not the answer you were hoping for, but this is what I did...

Split the dataframe into smaller chunks using np.array_split (numpy functions are not guaranteed to work on dataframes, but this one does now, although there used to be a bug with it).

Then pickle the smaller dataframes.

When you unpickle them, use DataFrame.append or pandas.concat to glue everything back together (a sketch of the whole workflow is shown below).

I agree it is a fudge and suboptimal. If anyone can suggest a "proper" answer I'd be interested to see it, but I think it is as simple as dataframes not being meant to grow above a certain size.

Split a large pandas dataframe
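
A rough sketch of that workaround is below; the chunk count and file name pattern are arbitrary placeholder choices, and it assumes DF still fits in memory.

import numpy as np
import pandas as pd

def pickle_in_chunks(df, stem, n_chunks=20):
    # np.array_split copes with a row count that is not evenly divisible.
    for i, chunk in enumerate(np.array_split(df, n_chunks)):
        chunk.to_pickle('{}_{}.pkl'.format(stem, i))

def unpickle_chunks(stem, n_chunks=20):
    # Glue the pieces back together; ignore_index rebuilds a clean index.
    parts = [pd.read_pickle('{}_{}.pkl'.format(stem, i)) for i in range(n_chunks)]
    return pd.concat(parts, ignore_index=True)

pickle_in_chunks(DF, 'my_filename_part')
DF = unpickle_chunks('my_filename_part')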

answered Sep 22 '22 by Yupsiree