I have a dataframe "DF" with 500,000 rows. Here are the data types per column:
ID int64
time datetime64[ns]
data object
Each entry in the "data" column is an array with size = [5,500].
When I try to save this dataframe using
DF.to_pickle("my_filename.pkl")
it returns the following error:
12 """
13 with open(path, 'wb') as f:
---> 14 pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
OSError: [Errno 22] Invalid argument
I also tried this method, but I get the same error:
import pickle
with open('my_filename.pkl', 'wb') as f:
    pickle.dump(DF, f)
I then tried saving only 10 rows of this dataframe:
DF.head(10).to_pickle('test_save.pkl')
and I get no error at all. So it can save a small DF but not a large one.
I am using Python 3 and IPython Notebook 3 on a Mac.
Please help me solve this problem. I really need to save this DF to a pickle file, and I cannot find a solution on the internet.
Pandas DataFrame: to_pickle() function. The to_pickle() function is used to pickle (serialize) an object to a file. It takes the file path where the pickled object will be stored and a string representing the compression to use in the output file. By default, the compression is inferred from the file extension of the specified path.
On write speeds, PICKLE was 30x faster than CSV, MSGPACK and PARQUET were 10X faster, JSON/HDF about the same as CSV. On storage space, GZIPPED PARQUET gave 40X reduction, GZIPPED CSV gave 10X reduction (didn't compare the rest)
Pickle is around 11 times faster this time, when not compressed. Compression is a huge pain point when reading and saving files. But let's see how much disk space it saves. The file size decrease compared to CSV is significant, but the compression doesn't save that much disk space in this case.
Until there is a fix somewhere on the pickle/pandas side of things, I'd say a better option is to use an alternative IO backend. HDF is suitable for large datasets (GBs), so you don't need to add any split/combine logic.
import pandas as pd
df.to_hdf('my_filename.hdf', key='mydata', mode='w')  # needs the PyTables package
df = pd.read_hdf('my_filename.hdf', key='mydata')
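If a single pickle file is still wanted, one possible workaround is to serialize to bytes first and write them in several smaller calls. This is only a sketch: it assumes the error comes from a single write of more than about 2 GB, which some Python 3 versions on macOS could not handle. The names dump_in_chunks, load_in_chunks and MAX_BYTES are made up for illustration, and pickle.dumps holds the whole serialized object in memory, so it needs enough RAM for a second copy of the data.

```python
import pickle

# Hypothetical limit: stay just below 2 GiB per read/write call.
MAX_BYTES = 2**31 - 1

def dump_in_chunks(obj, path, chunk_size=MAX_BYTES):
    """Pickle obj to path, writing at most chunk_size bytes per call."""
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, 'wb') as f:
        for start in range(0, len(data), chunk_size):
            f.write(data[start:start + chunk_size])

def load_in_chunks(path, chunk_size=MAX_BYTES):
    """Read path at most chunk_size bytes per call, then unpickle."""
    buf = bytearray()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
    return pickle.loads(bytes(buf))
```

With the question's dataframe this would be called as dump_in_chunks(DF, 'my_filename.pkl') and DF = load_in_chunks('my_filename.pkl').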
Probably not the answer you were hoping for, but this is what I did:
Split the dataframe into smaller chunks using np.array_split (NumPy functions are not guaranteed to work on dataframes; this one does now, although there used to be a bug with it).
Then pickle the smaller dataframes.
When you unpickle them, use pandas.concat to glue everything back together (DataFrame.append also worked in older pandas, but it was removed in pandas 2.0).
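The three steps above could be sketched roughly as follows. The helper names save_in_parts and load_parts, the file-name pattern and the number of parts are all made up for illustration. The sketch applies np.array_split to the row positions rather than to the dataframe itself, which sidesteps the question of whether NumPy functions accept dataframes.

```python
import numpy as np
import pandas as pd

def save_in_parts(df, prefix, n_parts):
    """Pickle df as n_parts smaller files and return their paths."""
    paths = []
    # Split the row positions 0..len(df)-1 into n_parts nearly equal groups.
    for i, rows in enumerate(np.array_split(np.arange(len(df)), n_parts)):
        path = f"{prefix}_{i:04d}.pkl"  # hypothetical naming scheme
        df.iloc[rows].to_pickle(path)
        paths.append(path)
    return paths

def load_parts(paths):
    """Unpickle the parts in order and glue them back together."""
    return pd.concat([pd.read_pickle(p) for p in paths])
```

For example, paths = save_in_parts(DF, 'my_filename', 20) followed later by DF = load_parts(paths). The number of parts should be chosen so that each pickled chunk stays comfortably below the size at which saving fails.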
I agree it is a fudge and suboptimal. If anyone can suggest a "proper" answer I'd be interested in seeing it, but I think it is as simple as this: dataframes are not supposed to get above a certain size.
Split a large pandas dataframe