Which is better: the Pandas built-in method or pickle.dump?
The standard pickle method looks like this:
pickle.dump(my_dataframe, open('test_pickle.p', 'wb'))
The Pandas built-in method looks like this:
my_dataframe.to_pickle('test_pickle.p')
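For comparison, here is a minimal sketch of both calls side by side. The sample DataFrame, the temporary directory, and the file names are made up for illustration; the `with` block is just a tidier way to make sure the file handle gets closed.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical sample data standing in for my_dataframe.
my_dataframe = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Write into a temporary directory so nothing is left in the working directory.
tmp_dir = tempfile.mkdtemp()
std_path = os.path.join(tmp_dir, "test_pickle_std.p")
pd_path = os.path.join(tmp_dir, "test_pickle_pd.p")

# Standard pickle method: the context manager closes the file handle.
with open(std_path, "wb") as f:
    pickle.dump(my_dataframe, f)

# Pandas built-in method: it opens and closes the file itself.
my_dataframe.to_pickle(pd_path)

# Each file can be read back with the other library's loader.
with open(pd_path, "rb") as f:
    from_pd = pickle.load(f)
from_std = pd.read_pickle(std_path)

print(from_pd.equals(my_dataframe), from_std.equals(my_dataframe))
```

Both files are ordinary pickles, so in this uncompressed case the two approaches are interchangeable on the reading side as well.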
Pandas DataFrame: to_pickle() function. The to_pickle() function is used to pickle (serialize) an object to a file. Its parameters include the file path where the pickled object will be stored and a string representing the compression to use in the output file.
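A short sketch of the two parameters mentioned above, assuming gzip as the compression choice. The DataFrame contents and the file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"value": range(1000)})

# First argument: the file path where the pickled object is stored.
path = os.path.join(tempfile.mkdtemp(), "test_pickle.p.gz")

# compression: a string naming the compression to use in the output file.
df.to_pickle(path, compression="gzip")

# Read it back with the matching compression setting.
restored = pd.read_pickle(path, compression="gzip")
print(restored.equals(df))
```

A file written this way is gzip-compressed, so a plain `pickle.load` on the raw file will not work; read it with `pd.read_pickle` or decompress it first.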
There's a better way. It's called PyArrow, a Python binding for the Apache Arrow project. It can offer faster data read/write times and doesn't otherwise interfere with your data analysis pipeline, since you can still use Pandas for further calculations.
Pickle: Pickle is Python's native serialization format and is popular for object serialization. Its advantage is that it can serialize almost any Python object, not just tabular data. It is typically much faster to read and write than CSV, and with compression it can reduce the file size to almost half that of the equivalent CSV file.
Thanks to @qwwqwwq I discovered that pandas has a built-in to_pickle method for dataframes. I did a quick time test:
In [1]: %timeit pickle.dump(df, open('test_pickle.p', 'wb'))
10 loops, best of 3: 91.8 ms per loop
In [2]: %timeit df.to_pickle('testpickle.p')
10 loops, best of 3: 88 ms per loop
So it seems that the built-in is only narrowly better (to me, this is useful because it means it's probably not worth refactoring code to use the built-in) - hope this helps someone!
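If you want to repeat the comparison outside IPython, something like the following sketch with the standard `timeit` module should work. The DataFrame shape, the file names, and the loop counts are assumptions, so the numbers will not match the ones above.

```python
import os
import pickle
import tempfile
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 10,000 rows by 10 columns of random floats.
df = pd.DataFrame(np.random.default_rng(0).random((10_000, 10)))

tmp_dir = tempfile.mkdtemp()
std_path = os.path.join(tmp_dir, "test_pickle_std.p")
pd_path = os.path.join(tmp_dir, "test_pickle_pd.p")


def dump_with_pickle():
    # Standard library call, with the file handle closed after each write.
    with open(std_path, "wb") as f:
        pickle.dump(df, f)


def dump_with_pandas():
    # Pandas built-in method.
    df.to_pickle(pd_path)


# Best of 3 runs of 10 calls each, reported per call, similar to %timeit.
loops = 10
std_time = min(timeit.repeat(dump_with_pickle, number=loops, repeat=3)) / loops
pd_time = min(timeit.repeat(dump_with_pandas, number=loops, repeat=3)) / loops

print(f"pickle.dump:  {std_time * 1e3:.2f} ms per loop")
print(f"df.to_pickle: {pd_time * 1e3:.2f} ms per loop")
```

Results will vary with the data, the disk, and the pandas version, so treat any small gap between the two as noise unless it shows up consistently.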