Given a 1.5 GB list of pandas dataframes, which format is fastest for loading compressed data: pickle (via cPickle), hdf5, or something else in Python?
Pickle is both slower and produces larger serialized values than most of the alternatives, which makes it the clear underperformer here. Even the 'cPickle' extension, which is written in C, has a serialization rate about a quarter that of JSON or Thrift.
(a) Categorical Features as Strings
An interesting observation here is that HDF loads even more slowly than CSV, while the other binary formats perform noticeably better.
Cons-1: Pickle is Unsafe
Unlike JSON, which is plain text, pickle data can be constructed maliciously so that it executes arbitrary code during unpickling. Therefore, we should NEVER unpickle data that could have come from an untrusted source, or that could have been tampered with.
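To make the warning concrete, here is a minimal and deliberately harmless sketch of how unpickling ends up calling a function chosen by whoever produced the bytes. The `Payload` class is a hypothetical name used only for illustration, and `os.getcwd` stands in for something dangerous such as `os.system`.

```python
import os
import pickle


class Payload:
    """Hypothetical stand-in for a malicious object; the name is illustrative."""

    def __reduce__(self):
        # pickle records "call os.getcwd with no arguments" in the byte stream.
        # An attacker would put os.system and a shell command here instead.
        return (os.getcwd, ())


blob = pickle.dumps(Payload())

# Loading the bytes calls os.getcwd() and returns its result;
# no Payload instance is ever reconstructed.
result = pickle.loads(blob)
print(type(result).__name__, result == os.getcwd())
```

The point is that `pickle.loads` runs the callable stored in the data, so the only safe input is data you produced yourself or otherwise fully trust.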
By default, the pickle data format uses a relatively compact binary representation. If you need optimal size characteristics, you can efficiently compress pickled data.
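As a rough illustration of compressing pickled data, the sketch below uses only the standard library (`pickle` plus `gzip`). The `records` list is a made-up stand-in for real data, and the size reduction you get will depend heavily on how repetitive your own data is.

```python
import gzip
import pickle

# Hypothetical stand-in for real data: repetitive records compress well.
records = [{"id": i, "label": "category_a", "value": i * 0.5} for i in range(10_000)]

# Serialize with the newest protocol available, then compress the bytes.
raw = pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)
packed = gzip.compress(raw)

# Reading back is the reverse: decompress first, then unpickle.
restored = pickle.loads(gzip.decompress(packed))

print(len(raw), len(packed))
```

For dataframes, pandas' `DataFrame.to_pickle` and `pandas.read_pickle` accept a `compression` argument that does the same thing in one call. Keep in mind that compression usually trades smaller files for slower reads and writes.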
UPDATE: nowadays I would choose between Parquet, Feather (Apache Arrow), HDF5 and Pickle.
OLD Answer:
I would consider only two storage formats: HDF5 (PyTables) and Feather.
Here are the results of my read and write comparison for a DataFrame (shape: 4000000 x 6, size in memory: 183.1 MB, size of uncompressed CSV: 492 MB).
Comparison for the following storage formats: CSV, CSV.gzip, Pickle, HDF5 (various compression):
storage           read_s  write_s  size_ratio_to_CSV
CSV               17.900    69.00              1.000
CSV.gzip          18.900   186.00              0.047
Pickle             0.173     1.77              0.374
HDF_fixed          0.196     2.03              0.435
HDF_tab            0.230     2.60              0.437
HDF_tab_zlib_c5    0.845     5.44              0.035
HDF_tab_zlib_c9    0.860     5.95              0.035
HDF_tab_bzip2_c5   2.500    36.50              0.011
HDF_tab_bzip2_c9   2.500    36.50              0.011
But it might be different for you, because all my data was of the datetime dtype, so it's always better to make such a comparison with your real data, or at least with similar data.
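If you want to run this kind of comparison on your own data, a minimal timing harness could look like the sketch below. This is not the script that produced the numbers above. The helper `time_round_trip`, the random sample frame, and the choice of only CSV and Pickle are my assumptions for illustration; other formats can be added to the `formats` dict if the required libraries are installed.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_round_trip(df, write, read, path):
    """Return (write seconds, read seconds, bytes on disk) for one format."""
    start = time.perf_counter()
    write(df, path)
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    loaded = read(path)
    read_s = time.perf_counter() - start

    # Sanity check that the file really round-trips the whole frame.
    assert loaded.shape == df.shape
    return write_s, read_s, os.path.getsize(path)


# Hypothetical sample frame; replace it with (a slice of) your real data.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100_000, 6)), columns=list("abcdef"))

# Each entry maps a label to a (writer, reader) pair.
formats = {
    "CSV": (lambda d, p: d.to_csv(p, index=False), pd.read_csv),
    "Pickle": (lambda d, p: d.to_pickle(p), pd.read_pickle),
}

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, (write, read) in formats.items():
        path = os.path.join(tmp, name.lower())
        results[name] = time_round_trip(df, write, read, path)

for name, (write_s, read_s, size) in results.items():
    print(f"{name:8s} write={write_s:.3f}s read={read_s:.3f}s size={size / 1e6:.1f} MB")
```

A single run like this is noisy, so repeat each measurement a few times and be aware that the operating system's file cache can make reads look faster than they would be from a cold disk.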