 

Which is faster to load: pickle or HDF5 in Python?

Given a 1.5 GB list of pandas DataFrames, which format is fastest for loading compressed data: pickle (via cPickle), HDF5, or something else in Python?

  • I only care about fastest speed to load the data into memory
  • I don't care about dumping the data, it's slow but I only do this once.
  • I don't care about file size on disk
jesperk.eth asked Jun 20 '16


1 Answer

UPDATE: nowadays I would choose between Parquet, Feather (Apache Arrow), HDF5 and Pickle.

Pros and cons:

  • Parquet
    • pros
      • one of the fastest and widely supported binary storage formats
      • supports very fast compression methods (for example Snappy codec)
      • de-facto standard storage format for Data Lakes / BigData
    • cons
      • the whole dataset must be read into memory; you can't load just a subset. One way around this is to partition the data and read only the required partitions.
      • no support for indexing: you can't read a specific row or a range of rows, so you always have to read the whole Parquet file
      • Parquet files are immutable: you can't change them (no append, update or delete); you can only write or overwrite a whole file. In the BigData world this "limitation" would actually count as one of the huge "pros".
  • HDF5
    • pros
      • supports data slicing - ability to read a portion of the whole dataset (we can work with datasets that wouldn't fit completely into RAM).
      • relatively fast binary storage format
      • supports compression (though the compression is slower than Parquet's Snappy codec)
      • supports appending rows (mutable)
    • cons
      • risk of data corruption
  • Pickle
    • pros
      • very fast
    • cons
      • takes up a lot of disk space
      • possible compatibility problems for long-term storage: you might need to pin the pickle protocol version so that old files remain readable
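The pickle trade-off above (very fast, but protocol-sensitive) can be illustrated with a small standard-library sketch; the list of tuples is a toy stand-in for real dataframes, and timings will of course vary by machine:

```python
import pickle
import time

# Toy stand-in for a big object; a real test should use your actual dataframes.
data = [(i, float(i), str(i)) for i in range(200_000)]

# Pinning the protocol explicitly is the usual defence against the
# compatibility issue above: newer protocols serialize and load faster,
# but older Python versions cannot read them.
blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

start = time.perf_counter()
restored = pickle.loads(blob)
elapsed = time.perf_counter() - start

print(f"loaded {len(restored):,} rows in {elapsed:.3f}s "
      f"from a {len(blob) / 1e6:.1f} MB blob")
```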

OLD Answer:

I would consider only two storage formats: HDF5 (PyTables) and Feather

Here are the results of my read and write comparison for a DataFrame (shape: 4,000,000 x 6, size in memory: 183.1 MB, size of uncompressed CSV: 492 MB).

Comparison for the following storage formats: (CSV, CSV.gzip, Pickle, HDF5 [various compression]):

                      read_s  write_s  size_ratio_to_CSV
    storage
    CSV               17.900    69.00              1.000
    CSV.gzip          18.900   186.00              0.047
    Pickle             0.173     1.77              0.374
    HDF_fixed          0.196     2.03              0.435
    HDF_tab            0.230     2.60              0.437
    HDF_tab_zlib_c5    0.845     5.44              0.035
    HDF_tab_zlib_c9    0.860     5.95              0.035
    HDF_tab_bzip2_c5   2.500    36.50              0.011
    HDF_tab_bzip2_c9   2.500    36.50              0.011

But your results might differ, because all my data was of the datetime dtype, so it's always better to run such a comparison with your real data, or at least with similar data...
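A minimal harness for running such a comparison on your own data could look like the sketch below. It uses only pandas-core formats (CSV and pickle); the HDF5/Parquet/Feather writers (`to_hdf`, `to_parquet`, `to_feather`) plug into the same dictionary if PyTables or pyarrow are installed. The frame here is synthetic — swap in your real data for meaningful numbers:

```python
import tempfile
import time
from pathlib import Path

import pandas as pd

def time_roundtrip(df, path, write, read):
    """Time one write and one read of df at path; return (write_s, read_s, bytes)."""
    t0 = time.perf_counter()
    write(df, path)
    write_s = time.perf_counter() - t0
    t0 = time.perf_counter()
    read(path)
    read_s = time.perf_counter() - t0
    return write_s, read_s, path.stat().st_size

# Small synthetic frame; replace with your real DataFrame.
df = pd.DataFrame({"a": range(100_000), "b": [str(i) for i in range(100_000)]})

formats = {
    "csv": (lambda d, p: d.to_csv(p, index=False), pd.read_csv),
    "pkl": (lambda d, p: d.to_pickle(p), pd.read_pickle),
}

with tempfile.TemporaryDirectory() as tmp:
    for name, (write, read) in formats.items():
        w, r, size = time_roundtrip(df, Path(tmp) / f"data.{name}", write, read)
        print(f"{name}: write {w:.3f}s  read {r:.3f}s  size {size / 1e6:.2f} MB")
```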

MaxU - stop WAR against UA answered Sep 26 '22