 

Which is faster to load: pickle or HDF5 in Python?

Given a 1.5 GB list of pandas DataFrames, which format is fastest for loading compressed data: pickle (via cPickle), HDF5, or something else in Python?

  • I only care about fastest speed to load the data into memory
  • I don't care about dumping the data, it's slow but I only do this once.
  • I don't care about file size on disk
jesperk.eth asked Jun 20 '16


1 Answer

UPDATE: nowadays I would choose between Parquet, Feather (Apache Arrow), HDF5 and Pickle.

Pros and cons:

  • Parquet
    • pros
      • one of the fastest and widely supported binary storage formats
      • supports very fast compression methods (for example Snappy codec)
      • de-facto standard storage format for Data Lakes / BigData
    • cons
      • the whole dataset must be read into memory; you can't load just a subset. One way around this is to partition the data and read only the required partitions.
      • no support for indexing: you can't read a specific row or a range of rows, so you always have to read the whole Parquet file
      • Parquet files are immutable: you can't change them (no append, update or delete); you can only write or overwrite a whole file. In the BigData world this "limitation" would actually count as one of the huge "pros".
  • HDF5
    • pros
      • supports data slicing - ability to read a portion of the whole dataset (we can work with datasets that wouldn't fit completely into RAM).
      • relatively fast binary storage format
      • supports compression (though the compression is slower than Parquet's Snappy codec)
      • supports appending rows (mutable)
    • cons
      • risk of data corruption
  • Pickle
    • pros
      • very fast
    • cons
      • takes up a lot of disk space
      • possible compatibility problems for long-term storage: you might need to pin the pickle protocol version so that old files remain readable
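The pickle trade-off above (very fast, but protocol-sensitive) can be illustrated with a small standard-library sketch; the list of tuples is a toy stand-in for real dataframes, and timings will of course vary by machine:

```python
import pickle
import time

# Toy stand-in for a big object; a real test should use your actual dataframes.
data = [(i, float(i), str(i)) for i in range(200_000)]

# Pinning the protocol explicitly is the usual defence against the
# compatibility issue above: newer protocols serialize and load faster,
# but older Python versions cannot read them.
blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

start = time.perf_counter()
restored = pickle.loads(blob)
elapsed = time.perf_counter() - start

print(f"loaded {len(restored):,} rows in {elapsed:.3f}s "
      f"from a {len(blob) / 1e6:.1f} MB blob")
```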

OLD Answer:

I would consider only two storage formats: HDF5 (PyTables) and Feather

Here are the results of my read and write comparison for a DataFrame (shape: 4,000,000 x 6, size in memory: 183.1 MB, size of uncompressed CSV: 492 MB).

Comparison for the following storage formats: (CSV, CSV.gzip, Pickle, HDF5 [various compression]):

                      read_s  write_s  size_ratio_to_CSV
    storage
    CSV               17.900    69.00              1.000
    CSV.gzip          18.900   186.00              0.047
    Pickle             0.173     1.77              0.374
    HDF_fixed          0.196     2.03              0.435
    HDF_tab            0.230     2.60              0.437
    HDF_tab_zlib_c5    0.845     5.44              0.035
    HDF_tab_zlib_c9    0.860     5.95              0.035
    HDF_tab_bzip2_c5   2.500    36.50              0.011
    HDF_tab_bzip2_c9   2.500    36.50              0.011

But your results might differ, because all my data was of the datetime dtype, so it's always better to run such a comparison with your real data, or at least with similar data...
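A minimal harness for running such a comparison on your own data could look like the sketch below. It uses only pandas-core formats (CSV and pickle); the HDF5/Parquet/Feather writers (`to_hdf`, `to_parquet`, `to_feather`) plug into the same dictionary if PyTables or pyarrow are installed. The frame here is synthetic — swap in your real data for meaningful numbers:

```python
import tempfile
import time
from pathlib import Path

import pandas as pd

def time_roundtrip(df, path, write, read):
    """Time one write and one read of df at path; return (write_s, read_s, bytes)."""
    t0 = time.perf_counter()
    write(df, path)
    write_s = time.perf_counter() - t0
    t0 = time.perf_counter()
    read(path)
    read_s = time.perf_counter() - t0
    return write_s, read_s, path.stat().st_size

# Small synthetic frame; replace with your real DataFrame.
df = pd.DataFrame({"a": range(100_000), "b": [str(i) for i in range(100_000)]})

formats = {
    "csv": (lambda d, p: d.to_csv(p, index=False), pd.read_csv),
    "pkl": (lambda d, p: d.to_pickle(p), pd.read_pickle),
}

with tempfile.TemporaryDirectory() as tmp:
    for name, (write, read) in formats.items():
        w, r, size = time_roundtrip(df, Path(tmp) / f"data.{name}", write, read)
        print(f"{name}: write {w:.3f}s  read {r:.3f}s  size {size / 1e6:.2f} MB")
```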

MaxU - stop WAR against UA answered Sep 26 '22