I have read several times that turning on compression in HDF5 can lead to better read/write performance.
I wonder what the ideal settings are to achieve good read/write performance with:
data_df.to_hdf(..., format='fixed', complib=..., complevel=..., chunksize=...)
I'm already using the `fixed` format (i.e. the PyTables fixed layout) as it's faster than `table`. I have strong processors and do not care much about disk space. I often store `DataFrame`s of `float64` and `str` types in files of approx. 2500 rows x 9000 columns.
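For concreteness, a minimal sketch of the kind of write I mean (the file name, key and the chosen `complib`/`complevel` are placeholders, not settings I'm committed to):

```python
import numpy as np
import pandas as pd

# Roughly the shape described above: 2500 rows x 9000 columns of float64
# (the str columns are omitted here to keep the sketch short).
data_df = pd.DataFrame(
    np.random.rand(2500, 9000),
    columns=[f"col_{i}" for i in range(9000)],
)

# Placeholder settings -- complib and complevel are exactly the knobs I want to tune.
data_df.to_hdf("data.h5", key="data", format="fixed", complib="blosc", complevel=9)
```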
The HDF5 file format and library provide flexibility to use a variety of data compression filters on individual datasets in an HDF5 file. Compressed data is stored in chunks and automatically uncompressed by the library and filter plugin when a chunk is accessed.
In benchmarks of averaged I/O times for each data format, an interesting observation is that hdf can show even slower loading speed than csv, while other binary formats perform noticeably better; the two most impressive are feather and parquet.
HDF5 also supports lossless compression of datasets.
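As an illustration, here is a minimal h5py sketch (the file and dataset names are arbitrary) of writing a dataset with the built-in gzip filter and reading part of it back; the library decompresses the touched chunks transparently:

```python
import h5py
import numpy as np

data = np.random.rand(2500, 9000)

with h5py.File("compressed.h5", "w") as f:
    # gzip is one of the built-in lossless filters; compression_opts is the level (0-9).
    # Enabling compression implies chunked storage; h5py picks a chunk shape
    # automatically unless you pass chunks= yourself.
    f.create_dataset("data", data=data, compression="gzip", compression_opts=4)

with h5py.File("compressed.h5", "r") as f:
    first_row = f["data"][0, :]  # only the chunks overlapping this slice are decompressed
```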
Chunked storage is what makes this work in HDF5: it lets you specify the N-dimensional “shape” that best fits your access pattern. When the time comes to write data to disk, HDF5 splits the data into “chunks” of the specified shape, flattens them, and writes them to disk.
There are several compression filters you could use, and since HDF5 version 1.8.11 you can easily register third-party compression filters.
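If you want those third-party filters from Python/h5py, one option (my suggestion, not something pandas requires) is the hdf5plugin package, which registers Blosc, Zstd, LZ4 and others with the HDF5 library; roughly, assuming I remember its API correctly:

```python
import h5py
import hdf5plugin  # importing it registers the extra filters with libhdf5
import numpy as np

data = np.random.rand(2500, 9000)

with h5py.File("blosc.h5", "w") as f:
    f.create_dataset(
        "data",
        data=data,
        # Blosc with the lz4 codec and byte-shuffle; cname/clevel/shuffle are all tunable.
        **hdf5plugin.Blosc(cname="lz4", clevel=5, shuffle=hdf5plugin.Blosc.SHUFFLE),
    )
```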
The ideal settings probably depend on your access pattern: you want to define chunk dimensions that align well with how you read the data, otherwise performance will suffer a lot. For example, if you know that you usually access one column and all rows, define your chunk shape accordingly, e.g. (2500, 1) for a 2500 x 9000 (rows x columns) array. See here, here and here for more details.
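A sketch of what that could look like in h5py, assuming the data is laid out as a 2500 x 9000 (rows x columns) array:

```python
import h5py
import numpy as np

data = np.random.rand(2500, 9000)

with h5py.File("chunked.h5", "w") as f:
    # One chunk per column: reading a single column touches exactly one chunk,
    # so only that chunk has to be read from disk and decompressed.
    f.create_dataset("data", data=data, chunks=(2500, 1), compression="gzip")

with h5py.File("chunked.h5", "r") as f:
    column_42 = f["data"][:, 42]  # hits a single chunk
```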
However, AFAIK pandas will usually end up loading the entire HDF5 file into memory unless you use `read_hdf` with a `chunksize`/iterator (see here), which requires the `table` format, or do the partial I/O yourself (see here), and thus it doesn't really benefit that much from defining a good chunk size.
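For reference, a sketch of the iterator-style partial read; as far as I know it requires the slower `table` format, and the per-chunk work here is just a stand-in:

```python
import numpy as np
import pandas as pd

data_df = pd.DataFrame(np.random.rand(2500, 9000))
data_df.columns = data_df.columns.astype(str)

# Partial I/O needs the 'table' format; 'fixed' is read back in one piece.
data_df.to_hdf("table.h5", key="data", format="table", complib="blosc", complevel=9)

# Iterate over row chunks instead of loading the whole DataFrame at once.
for chunk in pd.read_hdf("table.h5", key="data", chunksize=500):
    print(chunk.shape)  # stand-in for whatever per-chunk processing you need
```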
Nevertheless, you might still benefit from compression, because loading the compressed data into memory and decompressing it on the CPU is probably faster than loading the uncompressed data.
I would recommend taking a look at Blosc. It is a multi-threaded meta-compressor library that supports various compression codecs (in pandas these are spelled `blosc:blosclz`, `blosc:lz4`, `blosc:lz4hc`, `blosc:snappy`, `blosc:zlib` and `blosc:zstd`). These have different strengths, and the best thing is to try and benchmark them with your data and see which works best.
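A small benchmarking sketch along those lines; the `complib` spellings are the ones pandas documents, and random data compresses poorly, so run it on data that resembles yours:

```python
import time

import numpy as np
import pandas as pd

data_df = pd.DataFrame(np.random.rand(2500, 9000))
data_df.columns = data_df.columns.astype(str)

# Candidate codecs; adjust to whatever your pandas/PyTables build supports.
candidates = ["zlib", "blosc", "blosc:lz4", "blosc:lz4hc", "blosc:zstd"]

for complib in candidates:
    path = f"bench_{complib.replace(':', '_')}.h5"

    start = time.perf_counter()
    data_df.to_hdf(path, key="data", format="fixed", complib=complib, complevel=9)
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    pd.read_hdf(path, key="data")
    read_s = time.perf_counter() - start

    print(f"{complib:12s} write {write_s:.2f}s  read {read_s:.2f}s")
```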