I have read several times that turning on compression in HDF5 can lead to better read/write performance.
I wonder what the ideal settings are to achieve good read/write performance with:
data_df.to_hdf(..., format='fixed', complib=..., complevel=..., chunksize=...)
I'm already using the `fixed` format (i.e. the PyTables fixed layout) as it's faster than `table`. I have strong processors and do not care much about disk space. I often store `DataFrame`s of `float64` and `str` types in files of approx. 2500 rows x 9000 columns.
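For concreteness, a minimal sketch of the kind of write I mean (the file name, key and the chosen `complib`/`complevel` are placeholders, not settings I'm committed to):

```python
import numpy as np
import pandas as pd

# Roughly the shape described above: 2500 rows x 9000 columns of float64
# (the str columns are omitted here to keep the sketch short).
data_df = pd.DataFrame(
    np.random.rand(2500, 9000),
    columns=[f"col_{i}" for i in range(9000)],
)

# Placeholder settings -- complib and complevel are exactly the knobs I want to tune.
data_df.to_hdf("data.h5", key="data", format="fixed", complib="blosc", complevel=9)
```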
The HDF5 file format and library provide flexibility to use a variety of data compression filters on individual datasets in an HDF5 file. Compressed data is stored in chunks and automatically uncompressed by the library and filter plugin when a chunk is accessed.
In benchmarks of averaged I/O times for each data format, an interesting observation is that hdf can show even slower loading speed than csv, while other binary formats perform noticeably better; the two most impressive are feather and parquet.
HDF5 also supports lossless compression of datasets.
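As an illustration, here is a minimal h5py sketch (the file and dataset names are arbitrary) of writing a dataset with the built-in gzip filter and reading part of it back; the library decompresses the touched chunks transparently:

```python
import h5py
import numpy as np

data = np.random.rand(2500, 9000)

with h5py.File("compressed.h5", "w") as f:
    # gzip is one of the built-in lossless filters; compression_opts is the level (0-9).
    # Enabling compression implies chunked storage; h5py picks a chunk shape
    # automatically unless you pass chunks= yourself.
    f.create_dataset("data", data=data, compression="gzip", compression_opts=4)

with h5py.File("compressed.h5", "r") as f:
    first_row = f["data"][0, :]  # only the chunks overlapping this slice are decompressed
```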
Chunked storage is what makes this work in HDF5: it lets you specify the N-dimensional “shape” that best fits your access pattern. When the time comes to write data to disk, HDF5 splits the data into “chunks” of the specified shape, flattens them, and writes them to disk.
There are several compression filters you could use, and since HDF5 version 1.8.11 you can easily register third-party compression filters.
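If you want those third-party filters from Python/h5py, one option (my suggestion, not something pandas requires) is the hdf5plugin package, which registers Blosc, Zstd, LZ4 and others with the HDF5 library; roughly, assuming I remember its API correctly:

```python
import h5py
import hdf5plugin  # importing it registers the extra filters with libhdf5
import numpy as np

data = np.random.rand(2500, 9000)

with h5py.File("blosc.h5", "w") as f:
    f.create_dataset(
        "data",
        data=data,
        # Blosc with the lz4 codec and byte-shuffle; cname/clevel/shuffle are all tunable.
        **hdf5plugin.Blosc(cname="lz4", clevel=5, shuffle=hdf5plugin.Blosc.SHUFFLE),
    )
```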
The ideal settings probably depend on your access pattern: you want to define chunk dimensions that align well with how you read the data, otherwise performance will suffer a lot. For example, if you know that you usually access one column and all rows, define your chunk shape accordingly, e.g. (2500, 1) for a 2500 x 9000 (rows x columns) array. See here, here and here for more details.
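A sketch of what that could look like in h5py, assuming the data is laid out as a 2500 x 9000 (rows x columns) array:

```python
import h5py
import numpy as np

data = np.random.rand(2500, 9000)

with h5py.File("chunked.h5", "w") as f:
    # One chunk per column: reading a single column touches exactly one chunk,
    # so only that chunk has to be read from disk and decompressed.
    f.create_dataset("data", data=data, chunks=(2500, 1), compression="gzip")

with h5py.File("chunked.h5", "r") as f:
    column_42 = f["data"][:, 42]  # hits a single chunk
```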
However, AFAIK pandas will usually end up loading the entire HDF5 file into memory unless you use `read_hdf` with a `chunksize`/iterator (see here), which requires the `table` format, or do the partial I/O yourself (see here), and thus it doesn't really benefit that much from defining a good chunk size.
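For reference, a sketch of the iterator-style partial read; as far as I know it requires the slower `table` format, and the per-chunk work here is just a stand-in:

```python
import numpy as np
import pandas as pd

data_df = pd.DataFrame(np.random.rand(2500, 9000))
data_df.columns = data_df.columns.astype(str)

# Partial I/O needs the 'table' format; 'fixed' is read back in one piece.
data_df.to_hdf("table.h5", key="data", format="table", complib="blosc", complevel=9)

# Iterate over row chunks instead of loading the whole DataFrame at once.
for chunk in pd.read_hdf("table.h5", key="data", chunksize=500):
    print(chunk.shape)  # stand-in for whatever per-chunk processing you need
```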
Nevertheless, you might still benefit from compression, because loading the compressed data into memory and decompressing it on the CPU is probably faster than loading the uncompressed data.
I would recommend taking a look at Blosc. It is a multi-threaded meta-compressor library that supports various compression codecs (in pandas these are spelled `blosc:blosclz`, `blosc:lz4`, `blosc:lz4hc`, `blosc:snappy`, `blosc:zlib` and `blosc:zstd`). These have different strengths, and the best thing is to try and benchmark them with your data and see which works best.
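A small benchmarking sketch along those lines; the `complib` spellings are the ones pandas documents, and random data compresses poorly, so run it on data that resembles yours:

```python
import time

import numpy as np
import pandas as pd

data_df = pd.DataFrame(np.random.rand(2500, 9000))
data_df.columns = data_df.columns.astype(str)

# Candidate codecs; adjust to whatever your pandas/PyTables build supports.
candidates = ["zlib", "blosc", "blosc:lz4", "blosc:lz4hc", "blosc:zstd"]

for complib in candidates:
    path = f"bench_{complib.replace(':', '_')}.h5"

    start = time.perf_counter()
    data_df.to_hdf(path, key="data", format="fixed", complib=complib, complevel=9)
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    pd.read_hdf(path, key="data")
    read_s = time.perf_counter() - start

    print(f"{complib:12s} write {write_s:.2f}s  read {read_s:.2f}s")
```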