 

What is the recommended compression for HDF5 for fast read/write performance (in Python/pandas)?

I have read several times that turning on compression in HDF5 can lead to better read/write performance.

I wonder what ideal settings can be to achieve good read/write performance at:

 data_df.to_hdf(..., format='fixed', complib=..., complevel=..., chunksize=...)

I'm already using fixed format (i.e. h5py) as it's faster than table. I have strong processors and do not care much about disk space.

I often store DataFrames of float64 and str types in files of approx. 2500 rows x 9000 columns.
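For reference, here is a stripped-down sketch of what I'm doing, with made-up data of roughly that shape and placeholder compression settings (the blosc/complevel values are just to show the shape of the call, not what I've settled on):

```python
import numpy as np
import pandas as pd

# Illustrative data of roughly the shape described above:
# 2500 rows x 9000 float64 columns (string columns omitted for brevity).
data_df = pd.DataFrame(np.random.rand(2500, 9000))

# The call in question, with placeholder compression settings.
data_df.to_hdf("data.h5", key="data", mode="w", format="fixed",
               complib="blosc", complevel=9)
```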

Asked Jul 13 '15 by Mark Horvath

People also ask

Are HDF5 files compressed?

The HDF5 file format and library provide flexibility to use a variety of data compression filters on individual datasets in an HDF5 file. Compressed data is stored in chunks and automatically uncompressed by the library and filter plugin when a chunk is accessed.
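A minimal h5py sketch of this behaviour (the gzip level and the chunk shape are arbitrary choices for illustration):

```python
import h5py
import numpy as np

a = np.random.rand(2500, 9000)

# Write a dataset with a compression filter; compressed data is stored in chunks.
with h5py.File("compressed.h5", "w") as f:
    f.create_dataset("data", data=a, chunks=(256, 256),
                     compression="gzip", compression_opts=4)

# On read, the library decompresses the accessed chunks transparently.
with h5py.File("compressed.h5", "r") as f:
    block = f["data"][:100, :100]  # only the chunks covering this slice are read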

Is HDF5 faster than CSV?

Comparisons of averaged I/O times across data formats have found that HDF can show even slower loading speed than CSV, while other binary formats perform noticeably better. The two most impressive are feather and parquet.
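Such comparisons are easy to reproduce on your own data; a rough timing sketch on synthetic data (results will vary with data shape, dtypes and disk):

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(2500, 9000))
df.to_csv("bench.csv", index=False)
df.to_hdf("bench.h5", key="df", mode="w", format="fixed")

def timed(fn):
    # Time a single read; for real benchmarks, repeat and average.
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

print("csv :", timed(lambda: pd.read_csv("bench.csv")))
print("hdf5:", timed(lambda: pd.read_hdf("bench.h5", "df")))
```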

Is HDF5 lossless?

HDF5 also supports lossless compression of datasets.

What is chunk in HDF5?

Chunked Storage That's what chunking does in HDF5. It lets you specify the N-dimensional “shape” that best fits your access pattern. When the time comes to write data to disk, HDF5 splits the data into “chunks” of the specified shape, flattens them, and writes them to disk.
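A hypothetical h5py illustration of matching the chunk shape to the access pattern (dataset names and shapes here are made up):

```python
import h5py
import numpy as np

a = np.random.rand(2500, 9000)

with h5py.File("chunked.h5", "w") as f:
    # Chunk shape chosen for a column-wise access pattern:
    # each chunk holds all rows of a single column.
    f.create_dataset("col_chunks", data=a, chunks=(2500, 1))
    # A row-wise access pattern would instead suggest chunks of shape (1, 9000).
    f.create_dataset("row_chunks", data=a, chunks=(1, 9000))

with h5py.File("chunked.h5", "r") as f:
    col = f["col_chunks"][:, 42]   # touches a single chunk
    row = f["row_chunks"][123, :]  # touches a single chunk
```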


1 Answer

There are a couple of possible compression filters that you could use. Since HDF5 version 1.8.11 you can easily register third-party compression filters.
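For example, in Python the hdf5plugin package registers several such third-party filters for use with h5py; a minimal sketch, assuming hdf5plugin is installed:

```python
import numpy as np
import h5py
import hdf5plugin  # importing this registers extra HDF5 compression filters

a = np.random.rand(2500, 9000)

with h5py.File("blosc.h5", "w") as f:
    # hdf5plugin exposes the registered filters as keyword bundles for create_dataset.
    f.create_dataset("data", data=a,
                     **hdf5plugin.Blosc(cname="lz4", clevel=5,
                                        shuffle=hdf5plugin.Blosc.SHUFFLE))
```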

Regarding performance:

It probably depends on your access pattern: you want to define proper dimensions for your chunks so that they align well with your access pattern, otherwise your performance will suffer a lot. For example, if you know that you usually access one column and all rows, you should define your chunk shape accordingly (1, 9000). See here, here and here for some info.
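A sketch of controlling the chunk shape directly with PyTables (the (1, 9000) chunk shape mirrors the example above and is only illustrative; pick whichever shape matches the slices you read most often):

```python
import numpy as np
import tables

a = np.random.rand(2500, 9000)

with tables.open_file("carray.h5", "w") as f:
    filters = tables.Filters(complib="blosc", complevel=9)
    carr = f.create_carray(f.root, "data",
                           atom=tables.Float64Atom(),
                           shape=a.shape,
                           chunkshape=(1, 9000),  # one chunk per row of the array
                           filters=filters)
    carr[:] = a
```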

However, AFAIK pandas will usually end up loading the entire HDF5 file into memory unless you use read_table and an iterator (see here) or do the partial I/O yourself (see here), and thus doesn't really benefit that much from defining a good chunk size.
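For completeness, a sketch of the pandas partial-I/O route; this assumes the data was written in 'table' format (which is slower to write than 'fixed'), and uses a narrow illustrative frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(2500, 100))

# Partial I/O requires the 'table' format.
df.to_hdf("partial.h5", key="df", mode="w", format="table")

# Read only a slice of rows ...
part = pd.read_hdf("partial.h5", "df", start=0, stop=500)

# ... or iterate over the file in chunks instead of loading it all at once.
with pd.HDFStore("partial.h5", mode="r") as store:
    for chunk in store.select("df", chunksize=500):
        pass  # process each 500-row chunk here
```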

Nevertheless, you might still benefit from compression, because loading the compressed data into memory and decompressing it on the CPU is probably faster than loading the uncompressed data.

Regarding your original question:

I would recommend taking a look at Blosc. It is a multi-threaded meta-compressor library that supports a variety of compression filters:

  • BloscLZ: internal default compressor, heavily based on FastLZ.
  • LZ4: a compact, very popular and fast compressor.
  • LZ4HC: a tweaked version of LZ4, produces better compression ratios at the expense of speed.
  • Snappy: a popular compressor used in many places.
  • Zlib: a classic; somewhat slower than the previous ones, but achieving better compression ratios.

These have different strengths, and the best thing is to benchmark them with your data and see which works best.
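A rough benchmarking sketch with pandas (the compressor list and complevel=9 are placeholders; sweep complevel as well and use your real data rather than random numbers):

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(2500, 9000))

# Which blosc sub-compressors are available depends on how PyTables/Blosc was built.
complibs = ["zlib", "blosc:blosclz", "blosc:lz4", "blosc:lz4hc", "blosc:zlib"]

for complib in complibs:
    path = f"bench_{complib.replace(':', '_')}.h5"

    t0 = time.perf_counter()
    df.to_hdf(path, key="df", mode="w", format="fixed",
              complib=complib, complevel=9)
    write_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    pd.read_hdf(path, "df")
    read_s = time.perf_counter() - t0

    print(f"{complib:15s} write {write_s:6.2f}s  read {read_s:6.2f}s")
```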

Answered Nov 16 '22 by Ümit