I would like to ask a question about compression performance as it relates to the chunk size of HDF5 files.
I have two HDF5 files on hand; each contains a single dataset called "data".
File A's size: HDF5: 19 MB, CSV: 165 MB
File B's size: HDF5: 60 MB, CSV: 165 MB
Both of them show good compression compared to the CSV files. However, file A compresses to about 10% of the original CSV, while file B only compresses to about 30%.
I have tried different chunk sizes to make file B as small as possible, but 30% seems to be the best it can do. I would like to ask why file A can achieve greater compression while file B cannot.
If file B can also reach that level, what should the chunk size be?
Is there any rule for determining the optimum chunk size of an HDF5 file for compression purposes?
Thanks!
One of the most powerful features of HDF5 is its ability to store and modify compressed data. The HDF5 library comes with two pre-defined compression methods, GNUzip (gzip) and Szip, and has the capability of using third-party compression methods as well.
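As a minimal sketch (the file and dataset names below are made up), this is how a gzip-compressed dataset can be written with h5py; Szip can be requested the same way (compression="szip") if the underlying HDF5 library was built with Szip support:

```python
# Write one dataset with the built-in gzip filter at its highest level (9).
# "compressed.h5" and the random data are placeholders for illustration.
import numpy as np
import h5py

data = np.random.randint(0, 100, size=(100_000, 10), dtype=np.int32)

with h5py.File("compressed.h5", "w") as f:
    f.create_dataset("data", data=data,
                     compression="gzip", compression_opts=9)
```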
Although this approach is fine for small datasets, the HDF5 file size increases rapidly with the number of images. I have experienced situations where the HDF5 file takes 100x more space than the original dataset. This happens because the NumPy arrays take more storage space than the original image files.
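For illustration, a small sketch along these lines (the image path is hypothetical) compares the on-disk size of an encoded image with the size of the same image decoded into a NumPy array:

```python
# Compare the size of an encoded image file with the in-memory size of the
# decoded pixel array. "example.jpg" is a placeholder path.
import os

import numpy as np
from PIL import Image

image_path = "example.jpg"                     # hypothetical input image
decoded = np.asarray(Image.open(image_path))   # e.g. H x W x 3 uint8 array

print("encoded file size  :", os.path.getsize(image_path), "bytes")
print("decoded array size :", decoded.nbytes, "bytes")
```

For already-compressed formats such as JPEG, the decoded array is usually far larger than the file it came from, which is where the blow-up comes from.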
Chunked Storage
Chunking lets you specify the N-dimensional "shape" that best fits your access pattern. When the time comes to write data to disk, HDF5 splits the data into "chunks" of the specified shape, flattens them, and writes them to disk.
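As a sketch (shapes and file names chosen arbitrarily), with h5py the chunk shape is set through the chunks keyword when the dataset is created:

```python
# Explicit chunking: every 1000 x 10 block of this dataset is written to
# disk as one independent chunk.
import numpy as np
import h5py

data = np.random.rand(100_000, 10)

with h5py.File("chunked.h5", "w") as f:
    dset = f.create_dataset("data", data=data, chunks=(1000, 10))
    print(dset.chunks)   # -> (1000, 10)
```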
HDF5 also supports lossless compression of datasets.
Chunking doesn't really affect the compression ratio per se, except in the manner @Ümit describes. What chunking does do is affect the I/O performance. When compression is applied to an HDF5 dataset, it is applied to whole chunks, individually. This means that when reading data from a single chunk in a dataset, the entire chunk must be decompressed - possibly involving a whole lot more I/O, depending on the size of the cache, shape of the chunk, etc.
What you should do is make sure that the chunk shape matches how you read/write your data. If you generally read a column at a time, make your chunks columns, for example. This is a good tutorial on chunking.
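For example, here is a minimal sketch (all file and dataset names are hypothetical) of chunking a 2-D dataset so that each chunk holds a slice of a single column; reading one column then only decompresses the chunks that actually contain it:

```python
# Chunk shape chosen to match a column-wise read pattern: each chunk covers
# 10000 rows of exactly one column, so reading a column touches only the
# chunks belonging to that column.
import numpy as np
import h5py

data = np.random.rand(100_000, 50)

with h5py.File("column_chunks.h5", "w") as f:
    f.create_dataset("data", data=data,
                     chunks=(10_000, 1), compression="gzip")

with h5py.File("column_chunks.h5", "r") as f:
    col = f["data"][:, 3]   # decompresses only the 10 chunks of column 3
    print(col.shape)        # -> (100000,)
```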