
Compression performance related to chunk size in HDF5 files

I would like to ask a question about compression performance as it relates to the chunk size of HDF5 files.

I have two HDF5 files on hand with the following properties. Each contains only one dataset, called "data".

File A's "data":

  1. Type: HDF5 Scalar Dataset
  2. No. of Dimensions: 2
  3. Dimension Size: 5094125 x 6
  4. Max. dimension size: Unlimited x Unlimited
  5. Data type: 64-bit floating point
  6. Chunking: 10000 x 6
  7. Compression: GZIP level = 7

File B's "data":

  1. Type: HDF5 Scalar Dataset
  2. No. of Dimensions: 2
  3. Dimension Size: 6720 x 1000
  4. Max. dimension size: Unlimited x Unlimited
  5. Data type: 64-bit floating point
  6. Chunking: 6000 x 1
  7. Compression: GZIP level = 7

File A's size: HDF5 = 19 MB, CSV = 165 MB

File B's size: HDF5 = 60 MB, CSV = 165 MB
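For reference, these properties can be inspected programmatically with h5py (the filenames below are placeholders, not my actual file names):

```python
import h5py

# Placeholder filenames standing in for the two files described above.
for name in ("file_a.h5", "file_b.h5"):
    with h5py.File(name, "r") as f:
        dset = f["data"]
        print(name)
        print("  shape:", dset.shape)              # e.g. (5094125, 6) for file A
        print("  dtype:", dset.dtype)              # float64
        print("  chunks:", dset.chunks)            # e.g. (10000, 6)
        print("  compression:", dset.compression,  # 'gzip'
              "level", dset.compression_opts)      # 7
```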

Both show great compression of the stored data compared to the CSV files. However, file A is compressed to about 10% of the original CSV size, while file B only reaches about 30%.

I have tried different chunk sizes to make file B as small as possible, but it seems that 30% is the best compression ratio I can get. I would like to ask why file A can achieve greater compression while file B cannot.

If file B can also achieve it, what should the chunk size be?

Is there any rule for determining the optimum HDF5 chunk size for compression purposes?
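To make "trying different chunk sizes" concrete, here is a sketch of the kind of comparison that can be run with h5py. The chunk shapes and filenames are only examples, and the random array is a stand-in for file B's real data (random data compresses far worse, so only the relative sizes matter):

```python
import os
import h5py
import numpy as np

# Random stand-in for file B's 6720 x 1000 float64 dataset.
data = np.random.rand(6720, 1000)

# A few candidate chunk shapes to compare (examples, not an exhaustive search).
for chunks in [(6000, 1), (6720, 10), (672, 100), (100, 1000)]:
    fname = "test_{}x{}.h5".format(*chunks)
    with h5py.File(fname, "w") as f:
        f.create_dataset("data", data=data, chunks=chunks,
                         compression="gzip", compression_opts=7,
                         maxshape=(None, None))
    print(chunks, "->", round(os.path.getsize(fname) / 1e6, 1), "MB")
```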

Thanks!


People also ask

Does HDF5 compress data?

One of the most powerful features of HDF5 is its ability to store and modify compressed data. The HDF5 Library comes with two pre-defined compression methods, GNUzip (Gzip) and Szip and has the capability of using third-party compression methods as well.

Why are HDF5 files so large?

Although this approach works for small datasets, the HDF5 file size increases rapidly with the number of images. I have experienced situations where the HDF5 file takes 100x more space than the original dataset. This happens because the NumPy array takes more storage space than the original image files.

What is chunk in HDF5?

Chunked storage lets you specify the N-dimensional "shape" that best fits your access pattern. When the time comes to write data to disk, HDF5 splits the data into "chunks" of the specified shape, flattens them, and writes them to disk.

Is HDF5 lossless?

HDF5 also supports lossless compression of datasets.


1 Answer

Chunking doesn't really affect the compression ratio per se, except in the manner @Ümit describes. What chunking does do is affect the I/O performance. When compression is applied to an HDF5 dataset, it is applied to whole chunks, individually. This means that when reading data from a single chunk in a dataset, the entire chunk must be decompressed - possibly involving a whole lot more I/O, depending on the size of the cache, shape of the chunk, etc.

What you should do is make sure that the chunk shape matches how you read/write your data. If you generally read a column at a time, make your chunks columns, for example. This is a good tutorial on chunking.
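As a rough illustration of that advice (a sketch with made-up shapes and filenames, not taken from the question's actual data): with one-column chunks, a column read decompresses exactly one chunk, whereas row-shaped chunks force every intersecting chunk to be decompressed for the same read.

```python
import time
import h5py
import numpy as np

data = np.random.rand(6720, 1000)  # stand-in data with file B's shape

# Same data, two chunk layouts: row-shaped chunks vs. one-column chunks.
with h5py.File("row_chunks.h5", "w") as f:
    f.create_dataset("data", data=data, chunks=(1, 1000),
                     compression="gzip", compression_opts=7)
with h5py.File("col_chunks.h5", "w") as f:
    f.create_dataset("data", data=data, chunks=(6720, 1),
                     compression="gzip", compression_opts=7)

def read_column(fname, col=42):
    start = time.perf_counter()
    with h5py.File(fname, "r") as f:
        _ = f["data"][:, col]  # read a single column
    return time.perf_counter() - start

# Reading one column from row_chunks.h5 touches every chunk that intersects
# the column (6720 chunks here); col_chunks.h5 touches just one chunk.
print("row-shaped chunks:", read_column("row_chunks.h5"))
print("column chunks:    ", read_column("col_chunks.h5"))
```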
