 

compressed files bigger in h5py

I'm using h5py to save numpy arrays in HDF5 format from Python. Recently I tried to apply compression, and the files I get are bigger than before.

I went from things like this (every file has several datasets):

self._h5_current_frame.create_dataset(
        'estimated position', shape=estimated_pos.shape,
        dtype=float, data=estimated_pos)

to things like this:

self._h5_current_frame.create_dataset(
        'estimated position', shape=estimated_pos.shape, dtype=float,
        data=estimated_pos, compression="gzip", compression_opts=9)

In one particular example, the size of the compressed file is 172K and that of the uncompressed file is 72K (and h5diff reports that both files are equal). I tried a more basic example and it works as expected, but not in my program.

How is that possible? I don't think the gzip algorithm ever produces a bigger compressed file, so it's probably related to h5py and how I'm using it :-/ Any ideas?

Cheers!!

EDIT:

Looking at the output from h5stat, it seems the compressed version stores a lot more metadata (see the last few lines of each output).

compressed file

Filename: res_totolaca_jue_2015-10-08_17:06:30_19387.hdf5
File information
    # of unique groups: 21
    # of unique datasets: 56
    # of unique named datatypes: 0
    # of unique links: 0
    # of unique other: 0
    Max. # of links to object: 1
    Max. # of objects in group: 5
File space information for file metadata (in bytes):
    Superblock extension: 0
    User block: 0
    Object headers: (total/unused)
        Groups: 3798/503
        Datasets(exclude compact data): 15904/9254
        Datatypes: 0/0
    Groups:
        B-tree/List: 0
        Heap: 0
    Attributes:
        B-tree/List: 0
        Heap: 0
    Chunked datasets:
        Index: 116824
    Datasets:
        Heap: 0
    Shared Messages:
        Header: 0
        B-tree/List: 0
        Heap: 0
Small groups (with 0 to 9 links):
    # of groups with 1 link(s): 1
    # of groups with 2 link(s): 5
    # of groups with 3 link(s): 5
    # of groups with 5 link(s): 10
    Total # of small groups: 21
Group bins:
    # of groups with 1 - 9 links: 21
    Total # of groups: 21
Dataset dimension information:
    Max. rank of datasets: 3
    Dataset ranks:
        # of dataset with rank 1: 51
        # of dataset with rank 2: 3
        # of dataset with rank 3: 2
1-D Dataset information:
    Max. dimension size of 1-D datasets: 624
    Small 1-D datasets (with dimension sizes 0 to 9):
        # of datasets with dimension sizes 1: 36
        # of datasets with dimension sizes 2: 2
        # of datasets with dimension sizes 3: 2
        Total # of small datasets: 40
    1-D Dataset dimension bins:
        # of datasets with dimension size 1 - 9: 40
        # of datasets with dimension size 10 - 99: 2
        # of datasets with dimension size 100 - 999: 9
        Total # of datasets: 51
Dataset storage information:
    Total raw data size: 33602
    Total external raw data size: 0
Dataset layout information:
    Dataset layout counts[COMPACT]: 0
    Dataset layout counts[CONTIG]: 2
    Dataset layout counts[CHUNKED]: 54
    Number of external files : 0
Dataset filters information:
    Number of datasets with:
        NO filter: 2
        GZIP filter: 54
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 0
Dataset datatype information:
    # of unique datatypes used by datasets: 4
    Dataset datatype #0:
        Count (total/named) = (20/0)
        Size (desc./elmt) = (14/8)
    Dataset datatype #1:
        Count (total/named) = (17/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #2:
        Count (total/named) = (10/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #3:
        Count (total/named) = (9/0)
        Size (desc./elmt) = (14/8)
    Total dataset datatype count: 56
Small # of attributes (objects with 1 to 10 attributes):
    Total # of objects with small # of attributes: 0
Attribute bins:
    Total # of objects with attributes: 0
    Max. # of attributes to objects: 0
Summary of file space information:
  File metadata: 136526 bytes
  Raw data: 33602 bytes
  Unaccounted space: 5111 bytes
Total space: 175239 bytes

uncompressed file

Filename: res_totolaca_jue_2015-10-08_17:03:04_19267.hdf5
File information
    # of unique groups: 21
    # of unique datasets: 56
    # of unique named datatypes: 0
    # of unique links: 0
    # of unique other: 0
    Max. # of links to object: 1
    Max. # of objects in group: 5
File space information for file metadata (in bytes):
    Superblock extension: 0
    User block: 0
    Object headers: (total/unused)
        Groups: 3663/452
        Datasets(exclude compact data): 15904/10200
        Datatypes: 0/0
    Groups:
        B-tree/List: 0
        Heap: 0
    Attributes:
        B-tree/List: 0
        Heap: 0
    Chunked datasets:
        Index: 0
    Datasets:
        Heap: 0
    Shared Messages:
        Header: 0
        B-tree/List: 0
        Heap: 0
Small groups (with 0 to 9 links):
    # of groups with 1 link(s): 1
    # of groups with 2 link(s): 5
    # of groups with 3 link(s): 5
    # of groups with 5 link(s): 10
    Total # of small groups: 21
Group bins:
    # of groups with 1 - 9 links: 21
    Total # of groups: 21
Dataset dimension information:
    Max. rank of datasets: 3
    Dataset ranks:
        # of dataset with rank 1: 51
        # of dataset with rank 2: 3
        # of dataset with rank 3: 2
1-D Dataset information:
    Max. dimension size of 1-D datasets: 624
    Small 1-D datasets (with dimension sizes 0 to 9):
        # of datasets with dimension sizes 1: 36
        # of datasets with dimension sizes 2: 2
        # of datasets with dimension sizes 3: 2
        Total # of small datasets: 40
    1-D Dataset dimension bins:
        # of datasets with dimension size 1 - 9: 40
        # of datasets with dimension size 10 - 99: 2
        # of datasets with dimension size 100 - 999: 9
        Total # of datasets: 51
Dataset storage information:
    Total raw data size: 50600
    Total external raw data size: 0
Dataset layout information:
    Dataset layout counts[COMPACT]: 0
    Dataset layout counts[CONTIG]: 56
    Dataset layout counts[CHUNKED]: 0
    Number of external files : 0
Dataset filters information:
    Number of datasets with:
        NO filter: 56
        GZIP filter: 0
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 0
Dataset datatype information:
    # of unique datatypes used by datasets: 4
    Dataset datatype #0:
        Count (total/named) = (20/0)
        Size (desc./elmt) = (14/8)
    Dataset datatype #1:
        Count (total/named) = (17/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #2:
        Count (total/named) = (10/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #3:
        Count (total/named) = (9/0)
        Size (desc./elmt) = (14/8)
    Total dataset datatype count: 56
Small # of attributes (objects with 1 to 10 attributes):
    Total # of objects with small # of attributes: 0
Attribute bins:
    Total # of objects with attributes: 0
    Max. # of attributes to objects: 0
Summary of file space information:
  File metadata: 19567 bytes
  Raw data: 50600 bytes
  Unaccounted space: 5057 bytes
Total space: 75224 bytes
asked Oct 07 '15 by manu



1 Answer

First, here's a reproducible example:

import h5py
from scipy.misc import lena

img = lena()    # some compressible image data

f1 = h5py.File('nocomp.h5', 'w')
f1.create_dataset('img', data=img)
f1.close()

f2 = h5py.File('complevel_9.h5', 'w')
f2.create_dataset('img', data=img, compression='gzip', compression_opts=9)
f2.close()

f3 = h5py.File('complevel_0.h5', 'w')
f3.create_dataset('img', data=img, compression='gzip', compression_opts=0)
f3.close()

Now let's look at the file sizes:

~$ h5stat -S nocomp.h5
Filename: nocomp.h5
Summary of file space information:
  File metadata: 1304 bytes
  Raw data: 2097152 bytes
  Unaccounted space: 840 bytes
Total space: 2099296 bytes

~$ h5stat -S complevel_9.h5
Filename: complevel_9.h5
Summary of file space information:
  File metadata: 11768 bytes
  Raw data: 302850 bytes
  Unaccounted space: 1816 bytes
Total space: 316434 bytes

~$ h5stat -S complevel_0.h5
Filename: complevel_0.h5
Summary of file space information:
  File metadata: 11768 bytes
  Raw data: 2098560 bytes
  Unaccounted space: 1816 bytes
Total space: 2112144 bytes

In my example, compression with gzip -9 makes sense - although it requires an extra ~10kB of metadata, this is more than outweighed by a ~1794kB decrease in the size of the image data (about a 7:1 compression ratio). The net result is a ~6.6 fold reduction in total file size.

However, in your example the compression only reduces the size of your raw data by ~16kB (a compression ratio of about 1.5:1), which is massively outweighed by a 116kB increase in the size of the metadata. The reason why the increase in metadata size is so much larger than for my example is probably because your file contains 56 datasets rather than just one.

Even if gzip magically reduced the size of your raw data to zero you would still end up with a file that was ~1.8 times larger than the uncompressed version. The size of the metadata is more or less guaranteed to scale sublinearly with the size of your arrays, so if your datasets were much larger then you would start to see some benefit from compressing them. As it stands, your array is so small that it's unlikely that you'll gain anything from compression.
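If it helps to see where that space goes dataset by dataset (rather than the file-wide totals that h5stat prints), you can walk the file with h5py. Here's a minimal sketch, where 'res_totolaca.hdf5' stands in for your actual filename:

import h5py

def report(name, obj):
    if isinstance(obj, h5py.Dataset):
        logical = obj.size * obj.dtype.itemsize   # uncompressed size of the data itself
        on_disk = obj.id.get_storage_size()       # raw data bytes allocated in the file (excludes metadata)
        print('{}: shape={}, chunks={}, compression={}, logical={} B, on disk={} B'.format(
            name, obj.shape, obj.chunks, obj.compression, logical, on_disk))

with h5py.File('res_totolaca.hdf5', 'r') as f:    # placeholder filename
    f.visititems(report)

Datasets with chunks=None are stored contiguously; the rest each carry their own chunk index, which is where most of your extra ~116kB of metadata lives (see "Chunked datasets: Index" in your h5stat output).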


Update:

The reason why the compressed version needs so much more metadata is not really to do with the compression per se, but rather to do with the fact that in order to use compression filters the dataset needs to be split into fixed-size chunks. Presumably a lot of the extra metadata is being used to store the B-tree that is needed to index the chunks.

f4 = h5py.File('nocomp_autochunked.h5', 'w')
# let h5py pick a chunk size automatically
f4.create_dataset('img', data=img, chunks=True)
print(f4['img'].chunks)
# (32, 64)
f4.close()

f5 = h5py.File('nocomp_onechunk.h5', 'w')
# make the chunk shape the same as the shape of the array, so that there 
# is only one chunk
f5.create_dataset('img', data=img, chunks=img.shape)
print(f5['img'].chunks)
# (512, 512)
f5.close()

f6 = h5py.File('complevel_9_onechunk.h5', 'w')
f6.create_dataset('img', data=img, chunks=img.shape, compression='gzip',
                  compression_opts=9)
f6.close()

And the resulting file sizes:

~$ h5stat -S nocomp_autochunked.h5
Filename: nocomp_autochunked.h5
Summary of file space information:
  File metadata: 11768 bytes
  Raw data: 2097152 bytes
  Unaccounted space: 1816 bytes
Total space: 2110736 bytes

~$ h5stat -S nocomp_onechunk.h5
Filename: nocomp_onechunk.h5
Summary of file space information:
  File metadata: 3920 bytes
  Raw data: 2097152 bytes
  Unaccounted space: 96 bytes
Total space: 2101168 bytes

~$ h5stat -S complevel_9_onechunk.h5
Filename: complevel_9_onechunk.h5
Summary of file space information:
  File metadata: 3920 bytes
  Raw data: 305051 bytes
  Unaccounted space: 96 bytes
Total space: 309067 bytes

It's obvious that chunking is what incurs the extra metadata rather than compression, since nocomp_autochunked.h5 contains exactly the same amount of metadata as complevel_0.h5 above, and introducing compression to the chunked version in complevel_9_onechunk.h5 made no difference to the total amount of metadata.

Increasing the chunk size such that the array is stored as a single chunk reduced the amount of metadata by a factor of about 3 in this example. How much difference this would make in your case will probably depend on how h5py automatically selects a chunk size for your input dataset. Interestingly this also resulted in a very slight reduction in the compression ratio, which is not what I would have predicted.

Bear in mind that there are also disadvantages to having larger chunks. Whenever you want to access a single element within a chunk, the whole chunk needs to be decompressed and read into memory. For a large dataset this can be disastrous for performance, but in your case the arrays are so small that it's probably not worth worrying about.
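For instance, with the single-chunk layout above, reading even one element pulls the whole compressed chunk through the gzip filter (a quick sketch using the file created earlier):

import h5py

with h5py.File('complevel_9_onechunk.h5', 'r') as f:
    # returns a single value, but HDF5 has to decompress the entire
    # (512, 512) chunk into its chunk cache to service this read
    pixel = f['img'][0, 0]
    print(pixel)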

Another thing you should consider is whether you can store your datasets within a single array rather than lots of small arrays. For example, if you have K 2D arrays of the same dtype, each with dimensions MxN, then you could store them more efficiently in a single KxMxN 3D dataset rather than in lots of small ones. I don't know enough about your data to know whether this is feasible.
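As a rough sketch of what that could look like (the filename and the frames list here are made up for illustration), assuming all K arrays really do share the same shape and dtype:

import numpy as np
import h5py

# hypothetical stand-in for K same-shaped 2D arrays of the same dtype
frames = [np.random.rand(512, 512) for _ in range(10)]

with h5py.File('stacked.h5', 'w') as f:
    stack = np.stack(frames)               # shape (K, M, N)
    # one chunked/compressed dataset instead of K of them, so the
    # per-dataset header and chunk-index overhead is paid only once
    f.create_dataset('frames', data=stack,
                     compression='gzip', compression_opts=9)

A bigger dataset also gives gzip more data to work with per chunk, which should help the compression ratio compared with compressing each small array separately.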

answered by ali_m