 

Why is it faster to read whole hdf5 dataset than a slice

Tags:

python

io

hdf5

h5py

I'm trying to figure out why this happens:

In [1]: import time, h5py as h5
In [2]: f = h5.File('myfile.hdf5', 'r')                                                                                                                                    
In [3]: st = time.time(); data = f["data"].value[0,:,1,...]; elapsed = time.time() - st;
In [4]: elapsed
Out[4]: 11.127676010131836
In [5]: st = time.time(); data = f["data"][0,:,1,...]; elapsed2 = time.time() - st;
In [6]: elapsed2
Out[6]: 59.810582399368286
In [7]: f["data"].shape
Out[7]: (1, 4096, 6, 16, 16, 16, 16)
In [8]: f["data"].chunks
Out[8]: (1, 4096, 1, 16, 16, 16, 16)

As you can see, loading the whole dataset into memory and then taking a slice is faster than taking that same slice from the dataset.

The chunk size matches the slice, so it should all be contiguous memory, right? Why then is it so much slower?

The dataset is compressed with gzip (opts=2).
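For reference, here's a minimal sketch that reproduces the setup on a scaled-down file with the same chunk layout and gzip settings (the smaller shape and the file name myfile_small.hdf5 are just assumptions to keep the example quick to build):

import time
import numpy as np
import h5py as h5

# Scaled-down stand-in for the (1, 4096, 6, 16, 16, 16, 16) dataset above;
# the smaller shape is an assumption so the file builds in seconds.
shape  = (1, 512, 6, 8, 8, 8, 8)
chunks = (1, 512, 1, 8, 8, 8, 8)   # whole axis 1, one index of axis 2 per chunk
arr = np.random.random(shape).astype('f4')  # random data compresses poorly; real data may differ

with h5.File('myfile_small.hdf5', 'w') as f:
    f.create_dataset('data', data=arr, chunks=chunks,
                     compression='gzip', compression_opts=2)

with h5.File('myfile_small.hdf5', 'r') as f:
    st = time.time()
    a = f['data'][...][0, :, 1, ...]   # read everything, then slice in NumPy
    print('full read + NumPy slice:', time.time() - st)

    st = time.time()
    b = f['data'][0, :, 1, ...]        # let HDF5 do the selection
    print('HDF5 slice:', time.time() - st)

assert np.array_equal(a, b)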

Following Andrew's comment, I ran it again, clearing the caches between the two reads:

elapsed1: 11.001180410385132
elapsed2: 43.19723725318909
48.61user 4.45system 0:54.65elapsed 97%CPU (0avgtext+0avgdata 8431596maxresident)k
479584inputs+0outputs (106major+3764414minor)pagefaults 0swaps

(This next run had a 10-second delay between the two reads to clear the caches.)

elapsed1: 11.46790862083435
elapsed2: 43.438515186309814

48.54user 4.66system 1:05.71elapsed 80%CPU (0avgtext+0avgdata 8431944maxresident)k
732504inputs+0outputs (220major+3764449minor)pagefaults 0swaps
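For reference, the usual way to drop the Linux page cache between reads looks roughly like this (needs root; a sketch, not necessarily the exact method used above):

import subprocess

def drop_linux_page_cache():
    """Flush dirty pages, then drop the OS page cache (Linux only, needs root),
    so the next read really comes from disk instead of memory."""
    subprocess.run(['sync'], check=True)
    subprocess.run(['sh', '-c', 'echo 3 > /proc/sys/vm/drop_caches'], check=True)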
asked Nov 24 '18 by mjgalindo



1 Answer

First I ran a test of my own. I don't have your HDF5 file, so I used one of my test files: a Table dataset with ~54,000 rows (a simpler layout than your 7-D array, so not directly comparable).
Timing with .value gives:

>>> elapsed
0.15540122985839844

Timing result with NumPy indexing gives:

>>> elapsed2
0.12980079650878906

So, I don't see much difference in performance. Maybe it's related to the dataset sizes we're testing, or to the complexity of the data tables?

A little reading of the most recent h5py documentation turns up some interesting comments about Dataset.value (from Release 2.8.0 - Jun 05, 2018; emphasis mine):
Dataset.value property is now deprecated.
The property Dataset.value, which dates back to h5py 1.0, is deprecated and will be removed in a later release. This property dumps the entire dataset into a NumPy array. Code using .value should be updated to use NumPy indexing, using mydataset[...] or mydataset[()] as appropriate.
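So the migration the docs call for is mechanical; a quick sketch, assuming f is an open h5py.File:

# Old style (deprecated since h5py 2.8, removed in h5py 3.0):
data = f["data"].value

# Current equivalents:
data   = f["data"][...]        # full read of an n-dimensional dataset
scalar = f["my_scalar"][()]    # full read of a scalar dataset ("my_scalar" is hypothetical)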

Your timing tests seem to run counter to the highlighted observation above.

I think you need to ask an h5py developer to comment on the performance difference (and on where the data is stored -- in memory vs on disk). Have you checked with the h5py user group?

Edit: After posting, I found this SO Q&A. It has lots of good comments and includes responses from the h5py developer:
h5py: Correct way to slice array datasets
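One practical takeaway from that thread, consistent with your timings: when a slice pattern is expensive through HDF5's selection machinery, it can be cheaper to read the whole dataset into memory once (if it fits in RAM) and slice with NumPy. A rough sketch:

import h5py as h5

with h5.File('myfile.hdf5', 'r') as f:
    full = f['data'][...]      # one pass through the chunked/compressed data

# Any number of slices are now cheap in-memory NumPy operations:
view = full[0, :, 1, ...]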

answered Nov 14 '22 by kcw78