I'm a bit confused here:
As far as I have understood, h5py's .value method reads an entire dataset and dumps it into an array, which is slow and discouraged (and should generally be replaced by [()]). The correct way is to use numpy-esque slicing.
However, I'm getting irritating results (with h5py 2.2.1):
>>> import h5py
>>> import numpy as np
>>> file = h5py.File("test.hdf5",'w')
# Just fill a test file with a numpy array test dataset
>>> file["test"] = np.arange(0,300000)
# This is TERRIBLY slow?!
>>> file["test"][range(0,300000)]
array([ 0, 1, 2, ..., 299997, 299998, 299999])
# This is fast
>>> file["test"].value[range(0,300000)]
array([ 0, 1, 2, ..., 299997, 299998, 299999])
# This is also fast
>>> file["test"].value[np.arange(0,300000)]
array([ 0, 1, 2, ..., 299997, 299998, 299999])
# This crashes
>>> file["test"][np.arange(0,300000)]
I guess that my dataset is so small that .value
doesn't hinder performance significantly, but how can the first option be that slow?
What is the preferred version here?
Thanks!
UPDATE
It seems that I wasn't clear enough, sorry. I do know that .value copies the whole dataset into memory, while slicing only retrieves the appropriate subpart. What I'm wondering is why slicing within the file is slower than copying the whole array and then slicing in memory.
I always thought hdf5/h5py was implemented specifically so that slicing subparts would always be the fastest.
For fast slicing with h5py, stick to the "plain-vanilla" slice notation:
file['test'][0:300000]
or, for example, reading every other element:
file['test'][0:300000:2]
Simple slicing (slice objects and single integer indices) should be very fast, as it translates directly into HDF5 hyperslab selections.
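To see the gap for yourself, here is a minimal timing sketch (not from the original post; exact numbers depend on your machine, disk, and h5py version):

import timeit

import h5py
import numpy as np

# Create the same kind of test file as in the question.
with h5py.File("test.hdf5", "w") as f:
    f["test"] = np.arange(300000)

with h5py.File("test.hdf5", "r") as f:
    dset = f["test"]
    idx = np.arange(300000)

    # Plain slice: a single HDF5 hyperslab selection, handled natively.
    t_slice = timeit.timeit(lambda: dset[0:300000], number=1)

    # Read the whole dataset, then index the in-memory NumPy array.
    t_read = timeit.timeit(lambda: dset[()][idx], number=1)

    # Fancy indexing on the dataset itself: h5py's slow Python path.
    # (This one may take noticeably longer.)
    t_fancy = timeit.timeit(lambda: dset[idx], number=1)

    print(f"plain slice        : {t_slice:.4f} s")
    print(f"read all + index   : {t_read:.4f} s")
    print(f"fancy index on disk: {t_fancy:.4f} s")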
The expression file['test'][range(300000)] invokes h5py's version of "fancy indexing", namely indexing via an explicit list of indices. There's no native way to do this in HDF5, so h5py implements a (slower) method in Python, which unfortunately has abysmal performance when the lists are > 1000 elements. Likewise for file['test'][np.arange(300000)], which is interpreted in the same way.
See also:
[1] http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing
[2] https://github.com/h5py/h5py/issues/293
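If you genuinely need a scattered subset of indices rather than a contiguous range, one common workaround (shown here only as a sketch, with a hypothetical helper read_indices) is to read a covering contiguous block with a plain slice and do the fancy indexing on the in-memory NumPy array:

import h5py
import numpy as np

def read_indices(dset, indices):
    """Fetch scattered indices from an HDF5 dataset.

    Reads one contiguous block covering min(indices)..max(indices) with a
    fast hyperslab slice, then applies the fancy index in memory with NumPy,
    avoiding h5py's slow per-element selection path.
    """
    indices = np.asarray(indices)
    lo, hi = int(indices.min()), int(indices.max()) + 1
    block = dset[lo:hi]            # single fast hyperslab read
    return block[indices - lo]     # fancy indexing in memory

with h5py.File("test.hdf5", "r") as f:
    wanted = np.array([5, 17, 123456, 299999])
    print(read_indices(f["test"], wanted))

Whether this pays off depends on how densely the wanted indices cover the block; for very sparse indices spread over a huge dataset, the extra I/O may outweigh the savings.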