
h5py: Correct way to slice array datasets

Tags: python, numpy, h5py

I'm a bit confused here:

As far as I understand, h5py's .value property reads an entire dataset into an array, which is slow and discouraged (it should generally be replaced by [()]). The correct way is to use NumPy-style slicing.

However, I'm getting puzzling results (with h5py 2.2.1):

>>> import h5py
>>> import numpy as np
>>> file = h5py.File("test.hdf5",'w')
# Just fill a test file with a numpy array test dataset
>>> file["test"] = np.arange(0,300000)

# This is TERRIBLY slow?!
>>> file["test"][range(0,300000)]
array([     0,      1,      2, ..., 299997, 299998, 299999])
# This is fast
>>> file["test"].value[range(0,300000)]
array([     0,      1,      2, ..., 299997, 299998, 299999])
# This is also fast
>>> file["test"].value[np.arange(0,300000)]
array([     0,      1,      2, ..., 299997, 299998, 299999])
# This crashes
>>> file["test"][np.arange(0,300000)]

I guess that my dataset is so small that .value doesn't hinder performance significantly, but how can the first option be that slow? What is the preferred version here?

Thanks!

UPDATE: It seems that I wasn't clear enough, sorry. I do know that .value copies the whole dataset into memory, while slicing only retrieves the appropriate subpart. What I'm wondering is why slicing in the file is slower than copying the whole array and then slicing in memory. I always thought hdf5/h5py was implemented specifically so that slicing subparts would always be the fastest.

asked Feb 13 '14 21:02 by JiaYow


1 Answer

For fast slicing with h5py, stick to the "plain-vanilla" slice notation:

file['test'][0:300000]

or, for example, reading every other element:

file['test'][0:300000:2]

Simple slicing (slice objects and single integer indices) should be very fast, as it translates directly into HDF5 hyperslab selections.
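The slice forms described above can be tried end to end with a small sketch like the following. The scratch file path is made up for illustration, and the dataset name "test" simply mirrors the question; treat this as one possible way to check the behaviour, not as a benchmark.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "test.hdf5")

# Write the same test dataset as in the question.
with h5py.File(path, "w") as f:
    f["test"] = np.arange(300000)

with h5py.File(path, "r") as f:
    dset = f["test"]
    first_ten = dset[0:10]          # contiguous slice
    every_other = dset[0:300000:2]  # strided slice
    whole = dset[()]                # entire dataset, the replacement for .value
```

Each of these reads returns an ordinary NumPy array, so the results remain usable after the file is closed.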

The expression file['test'][range(300000)] invokes h5py's version of "fancy indexing", namely, indexing via an explicit list of indices. There's no native way to do this in HDF5, so h5py implements a (slower) method in Python, which unfortunately has abysmal performance when the list has more than about 1000 elements. The same applies to file['test'][np.arange(300000)], which is interpreted in the same way.
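If an explicit index list is really needed, one possible workaround is to read the smallest plain slice that covers all requested indices and then apply the list in memory with NumPy. The helper below, read_indices, is a hypothetical name and not part of h5py; the sketch assumes a 1-D dataset and non-negative indices.

```python
import os
import tempfile

import h5py
import numpy as np


def read_indices(dset, indices):
    """Hypothetical helper: fetch arbitrary indices from a 1-D dataset.

    Reads one covering slice from the file, then applies NumPy fancy
    indexing in memory. Assumes all indices are non-negative.
    """
    idx = np.asarray(indices)
    if idx.size == 0:
        return np.empty((0,), dtype=dset.dtype)
    lo, hi = int(idx.min()), int(idx.max())
    block = dset[lo:hi + 1]  # single plain slice read from the file
    return block[idx - lo]   # in-memory fancy indexing


# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "test.hdf5")
with h5py.File(path, "w") as f:
    f["test"] = np.arange(300000)

with h5py.File(path, "r") as f:
    picked = read_indices(f["test"], [5, 2, 2, 9])
    everything = read_indices(f["test"], range(300000))
```

The trade-off is memory: the covering slice is loaded in full, so this only pays off when the requested indices are reasonably dense within their range.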

See also:

[1] http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing

[2] https://github.com/h5py/h5py/issues/293

answered Sep 20 '22 10:09 by Andrew Collette