
h5py: slicing dataset without loading into memory

Tags: python, numpy, h5py

Is it possible to slice an h5py dataset in two subsets without actually loading them into memory? E.g.:

import h5py

dset = h5py.File("/2tbhd/tst.h5py", "r")

# Integer division, so the slice bounds stay integers on Python 3
X_train = dset['X'][:N // 2]
X_test  = dset['X'][N // 2:-1]
memecs asked Sep 29 '22 22:09



1 Answer

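No. Indexing an h5py dataset with a slice reads the selected elements from the file and returns them as a NumPy array in memory.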

You would need to implement your own class to act as a view on the dataset. An old thread on the h5py mailing list indicates that such a DatasetView class is theoretically possible to implement using HDF5 dataspaces, but probably not worth it for many use cases. Element-wise access would be very slow compared to a normal numpy array (assuming you can fit your data into memory).

Edit: If you want to avoid messing with HDF5 data spaces (whatever that means), you might settle for a simpler approach. Try this gist I just wrote. Use it like this:

import h5py
import numpy

from simpleview import SimpleView  # module provided by the gist

dset = h5py.File("/2tbhd/tst.h5py", "r")
X_view = SimpleView(dset['X'])

# Stores slices, but doesn't load into memory
X_train = X_view[:N // 2]
X_test  = X_view[N // 2:-1]

# These statements will load the data into memory.
print(numpy.sum(X_train))
print(numpy.array(X_test)[0])

Note that the slicing support in this simple example is somewhat limited. If you want full slicing and element-wise access, you'll have to copy it into a real array:

X_train_copy = numpy.array(X_train)
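
For readers who cannot get at the gist, here is a rough sketch of what such a view class might look like. It is not the SimpleView code itself, and it does not use HDF5 dataspaces. The class name LazySliceView and its load() method are made up for illustration. It assumes the wrapped object has a shape attribute and returns a NumPy array when sliced, as an h5py dataset does, and it only supports step-1 slices along the first axis.

```python
class LazySliceView:
    """Hypothetical helper: remembers a slice along axis 0, reads nothing yet."""

    def __init__(self, source, start=0, stop=None):
        # `source` is assumed to expose `.shape` and slicing (e.g. an h5py dataset).
        self._source = source
        self._start = start
        self._stop = source.shape[0] if stop is None else stop

    def __len__(self):
        return self._stop - self._start

    def __getitem__(self, key):
        # Only record the new bounds; no data is read here.
        if not isinstance(key, slice):
            raise TypeError("only slices along the first axis are supported")
        start, stop, step = key.indices(len(self))
        if step != 1:
            raise ValueError("only step 1 is supported")
        stop = max(stop, start)  # empty slices stay empty
        return LazySliceView(self._source,
                             self._start + start,
                             self._start + stop)

    def load(self):
        # The actual read happens here, for the recorded range only.
        return self._source[self._start:self._stop]

    def __array__(self, dtype=None, copy=None):
        # Lets numpy.array(view) trigger the read.
        if copy is False:
            raise ValueError("reading from the source always creates a new array")
        data = self.load()
        return data if dtype is None else data.astype(dtype)
```

With the dset from above, LazySliceView(dset['X'])[:N // 2] would only record the bounds, and calling .load() on the result (or passing it to numpy.array) would perform the read.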
Stuart Berg answered Oct 03 '22 01:10
