I have multiple HDF5 datasets saved in the same file, my_file.h5. These datasets have different dimensions, but the same number of observations in the first dimension:
features.shape = (1000000, 24, 7, 1)
labels.shape = (1000000,)
info.shape = (1000000, 4)
It is important that the info/label data is correctly connected to each set of features and I therefore want to shuffle these datasets with an identical seed. Furthermore, I would like to shuffle these without ever loading them fully into memory. Is that possible using numpy and h5py?
Shuffling arrays on disk will be time-consuming, as it means that you have to allocate new arrays in the HDF5 file, then copy all the rows in a different order. You can iterate over rows (or use chunks of rows) if you want to avoid loading all the data at once into memory with PyTables or h5py, as in the sketch below.
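For the on-disk route, a rough sketch of a chunked shuffled copy might look like this (the dataset names come from the question; the _shuffled names, the seed, and the chunk size are arbitrary choices):

import numpy as np
import h5py

CHUNK = 1000  # h5py warns that very long coordinate lists can be slow

with h5py.File("my_file.h5", "r+") as f:
    n = f["features"].shape[0]
    rng = np.random.default_rng(seed=0)  # one seed, one shared order
    perm = rng.permutation(n)

    for name in ("features", "labels", "info"):
        src = f[name]
        dst = f.create_dataset(name + "_shuffled",
                               shape=src.shape, dtype=src.dtype)
        # Copy CHUNK shuffled rows at a time; only one chunk is in memory.
        for start in range(0, n, CHUNK):
            idx = perm[start:start + CHUNK]
            order = np.argsort(idx)      # h5py needs increasing coordinates
            rows = src[idx[order]]       # read the rows in sorted order
            dst[start:start + CHUNK] = rows[np.argsort(order)]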
An alternative approach could be to keep your data as it is and simply map new row numbers to old row numbers in a separate array (which you can keep fully loaded in RAM, since it will be only 4 MB with your array sizes). For instance, to shuffle a numpy array x,
import numpy as np

x = np.random.rand(5)
idx_map = np.arange(x.shape[0])
np.random.shuffle(idx_map)
Then you can use advanced numpy indexing to access your shuffled data,
x[idx_map[2]]  # equivalent to x_shuffled[2]
x[idx_map]     # equivalent to x_shuffled[:], etc.
This will also work with arrays saved to HDF5. Of course there would be some overhead compared to writing shuffled arrays on disk, but it could be sufficient depending on your use case.
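Applied to the datasets from the question, a minimal sketch could look like the following (h5py's fancy indexing wants coordinates in increasing order, as the answer below points out, so each batch is read in sorted order and then put back into shuffled order; the seed and batch size here are arbitrary):

import numpy as np
import h5py

with h5py.File("my_file.h5", "r") as f:
    n = f["features"].shape[0]
    rng = np.random.default_rng(seed=42)  # same seed -> same order everywhere
    idx_map = rng.permutation(n)          # only the index map lives in RAM

    batch = idx_map[:32]                  # one shuffled batch of row numbers
    order = np.argsort(batch)
    unsort = np.argsort(order)            # inverse permutation
    sorted_batch = batch[order]           # increasing, as h5py requires
    features = f["features"][sorted_batch][unsort]
    labels = f["labels"][sorted_batch][unsort]
    info = f["info"][sorted_batch][unsort]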
Shuffling arrays like this in numpy is straightforward. Create the large shuffling index (shuffle np.arange(1000000)) and index the arrays:
import numpy as np
I = np.arange(1000000)
np.random.shuffle(I)
features = features[I, ...]
labels = labels[I]
info = info[I, :]
This isn't an in-place operation. labels[I] is a copy of labels, not a slice or view.
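A tiny self-contained illustration of the copy semantics:

import numpy as np

labels = np.arange(5)
I = np.array([4, 3, 2, 1, 0])
shuffled = labels[I]   # advanced indexing always returns a copy
shuffled[0] = 99
print(labels)          # [0 1 2 3 4] -- the original is untouched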
An alternative, features[I, ...] = features, looks on the surface like an in-place operation. I doubt that it is, down in the C code. It has to be buffered, because the I values are not guaranteed to be unique. In fact there is a special ufunc .at method for unbuffered operations.
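A small illustration of why buffering matters when the indices repeat (the arrays here are just for demonstration):

import numpy as np

a = np.zeros(3)
a[[0, 0, 1]] += 1            # buffered: the duplicate write collapses
print(a)                     # [1. 1. 0.]

b = np.zeros(3)
np.add.at(b, [0, 0, 1], 1)   # unbuffered ufunc .at: both increments land
print(b)                     # [2. 1. 0.]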
But look at what h5py says about this same sort of 'fancy indexing': http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing
labels[I] selection is implemented, but with restrictions:

- List selections may not be empty
- Selection coordinates must be given in increasing order
- Duplicate selections are ignored
- Very long lists (> 1000 elements) may produce poor performance
Your shuffled I is, by definition, not in increasing order. And it is very large.
Also, I don't see anything about using this fancy indexing on the left-hand side, labels[I] = ....