
Is there a way to get a numpy-style view of a slice of an array stored in an HDF5 file?

I have to work on large 3D cubes of data. I want to store them in HDF5 files (using h5py or maybe pytables). I often want to perform analysis on just a section of these cubes. This section is too large to hold in memory. I would like to have a numpy style view to my slice of interest, without copying the data to memory (similar to what you could do with a numpy memmap). Is this possible? As far as I know, performing a slice using h5py, you get a numpy array in memory.

It has been asked why I would want to do this, since the data has to enter memory at some point anyway. My code, out of necessity, already runs piecemeal over data from these cubes, pulling small bits into memory at a time. These functions are simplest if they just iterate over the entirety of the datasets passed to them. If I could have a view of the data on disk, I could pass this view to these functions unchanged. If I cannot have a view, I need to write all my functions to iterate only over the slice of interest. This will add complexity to the code and make human error during analysis more likely.

Is there any way to get a view to the data on disk, without copying to memory?

Caleb asked Jan 06 '15 16:01



1 Answer

One possibility is to create a generator that yields the elements of the slice one by one. Once you have such a generator, you can pass it to your existing code and iterate through it as normal. For example, you can use a for loop on a generator, just as you might use one on a slice. Generators do not store all of their values at once; they 'generate' them as needed.
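As a minimal sketch of this idea, the snippet below uses a plain Python list as a stand-in for the on-disk dataset. The names `iter_rows` and `mean` are hypothetical, chosen only for illustration; `mean` stands in for existing analysis code that simply iterates over whatever it is given.

```python
def iter_rows(dataset, start, stop):
    """Lazily yield dataset[start], ..., dataset[stop - 1], one item at a time."""
    for i in range(start, stop):
        # Only the item currently being yielded is fetched from `dataset`.
        yield dataset[i]


def mean(values):
    """Stand-in for existing analysis code that just iterates over its input."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    return total / count


data = list(range(100))          # stand-in for a large on-disk dataset
view = iter_rows(data, 10, 20)   # nothing is read yet
result = mean(view)              # items are pulled one by one here
```

Note that a generator can be consumed only once. Code that needs several passes over the slice, or that needs `len()` or random access, would have to create a fresh generator each time or use a different approach.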

You might be able to create a slice containing just the locations in the cube that you want, rather than the data itself. If there are too many locations to store in memory as well, you could instead generate the next location of your slice programmatically. A generator could then use those locations to yield the data they contain one by one.

Assuming your slices are the (possibly higher-dimensional) equivalent of cuboids, you might generate the coordinates using nested for loops over range(), or by applying product() from the itertools module to range objects.
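The coordinate-based approach could be sketched as follows. This is an illustration under stated assumptions, not a tested h5py recipe: `cuboid_coords` and `iter_cuboid` are hypothetical helper names, and the dataset is assumed to support tuple indexing such as `dataset[i, j, k]`. A dict keyed by coordinate tuples stands in for the real cube here.

```python
from itertools import product


def cuboid_coords(*bounds):
    """Return an iterator over every coordinate tuple inside a cuboid.

    Each element of `bounds` is a (start, stop) pair for one axis.
    """
    return product(*(range(start, stop) for start, stop in bounds))


def iter_cuboid(dataset, *bounds):
    """Lazily yield dataset[i, j, k, ...] for each coordinate in the cuboid."""
    for coord in cuboid_coords(*bounds):
        # Assumption: `dataset` accepts a tuple of integer indices.
        yield dataset[coord]


# Stand-in for a 4x4x4 cube: a dict keyed by (i, j, k) tuples.
cube = {(i, j, k): 100 * i + 10 * j + k
        for i, j, k in product(range(4), repeat=3)}

# Elements with i in [1, 3), j in [0, 2), k in [2, 4); the last axis varies fastest.
values = list(iter_cuboid(cube, (1, 3), (0, 2), (2, 4)))
```

`product()` keeps a copy of each input range in memory, so its footprint grows with the sum of the axis lengths rather than with the number of coordinates. Reading a single element per access from a file is likely to be slow, so in practice it may be worth yielding larger blocks (for example, one row or plane at a time) instead of individual elements.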

IFcoltransG answered Sep 21 '22 04:09
