
How can I copy a multidimensional h5py dataset to a flat 1D Python list without making any intermediate copies?

The question

How can I copy the data from an N x N x N x... h5py dataset over to a 1D standard Python list without making an intermediate copy of the data?

I can think of a few different ways to do this with an intermediate copy. For example:

import h5py
import numpy as np

# initialize list, put some initial data in it
myList = ['foo']

# open up an h5py dataset from a file on disk
myFile = h5py.File('/path-to-my-data', 'r')
myData = myFile['bar']
myData.shape        # returns, for example, (5,15,7)

# copy dataset over to a numpy array
arr = np.zeros(myData.shape)
myData.read_direct(arr)

# finally, add data from copied dataset to myList
myList.extend(arr.flatten())

Can this be done without the intermediate copy to a numpy array?

Some background

(you absolutely do not have to read this unless you're curious)

I'm trying to copy data from an HDF5 file to a Protocol Buffers file via their Python APIs. These are both libraries/frameworks for writing your own complex, serializable data structures. In terms of their Python APIs, HDF5 pretends that its arrays are numpy arrays, whereas Protocol Buffers pretends that its arrays are standard 1D Python lists (sadly, there's no native support in Protocol Buffers for simple multidimensional arrays). Thus my need to convert from an h5py dataset to a Python list.

Edit

There was a request for some clarification about what I meant by

HDF5 pretends that its arrays are numpy arrays, whereas Protocol Buffers pretends that its arrays are standard 1D Python lists

What I mean is that an h5py dataset exposes an interface similar to that of a numpy array, and that a Python Protobuf repeated numeric field exposes an interface similar to that of a standard Python list. However, neither implements the full behavior, or even the full interface, of its prototype. For example, h5py datasets do not have a .flatten() method, and Pybuf repeated fields complain if you try to assign a list as an element (e.g. myBuf.repIntField[2] = [1,2,3] will always raise an error).

Here's the relevant line from the Pybuf documentation:

Repeated fields are represented as an object that acts like a Python sequence.

And the relevant lines from the h5py documentation (emphasis added):

Datasets are very similar to NumPy arrays. They are homogenous collections of data elements, with an immutable datatype and (hyper)rectangular shape. Unlike NumPy arrays, they support a variety of transparent storage features such as compression, error-detection, and chunked I/O.

asked Nov 10 '22 10:11 by tel


1 Answer

For numpy arrays I would suggest using ndarray.flat, but h5py Datasets don't have a flat/flatten attribute.
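As a small illustration of the numpy side of this, the sketch below contrasts `flatten()`, `ravel()`, and `flat` on an ordinary in-memory array. The array contents and variable names are made up for the example; it does not touch h5py at all.

```python
import numpy as np

# A small in-memory array standing in for data already loaded from disk
arr = np.arange(6).reshape(2, 3)

flat_copy = arr.flatten()   # always allocates a new 1D array
flat_view = arr.ravel()     # a view when the array is contiguous, as it is here

shares_copy = np.shares_memory(arr, flat_copy)
shares_view = np.shares_memory(arr, flat_view)

# arr.flat is an iterator over the elements, so extend() can consume it
# without a flattened array being built first
my_list = ['foo']
my_list.extend(arr.flat)
```

Note that the list elements produced this way are numpy scalars rather than native Python numbers.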

You could create a generator which brings chunks into memory as numpy arrays and then yields the values of each flattened chunk. This could then be converted into a list. For instance, to simply chunk along the outer dimension:

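import itertools

def yield_chunks(x):
    # iterating a dataset yields one slice along the first axis at a time
    for chunk in x:
        yield chunk.flat

# arr is the h5py dataset here (myData in the question)
myGenerator = itertools.chain.from_iterable(yield_chunks(arr))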

myGenerator will yield individual values from the dataset. You can convert this to a list with list(myGenerator), or append the values to an existing list with myList.extend(myGenerator).
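A variation on the same idea, sketched under some assumptions: the hypothetical helper `extend_flat` below appends one outer slice at a time and converts each slice to native Python numbers with `tolist()`. A plain numpy array stands in for the h5py dataset so that the example runs without a file on disk; the assumption is that the dataset, like the numpy array, yields numpy arrays when iterated along its first axis.

```python
import numpy as np

def extend_flat(target, dataset):
    """Append every element of `dataset` to `target` in row-major order.

    `dataset` is assumed to yield numpy arrays (or scalars, if it is 1D)
    when iterated along its first axis.
    """
    for slab in dataset:
        # Only one slab, plus its temporary list, is held in memory at a time
        target.extend(np.asarray(slab).ravel().tolist())
    return target

# Stand-in for the dataset from the question, with a made-up shape
stand_in = np.arange(24).reshape(2, 3, 4)

my_list = ['foo']
extend_flat(my_list, stand_in)
```

This still makes temporary copies, but each one is the size of a single outer slice rather than the whole dataset. If the outer slices are themselves too large, the same loop could be nested over a second axis.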

answered Nov 14 '22 22:11 by Stephen Pascoe