
How to store an array in hdf5 file which is too big to load in memory?

Is there any way to store an array in an hdf5 file, which is too big to load in memory?

If I do something like this:

import h5py
import numpy as np

f = h5py.File('test.hdf5', 'w')
f['mydata'] = np.zeros(2**32)

I get a memory error.

asked Mar 23 '15 at 11:03 by Sounak




1 Answer

According to the documentation, you can use create_dataset to create a chunked dataset stored in the HDF5 file. Example:

>>> import h5py
>>> f = h5py.File('test.h5', 'w')
>>> arr = f.create_dataset('mydata', (2**32,), chunks=True)
>>> arr
<HDF5 dataset "mydata": shape (4294967296,), type "<f4">
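For a sense of scale, here is a rough back-of-the-envelope calculation (my own addition, using only plain Python) of how much space 2**32 elements take. It assumes the default dtypes: float64 for np.zeros and float32 for create_dataset, as shown by the "<f4" type above. The in-memory array from the question would need about 32 GiB of RAM, whereas the dataset above only touches the slices you actually read or write.

```python
# Rough size estimate for a 1-D array of 2**32 elements (plain Python only).
n_elements = 2**32
GIB = 2**30

# Assumed element sizes: np.zeros defaults to float64 (8 bytes),
# the dataset shown above is float32 ("<f4", 4 bytes).
bytes_float64 = n_elements * 8
bytes_float32 = n_elements * 4

print(f"float64 in memory: {bytes_float64 / GIB:.0f} GiB")
print(f"float32 on disk:   {bytes_float32 / GIB:.0f} GiB")
```

So even with the smaller dtype, filling the whole dataset will write roughly 16 GiB to disk.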

Slicing the HDF5 dataset returns NumPy arrays.

>>> arr[:10]
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.], dtype=float32)
>>> type(arr[:10])
<class 'numpy.ndarray'>

You can assign values just as you would with a NumPy array.

>>> arr[3:5] = 3
>>> arr[:6]
array([ 0.,  0.,  0.,  3.,  3.,  0.], dtype=float32)

I don't know if this is the most efficient way, but you can iterate over the whole array in chunks, for example to fill it with random values:

>>> import numpy as np
>>> step = arr.chunks[0]
>>> for i in range(0, arr.size, step):
...     stop = min(i + step, arr.size)  # the last block may be shorter
...     arr[i:stop] = np.random.randn(stop - i)
...
>>> arr[:5]
array([ 0.62833798,  0.03631227,  2.00691652, -0.16631022,  0.07727782], dtype=float32)
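
The same block-wise pattern can be tried safely on a small file first. The following is only a sketch under my own assumptions: the helper name fill_in_blocks, the temporary file, the 1000-element shape and the chunk size of 64 are all made up for illustration and are not part of h5py or of the question.

```python
import os
import tempfile

import h5py
import numpy as np


def fill_in_blocks(dset, block_len, make_block):
    """Fill a 1-D dataset block by block so only one block is in memory.

    fill_in_blocks is a hypothetical helper, not an h5py function.
    make_block(n) must return n values, e.g. np.ones or np.random.randn.
    """
    n = dset.shape[0]
    for start in range(0, n, block_len):
        stop = min(start + block_len, n)  # the last block may be shorter
        dset[start:stop] = make_block(stop - start)


# Small illustrative dataset in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    dset = f.create_dataset("mydata", shape=(1000,), dtype="f4", chunks=(64,))
    dset_chunks = dset.chunks
    # Use a block length that is a multiple of the chunk length so that
    # each write covers whole chunks (except possibly the final block).
    fill_in_blocks(dset, block_len=4 * dset_chunks[0], make_block=np.ones)
    total = float(dset[:].sum())  # fine here: the demo dataset is tiny

print(dset_chunks, total)
```

Keeping the block length a multiple of the chunk length is a design choice, not a requirement; it just avoids writing the same chunk twice in the loop.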

answered Oct 22 '22 at 16:10 by RickardSjogren