
Read from a large file without loading whole thing into memory using h5py

Tags: python, hdf5, h5py

Does the following code read part of a dataset without loading the entire thing into memory (the whole dataset will not fit in memory), and get the size of the dataset without loading the data, using h5py in Python? If not, how can I do that?

import h5py

h5 = h5py.File('myfile.h5', 'r')
mydata = h5.get('matirx')  # are all data loaded into memory by using h5.get?
part_of_mydata = mydata[1000:11000, :]  # slice rows 1000..10999, all columns
size_data = mydata.shape

Thanks.

superMind asked Jan 31 '17 03:01



1 Answer

get (or indexing the file by name) fetches a reference to the Dataset in the file, but does not load any data.

In [789]: list(f.keys())
Out[789]: ['dset', 'dset1', 'vset']
In [790]: d=f['dset1']
In [791]: d
Out[791]: <HDF5 dataset "dset1": shape (2, 3, 10), type "<f8">
In [792]: d.shape         # shape of dataset
Out[792]: (2, 3, 10)
In [793]: arr=d[:,:,:5]    # indexing the set fetches part of the data
In [794]: arr.shape
Out[794]: (2, 3, 5)
In [795]: type(d)
Out[795]: h5py._hl.dataset.Dataset
In [796]: type(arr)
Out[796]: numpy.ndarray

d, the Dataset, is array-like, but it is not actually a numpy array.

Fetch the whole Dataset into memory as a numpy array with:

In [798]: arr = d[:]
In [799]: type(arr)
Out[799]: numpy.ndarray

Exactly how much of the file it has to read to fetch your slice depends on the slicing, data layout, chunking, and other things that generally aren't under your control, and shouldn't worry you.

Note also that when reading one dataset I'm not loading the others. The same applies to groups.
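
For a dataset that does not fit in memory, one possible pattern is to walk over it in row blocks, so that only one block is held as a numpy array at a time. The following is a minimal sketch, not a definitive recipe: the helper iter_row_blocks, the file name demo.h5, the dataset name "matrix", and the block size are all hypothetical, and the small demo file is created only so the example runs on its own.

```python
import os
import tempfile

import h5py
import numpy as np


def iter_row_blocks(dataset, block_rows):
    """Yield consecutive row blocks of an h5py Dataset as numpy arrays.

    Hypothetical helper for illustration; not part of h5py.
    """
    n_rows = dataset.shape[0]  # shape comes from metadata, no data is read
    for start in range(0, n_rows, block_rows):
        stop = min(start + block_rows, n_rows)
        yield dataset[start:stop]  # slicing reads only these rows


# Build a small demo file; a real file would already exist on disk.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("matrix", data=np.arange(50, dtype="f8").reshape(10, 5))

with h5py.File(path, "r") as f:
    dset = f["matrix"]  # a Dataset handle, not an array
    total = 0.0
    n_blocks = 0
    for block in iter_row_blocks(dset, block_rows=4):
        total += block.sum()  # work on one block at a time
        n_blocks += 1

print(n_blocks, total)
```

With a real file the block size would be chosen to fit comfortably in memory; if the dataset is chunked, aligning blocks with the chunk shape may reduce the amount of data read per slice.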
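
As a rough illustration of that point, the sketch below lists the shape and dtype of every dataset in a file, including those inside groups, without slicing any of them. The file layout (a top-level "dset" and a group "grp" holding "inner") and the callback name record are assumptions made up for the example; visititems is the h5py method that walks groups and datasets.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small demo file with one top-level dataset and one inside a group.
path = os.path.join(tempfile.mkdtemp(), "demo_groups.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("dset", data=np.zeros((2, 3)))
    grp = f.create_group("grp")
    grp.create_dataset("inner", data=np.ones((4,), dtype="i4"))

summary = {}


def record(name, obj):
    # Called once for every group and dataset under the file root.
    if isinstance(obj, h5py.Dataset):
        # shape and dtype are attributes of the handle; no slicing happens here
        summary[name] = (obj.shape, str(obj.dtype))


with h5py.File(path, "r") as f:
    f.visititems(record)

for name, (shape, dtype) in summary.items():
    print(name, shape, dtype)
```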

http://docs.h5py.org/en/latest/high/dataset.html#reading-writing-data

hpaulj answered Oct 18 '22 12:10