 

Does h5py read the whole file into memory?

Tags:

h5py

Does h5py read the whole file into memory?

If so, what if I have a very very big file?

If not, will it be quite slow if I have to read from the hard disk every time I want a single data point? How can I make it faster?

asked Nov 06 '16 by Nick Qian


People also ask

How do HDF5 files work?

The Hierarchical Data Format version 5 (HDF5) is an open source file format that supports large, complex, heterogeneous data. HDF5 uses a "file directory"-like structure that allows you to organize data within the file in many different structured ways, much as you organize files on your computer.
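For illustration, here is a minimal sketch of that hierarchy in h5py (the file name and dataset contents below are made up):

    import h5py
    import numpy as np

    # Groups behave like directories; datasets like files inside them.
    with h5py.File("example.h5", "w") as f:
        grp = f.create_group("experiment1")
        grp.create_dataset("temperatures", data=np.random.rand(1000))
        # Nested paths can also be created in one call, like "mkdir -p":
        f.create_dataset("experiment2/pressures", data=np.random.rand(500))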

Why is HDF5 file so large?

This is probably due to your chunk layout: the smaller the chunks, the more your HDF5 file will be bloated. Try to find an optimal balance between chunk sizes (so your use case is served properly) and the size overhead they introduce in the HDF5 file.

What is h5py used for?

The h5py package is a Pythonic interface to the HDF5 binary data format. HDF5 lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays.

How do I read my HDF5 data?

To open and read data we use the same File method, in read mode, r. To see what data is in the file, we can call the keys() method on the file object. We can then grab a dataset using the get method, specifying its name. This returns an HDF5 dataset object.
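As a sketch, reading back a file laid out like the hypothetical example above would look like this:

    import h5py

    with h5py.File("example.h5", "r") as f:
        print(list(f.keys()))                     # ['experiment1', 'experiment2']
        dset = f.get("experiment1/temperatures")  # same as f["experiment1/temperatures"]
        print(dset.shape, dset.dtype)             # metadata only; nothing read yet
        data = dset[:]                            # now the values are loaded as a NumPy array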


1 Answer

Does h5py read the whole file into memory?

No, it does not. In particular, slicing (dataset[50:100]) allows you to load fractions of a dataset into memory. For details, see the h5py docs.
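For example (the file and dataset names here are placeholders):

    import h5py

    with h5py.File("bigfile.h5", "r") as f:
        dset = f["data"]       # just a handle; no data has been read yet
        part = dset[50:100]    # only these 50 rows are read from disk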

If not, will it be quite slow if I have to read from the hard disk every time I want a single data point?

In general, HDF5 is very fast. But reading from memory is obviously faster than reading from disk. It's your decision how much of a dataset is read into memory (dataset[:] loads the whole dataset).
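If the whole dataset doesn't fit in memory, one common pattern is to process it block by block; a sketch (the block size and names are arbitrary):

    import h5py

    with h5py.File("bigfile.h5", "r") as f:
        dset = f["data"]
        block = 10_000
        total = 0.0
        for start in range(0, dset.shape[0], block):
            chunk = dset[start:start + block]  # only this slice is in memory
            total += chunk.sum()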

How can I make it faster?

If you care to optimize performance, you should read the sections about chunking and compression. There's also a book that explains these things in detail (disclaimer: I'm not the author).
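Both are set when the dataset is created; here's a sketch with made-up shapes (the chunk shape should match your typical access pattern):

    import h5py
    import numpy as np

    with h5py.File("tuned.h5", "w") as f:
        dset = f.create_dataset(
            "data",
            shape=(1_000_000, 100),
            dtype="f8",
            chunks=(10_000, 100),   # one chunk per 10,000 rows
            compression="gzip",     # transparent compression
            compression_opts=4,     # gzip level 1-9
        )
        dset[:10_000] = np.random.rand(10_000, 100)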

answered Sep 20 '22 by weatherfrog