Read from a large file without loading whole thing into memory using h5py

Tags:

python

hdf5

h5py

Does the following read part of a dataset without loading the entire thing into memory at once (the whole dataset will not fit into memory), and does it get the size of the dataset without loading the data, using h5py in Python? If not, how can I do that?

import h5py

h5 = h5py.File('myfile.h5', 'r')
mydata = h5.get('matirx')  # are all data loaded into memory by using h5.get?
part_of_mydata = mydata[1000:11000, :]
size_data = mydata.shape

Thanks.

705

asked Jan 31 '17 03:01

superMind

1 Answer

get (or indexing) fetches a reference to the Dataset on the file, but does not load any data.

In [789]: list(f.keys())
Out[789]: ['dset', 'dset1', 'vset']
In [790]: d=f['dset1']
In [791]: d
Out[791]: <HDF5 dataset "dset1": shape (2, 3, 10), type "<f8">
In [792]: d.shape         # shape of dataset
Out[792]: (2, 3, 10)
In [793]: arr=d[:,:,:5]    # indexing the set fetches part of the data
In [794]: arr.shape
Out[794]: (2, 3, 5)
In [795]: type(d)
Out[795]: h5py._hl.dataset.Dataset
In [796]: type(arr)
Out[796]: numpy.ndarray

The Dataset d is array-like, but it is not actually a numpy array.
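To tie this back to the question, here is a minimal self-contained sketch of the same pattern. The temporary file and the dataset name "matrix" are made up for the demo; the point is only that opening the dataset gives a handle, the shape comes from metadata, and data is read when you slice.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small demo file; the path and the dataset name "matrix" are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("matrix", data=np.arange(200, dtype="f8").reshape(20, 10))

with h5py.File(path, "r") as f:
    dset = f["matrix"]               # a Dataset handle; no array data is read here
    shape = dset.shape               # taken from the dataset's metadata
    dtype = dset.dtype
    part = dset[5:15, :]             # only this block of rows is read into memory
    missing = f.get("no_such_name")  # get() returns None when the name is absent

print(shape, dtype, type(part).__name__, part.shape)
```

Note that `f.get(name)` quietly returns `None` for a missing or misspelled name, while `f[name]` raises `KeyError`, so indexing tends to surface typos in dataset names sooner.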

Fetch the whole Dataset with:

In [798]: arr = d[:]
In [799]: type(arr)
Out[799]: numpy.ndarray
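If the whole dataset will not fit in memory, `d[:]` is not an option, but you can walk over it in blocks of rows instead. This is only a sketch under assumptions: the demo file, the dataset name "matrix", and the block size of 256 rows are all made up, and the running sum stands in for whatever per-block work you actually need.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical demo file standing in for a dataset too large to load at once.
path = os.path.join(tempfile.mkdtemp(), "blocks.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("matrix", data=np.ones((1000, 4)))

block_rows = 256  # assumed block size; pick one that fits comfortably in memory
total = 0.0
with h5py.File(path, "r") as f:
    dset = f["matrix"]
    n_rows = dset.shape[0]
    for start in range(0, n_rows, block_rows):
        stop = min(start + block_rows, n_rows)
        block = dset[start:stop, :]   # only this block is held as a numpy array
        total += block.sum()

print(total)
```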

Exactly how much of the file it has to read to fetch your slice depends on the slicing, data layout, chunking, and other things that generally aren't under your control and shouldn't worry you.
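If you are curious about the layout of a particular dataset, the `chunks` attribute reports the chunk shape, or `None` when chunked storage is not used. The sketch below creates two hypothetical datasets just to show the difference; the names, shapes, and the chunk shape (100, 100) are assumptions for the demo.

```python
import os
import tempfile

import h5py

# Hypothetical file with one contiguous and one chunked dataset.
path = os.path.join(tempfile.mkdtemp(), "layout.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("contiguous", shape=(1000, 100), dtype="f4")
    f.create_dataset("chunked", shape=(1000, 100), dtype="f4", chunks=(100, 100))

with h5py.File(path, "r") as f:
    contiguous_layout = f["contiguous"].chunks  # None: chunked storage not used
    chunked_layout = f["chunked"].chunks        # the chunk shape as a tuple
    rows = f["chunked"][0:100, :]               # a slice aligned with one chunk

print(contiguous_layout, chunked_layout, rows.shape)
```

You only choose the chunk shape when you create the dataset yourself; for a file someone else wrote, you read with whatever layout it already has.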

Note also that when reading one dataset I'm not loading the others. The same applies to groups.
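A small sketch of the group case, with a made-up group "run1" holding two made-up datasets: listing the members and checking their shapes does not read their contents, and slicing one dataset leaves the other untouched.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file with one group holding two datasets.
path = os.path.join(tempfile.mkdtemp(), "groups.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("run1")
    grp.create_dataset("a", data=np.arange(10))
    grp.create_dataset("b", data=np.arange(5))

with h5py.File(path, "r") as f:
    grp = f["run1"]                                      # a Group handle
    names = sorted(grp.keys())                           # member names only
    shapes = {name: grp[name].shape for name in names}   # metadata, no data read
    a_head = grp["a"][:3]                                # reads part of "a"; "b" is not read

print(names, shapes, a_head)
```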

http://docs.h5py.org/en/latest/high/dataset.html#reading-writing-data

186

answered Oct 18 '22 12:10

hpaulj
