h5py: Correct way to slice array datasets

Tags: python, numpy, h5py

I'm a bit confused here:

As far as I understand, h5py's .value method reads an entire dataset and dumps it into an array, which is slow and discouraged (and should generally be replaced by [()]). The correct way is to use numpy-esque slicing.

However, I'm getting puzzling results (with h5py 2.2.1):

>>> import h5py
>>> import numpy as np
>>> file = h5py.File("test.hdf5",'w')
# Just fill a test file with a numpy array test dataset
>>> file["test"] = np.arange(0,300000)

# This is TERRIBLY slow?!
>>> file["test"][range(0,300000)]
array([     0,      1,      2, ..., 299997, 299998, 299999])
# This is fast
>>> file["test"].value[range(0,300000)]
array([     0,      1,      2, ..., 299997, 299998, 299999])
# This is also fast
>>> file["test"].value[np.arange(0,300000)]
array([     0,      1,      2, ..., 299997, 299998, 299999])
# This crashes
>>> file["test"][np.arange(0,300000)]

I guess that my dataset is so small that .value doesn't hinder performance significantly, but how can the first option be that slow? What is the preferred version here?

Thanks!

UPDATE: It seems that I wasn't clear enough, sorry. I do know that .value copies the whole dataset into memory while slicing only retrieves the appropriate subpart. What I'm wondering is why slicing in the file is slower than copying the whole array and then slicing in memory. I always thought hdf5/h5py was implemented specifically so that slicing subparts would always be the fastest.

asked Feb 13 '14 21:02

JiaYow

1 Answer

For fast slicing with h5py, stick to the "plain-vanilla" slice notation:

file['test'][0:300000]

or, for example, reading every other element:

file['test'][0:300000:2]

Simple slicing (slice objects and single integer indices) should be very fast, as it translates directly into HDF5 hyperslab selections.
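As a minimal sketch of the simple selections described above, the snippet below writes the same test dataset to a scratch file and reads it back with plain slices. The temporary path and the variable names are assumptions made for illustration only.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "slicing_demo.hdf5")

# Write the test dataset from the question.
with h5py.File(path, "w") as f:
    f["test"] = np.arange(300000)

with h5py.File(path, "r") as f:
    dset = f["test"]
    first_block = dset[0:1000]      # contiguous range
    every_other = dset[0:300000:2]  # strided range
    single = dset[42]               # single integer index
    everything = dset[()]           # whole dataset as a NumPy array
```

Each of these reads returns an in-memory NumPy array (or a scalar for the single index), so any further indexing on the result is ordinary NumPy indexing.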

The expression file['test'][range(300000)] invokes h5py's version of "fancy indexing", namely, indexing via an explicit list of indices. There's no native way to do this in HDF5, so h5py implements a (slower) method in Python, which unfortunately has abysmal performance when the list has more than about 1000 elements. The same applies to file['test'][np.arange(300000)], which is interpreted in the same way.
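If you do need an explicit list of indices, one possible workaround is to read the enclosing range with a single plain slice and then apply the index list to the resulting NumPy array in memory. The helper below is a hypothetical sketch, not part of h5py: the name read_scattered is made up, and it assumes a 1-D dataset, non-negative indices, and that the span between the smallest and largest index fits in memory.

```python
import os
import tempfile

import h5py
import numpy as np


def read_scattered(dset, indices):
    """Read arbitrary indices from a 1-D dataset using one slice read.

    Hypothetical helper (not an h5py API). Assumes non-negative indices.
    """
    indices = np.asarray(indices)
    if indices.size == 0:
        return np.empty(0, dtype=dset.dtype)
    lo, hi = int(indices.min()), int(indices.max()) + 1
    block = dset[lo:hi]         # one plain slice read from the file
    return block[indices - lo]  # NumPy fancy indexing, done in memory


# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "fancy_demo.hdf5")
with h5py.File(path, "w") as f:
    f["test"] = np.arange(300000)

with h5py.File(path, "r") as f:
    wanted = np.array([5, 17, 299999, 3])
    picked = read_scattered(f["test"], wanted)
```

The trade-off is that the whole range between the smallest and largest index is read, so this only pays off when that range is small enough to hold in memory, as it is for the dataset in the question.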

Andrew Collette
