 

Why is it faster to read whole hdf5 dataset than a slice

Tags:

python

io

hdf5

h5py

I'm trying to figure out why this happens:

In [1]: import time, h5py as h5
In [2]: f = h5.File('myfile.hdf5', 'r')                                                                                                                                    
In [3]: st = time.time(); data = f["data"].value[0,:,1,...]; elapsed = time.time() - st;
In [4]: elapsed
Out[4]: 11.127676010131836
In [5]: st = time.time(); data = f["data"][0,:,1,...]; elapsed2 = time.time() - st;
In [6]: elapsed2
Out[6]: 59.810582399368286
In [7]: f["data"].shape
Out[7]: (1, 4096, 6, 16, 16, 16, 16)
In [8]: f["data"].chunks
Out[8]: (1, 4096, 1, 16, 16, 16, 16)

As you can see, loading the whole dataset into memory and then taking a slice is faster than taking that same slice from the dataset.

The chunk size matches the slice, so it should all be contiguous memory, right? Why then is it so much slower?

The dataset is compressed with gzip (opts=2).
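For reference, here's a minimal sketch that reproduces the setup on a scaled-down file with the same chunk layout and gzip settings (the smaller shape and the file name myfile_small.hdf5 are just assumptions to keep the example quick to build):

import time
import numpy as np
import h5py as h5

# Scaled-down stand-in for the (1, 4096, 6, 16, 16, 16, 16) dataset above;
# the smaller shape is an assumption so the file builds in seconds.
shape  = (1, 512, 6, 8, 8, 8, 8)
chunks = (1, 512, 1, 8, 8, 8, 8)   # whole axis 1, one index of axis 2 per chunk
arr = np.random.random(shape).astype('f4')  # random data compresses poorly; real data may differ

with h5.File('myfile_small.hdf5', 'w') as f:
    f.create_dataset('data', data=arr, chunks=chunks,
                     compression='gzip', compression_opts=2)

with h5.File('myfile_small.hdf5', 'r') as f:
    st = time.time()
    a = f['data'][...][0, :, 1, ...]   # read everything, then slice in NumPy
    print('full read + NumPy slice:', time.time() - st)

    st = time.time()
    b = f['data'][0, :, 1, ...]        # let HDF5 do the selection
    print('HDF5 slice:', time.time() - st)

assert np.array_equal(a, b)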

Following Andrew's comment, I ran it again, clearing the caches between the two reads:

elapsed1: 11.001180410385132
elapsed2: 43.19723725318909
48.61user 4.45system 0:54.65elapsed 97%CPU (0avgtext+0avgdata 8431596maxresident)k
479584inputs+0outputs (106major+3764414minor)pagefaults 0swaps

(This next run had a 10-second delay between the two reads to clear the caches.)

elapsed1: 11.46790862083435
elapsed2: 43.438515186309814

48.54user 4.66system 1:05.71elapsed 80%CPU (0avgtext+0avgdata 8431944maxresident)k
732504inputs+0outputs (220major+3764449minor)pagefaults 0swaps
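For reference, the usual way to drop the Linux page cache between reads looks roughly like this (needs root; a sketch, not necessarily the exact method used above):

import subprocess

def drop_linux_page_cache():
    """Flush dirty pages, then drop the OS page cache (Linux only, needs root),
    so the next read really comes from disk instead of memory."""
    subprocess.run(['sync'], check=True)
    subprocess.run(['sh', '-c', 'echo 3 > /proc/sys/vm/drop_caches'], check=True)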
asked Nov 24 '18 by mjgalindo



1 Answer

First I ran a test of my own. I don't have your HDF5 file, so I used one of my test files: a Table dataset with ~54,000 rows (a simpler layout than your 7-D array, so not directly comparable).
Timing with .value gives:

>>> elapsed
0.15540122985839844

Timing result with NumPy indexing gives:

>>> elapsed2
0.12980079650878906

So, I don't see much difference in performance. Maybe it's related to the dataset sizes we're testing, or to the complexity of the data tables?

A little reading of the most recent h5py documentation turns up some interesting comments about Dataset.value (from Release 2.8.0 - Jun 05, 2018; emphasis mine):
Dataset.value property is now deprecated.
The property Dataset.value, which dates back to h5py 1.0, is deprecated and will be removed in a later release. This property dumps the entire dataset into a NumPy array. Code using .value should be updated to use NumPy indexing, using mydataset[...] or mydataset[()] as appropriate.
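So the migration the docs call for is mechanical; a quick sketch, assuming f is an open h5py.File:

# Old style (deprecated since h5py 2.8, removed in h5py 3.0):
data = f["data"].value

# Current equivalents:
data   = f["data"][...]        # full read of an n-dimensional dataset
scalar = f["my_scalar"][()]    # full read of a scalar dataset ("my_scalar" is hypothetical)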

Your timing tests seem to run counter to the highlighted observation above.

I think you need to ask an h5py developer to comment on the performance difference (and on where the data is stored -- in memory vs on disk). Have you checked with the h5py user group?

Edit: After posting, I found this SO Q&A. It has lots of good comments and includes responses from the h5py developer:
h5py: Correct way to slice array datasets
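One practical takeaway from that thread, consistent with your timings: when a slice pattern is expensive through HDF5's selection machinery, it can be cheaper to read the whole dataset into memory once (if it fits in RAM) and slice with NumPy. A rough sketch:

import h5py as h5

with h5.File('myfile.hdf5', 'r') as f:
    full = f['data'][...]      # one pass through the chunked/compressed data

# Any number of slices are now cheap in-memory NumPy operations:
view = full[0, :, 1, ...]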

answered Nov 14 '22 by kcw78