pytables writes much faster than h5py. Why?

I noticed that writing .h5 files takes much longer with the h5py library than with the PyTables library. What is the reason? This is also true when the shape of the array is known in advance. Furthermore, I use the same chunkshape and no compression filter.

The following script:

import h5py
import tables
import numpy as np
from time import time

dim1, dim2 = 64, 1527416

# append columns
print("PYTABLES: append columns")
print("=" * 32)
f = tables.open_file("/tmp/test.h5", "w")
a = f.create_earray(f.root, "time_data", tables.Float32Atom(), shape=(0, dim1))
t1 = time()
zeros = np.zeros((1, dim1), dtype="float32")
for i in range(dim2):
    a.append(zeros)
tcre = round(time() - t1, 3)
thcre = round(dim1 * dim2 * 4 / (tcre * 1024 * 1024), 1)
print("Time to append %d columns: %s sec (%s MB/s)" % (i+1, tcre, thcre))
print("=" * 32)
chunkshape = a.chunkshape
f.close()

print("H5PY: append columns")
print("=" * 32)
f = h5py.File(name="/tmp/test.h5",mode='w')
a = f.create_dataset(name='time_data',shape=(0, dim1),
                     maxshape=(None,dim1),dtype='f',chunks=chunkshape)
t1 = time()
zeros = np.zeros((1, dim1), dtype="float32")
samplesWritten = 0
for i in range(dim2):
    a.resize((samplesWritten+1, dim1))
    a[samplesWritten:(samplesWritten+1),:] = zeros
    samplesWritten += 1
tcre = round(time() - t1, 3)
thcre = round(dim1 * dim2 * 4 / (tcre * 1024 * 1024), 1)
print("Time to append %d columns: %s sec (%s MB/s)" % (i+1, tcre, thcre))
print("=" * 32)
f.close()

returns on my computer:

PYTABLES: append columns
================================
Time to append 1527416 columns: 22.679 sec (16.4 MB/s)
================================
H5PY: append columns
================================
Time to append 1527416 columns: 158.894 sec (2.3 MB/s)
================================

If I flush after every loop iteration, like this (shown for PyTables):

for i in range(dim2):
    a.append(zeros)
    f.flush()
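
The h5py loop gets the analogous change, a flush at the end of each write (a sketch of that loop, reusing f, a, zeros, and samplesWritten from the h5py script above):

for i in range(dim2):
    a.resize((samplesWritten+1, dim1))
    a[samplesWritten:samplesWritten+1, :] = zeros
    samplesWritten += 1
    f.flush()  # force buffered data to disk on every iteration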

I get:

PYTABLES: append columns
================================
Time to append 1527416 columns: 67.481 sec (5.5 MB/s)
================================
H5PY: append columns
================================
Time to append 1527416 columns: 193.644 sec (1.9 MB/s)
================================

1 Answer

This is an interesting comparison of PyTables and h5py write performance. Typically I use them to read HDF5 files (usually with a few reads of large datasets), so I hadn't noticed this difference. My thoughts align with @max9111's: performance should improve as the number of write operations decreases and the size of each written block increases. To that end, I reworked your code to write N rows of data per loop iteration, using fewer iterations. (Code is at the end.)
Results were surprising (to me). Key findings:
1. Total time to write all of the data was a linear function of the number of loop iterations (for both PyTables and h5py).
2. The performance gap between PyTables and h5py only narrowed slightly as the I/O block size increased.
3. PyTables was 5.4x faster writing 1 row at a time (1,527,416 writes), and 3.5x faster writing 88 rows at a time (17,357 writes).

Here is a plot comparing write performance for both libraries.
[Chart: performance comparison, with values from the benchmark data above.]

Also, I noticed your code comments say "append columns", but you are extending the first dimension (rows of an HDF5 table/dataset). I rewrote your code to test performance when extending the second dimension (adding columns to the HDF5 dataset) and saw very similar performance.
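
For illustration, the h5py side of that column-extending test looks roughly like this (a sketch, not my exact benchmark code; the file name is a placeholder, and it uses the same cdim, block_size, row_loops, and vals as the full example at the end):

import h5py
import numpy as np

cdim, block_size, row_loops = 64, 4, 381854
vals = np.ones((block_size, cdim), dtype="float32")

f = h5py.File("colapp_test_h5.h5", "w")  # placeholder file name
# extendable along the SECOND dimension this time
a = f.create_dataset(name="time_data", shape=(cdim, 0),
                     maxshape=(cdim, block_size*row_loops), dtype='f')
for i in range(row_loops):
    a.resize((cdim, (i+1)*block_size))
    # transpose the block so its shape matches (cdim, block_size)
    a[:, i*block_size:(i+1)*block_size] = vals.T
f.close()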

Initially I thought the I/O bottleneck was due to resizing the datasets, so I rewrote the example to size the array up front to hold all of the rows. This did NOT improve performance (and it significantly degraded h5py performance). That was very surprising; I'm not sure what to make of it.
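
To show what I mean by pre-sizing, here is a sketch of the h5py variant (again not my exact code; the file name is a placeholder, and I let h5py auto-chunk here, whereas the real test reused the PyTables chunkshape):

import h5py
import numpy as np

cdim, block_size, row_loops = 64, 4, 381854
vals = np.ones((block_size, cdim), dtype="float32")

f = h5py.File("rowapp_presized_h5.h5", "w")  # placeholder file name
# allocate the full, final shape up front
a = f.create_dataset(name="time_data", shape=(block_size*row_loops, cdim),
                     dtype='f', chunks=True)  # chunks=True -> auto-chunking
for i in range(row_loops):
    # no resize() needed -- the dataset already has its final shape
    a[i*block_size:(i+1)*block_size] = vals
f.close()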

Here is my example. It uses 3 variables that size the array (as data is added):

  • cdim: # of columns (fixed)
  • row_loops: # of write loops
  • block_size: size of data block written on each loop
  • row_loops*block_size = total number of rows written
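
For example, the combinations behind the 1-row and 88-row results above, plus the 4-row case used in the code below, all write the same total number of rows:

    1 × 1,527,416 = 4 × 381,854 = 88 × 17,357 = 1,527,416 rows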

I also made two small changes: I write ones instead of zeros (to verify the data was actually written), and I create the data array at the top of the script, outside of the timing loops.

Here is my code:

import h5py
import tables
import numpy as np
from time import time

cdim, block_size, row_loops = 64, 4, 381854 
vals = np.ones((block_size, cdim), dtype="float32")

# append rows
print("PYTABLES: append rows: %d blocks with: %d rows" % (row_loops, block_size))
print("=" * 32)
f = tables.open_file("rowapp_test_tb.h5", "w")
a = f.create_earray(f.root, "time_data", atom=tables.Float32Atom(), shape=(0, cdim))
t1 = time()
for i in range(row_loops):
    a.append(vals)
tcre = round(time() - t1, 3)
thcre = round(cdim * block_size * row_loops * 4 / (tcre * 1024 * 1024), 1)
print("Time to append %d rows: %s sec (%s MB/s)" % (block_size * row_loops, tcre, thcre))
print("=" * 32)
chunkshape = a.chunkshape
f.close()

print("H5PY: append rows %d blocks with: %d rows" % (row_loops, block_size))
print("=" * 32)
f = h5py.File(name="rowapp_test_h5.h5",mode='w')
a = f.create_dataset(name='time_data',shape=(0, cdim),
                     maxshape=(block_size*row_loops,cdim),
                     dtype='f',chunks=chunkshape)
t1 = time()
samplesWritten = 0
for i in range(row_loops):
    a.resize(((i+1)*block_size, cdim))
    a[samplesWritten:samplesWritten+block_size] = vals
    samplesWritten += block_size
tcre = round(time() - t1, 3)
thcre = round(cdim * block_size * row_loops * 4 / (tcre * 1024 * 1024), 1)
print("Time to append %d rows: %s sec (%s MB/s)" % (block_size * row_loops, tcre, thcre))
print("=" * 32)
f.close()