I am having a problem renaming datasets in HDF5. The process is EXTREMELY slow. I read some documentation stating that dataset names are merely links to the data, so an acceptable way to rename is:
group['new_name'] = group['old_name']
del group['old_name']
But this is so slow (only 5% complete after running overnight) that it makes me think my process is entirely wrong.
I'm using python h5py, and here's my slow code:
import h5py
from tqdm import tqdm

# Open file
with h5py.File('test.hdf5') as f:
    # Get all top level groups
    top_keys = [key for key in f.keys()]
    # Iterate over each group
    for top_key in top_keys:
        group = f[top_key]
        # Number of digits needed to zero-pad every key to the same width
        tot_digits = len(str(len(group)))
        # Rename all datasets in the group (pad with zeros)
        for key in tqdm(group.keys()):
            new_key = str(key)
            while len(new_key) < tot_digits:
                new_key = '0' + str(new_key)
            group[new_key] = group[key]
            del group[key]
Per @jpp's suggestion, I also tried replacing the last two lines with group.move:

group.move(key, new_key)
But this method was equally slow. I have several groups with the same number of datasets, but each group has different size datasets. The groups with the largest datasets (most bytes) seem to rename the slowest.
Certainly there is a way to do this quickly. Is the dataset name just a symbolic link? Or does renaming inherently cause the entire dataset to be rewritten? How should I go about renaming many datasets in an HDF5 file?
This is probably due to your chunk layout: the smaller your chunks, the more bloated your HDF5 file becomes. Try to find a balance between chunk sizes that fit your use case and the size overhead they introduce in the HDF5 file.
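For illustration, the chunk layout is chosen when a dataset is created. A minimal sketch is below; the file name, dataset name, and chunk shape are placeholders, not recommendations:

import numpy as np
import h5py

data = np.random.rand(1_000_000)

with h5py.File('chunked.hdf5', 'w') as f:
    # Fewer, larger chunks mean less per-chunk bookkeeping in the file;
    # chunks of roughly 64 KiB to 1 MiB are a common starting point.
    f.create_dataset('big', data=data, chunks=(131072,))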
One possible culprit, at least if you have a large number of datasets under your top-level groups, is that you are creating the new name in a very inefficient way. Instead of
while len(new_key) < tot_digits:
    new_key = '0' + str(new_key)
You should generate the new key like this:
if len(new_key) < tot_digits:
    new_key = (tot_digits - len(new_key)) * '0' + new_key
This way you don't create a new string object for every extra digit you need to add.
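For what it's worth, Python's built-in str.zfill does the same left-padding in a single call, so the whole while loop can be replaced with:

# str.zfill pads with leading zeros up to the requested width
new_key = str(key).zfill(tot_digits)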
It is also possible, although I can't confirm this, that group.keys() returns a live view that gets repopulated with the new key names you add, since you modify the group while iterating over its keys. A standard Python dictionary view would raise a RuntimeError in that situation, but it's unclear whether h5py does the same. To be sure you don't have that problem, you can simply build a list of the keys up front:
for key in tqdm(list(group.keys())):
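Putting both fixes together with group.move, a sketch of the full loop might look like this (assuming the same file layout as in the question; the 'r+' mode is needed on newer h5py versions, where files open read-only by default):

import h5py
from tqdm import tqdm

with h5py.File('test.hdf5', 'r+') as f:
    for top_key in list(f.keys()):
        group = f[top_key]
        # Width needed to zero-pad every key to the same length
        tot_digits = len(str(len(group)))
        # Snapshot the keys so new names can't feed back into the loop
        for key in tqdm(list(group.keys())):
            new_key = str(key).zfill(tot_digits)
            if new_key != key:
                group.move(key, new_key)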