 

Deleting information from an HDF5 file

Tags:

python

hdf5

I realize that an SO user previously asked this question, but it was asked in 2009, and I was hoping that more knowledge of HDF5 was now available or that newer versions had fixed this particular issue. To restate the question here in terms of my own problem:

I have a gigantic file of nodes and elements from a large geometry and have already retrieved all the useful information I need from it. Therefore, in Python, I am trying to keep the original file but delete the information I do not need and fill in more information from other sources. For example, I have a dataset of nodes that I don't need. However, I need to keep the neighboring dataset and include information about its indices from an outside file. Is there any way to delete these specific datasets?

Or does the old idea of having "placekeepers" in the HDF5 file still hold true, such that no one knows how (or bothers) to remove info? I'm not too worried about the empty space, as long as it is faster to simply remove and add information than to create an entirely new file.

Note: I'm using h5py's 'r+' mode to read and write.

Asked Jun 25 '12 by Ason

People also ask

Why is HDF5 file so large?

This is probably due to your chunk layout - the smaller your chunk sizes, the more bloated your HDF5 file will be. Try to find an optimal balance between chunk sizes (to serve your use case properly) and the size overhead they introduce in the HDF5 file.

How does HDF5 store data?

HDF5 uses a "file directory" like structure that allows you to organize data within the file in many different structured ways, as you might do with files on your computer. The HDF5 format also allows for embedding of metadata making it self-describing.
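A minimal sketch of that directory-like layout, assuming h5py is installed (the filename, group path, and attribute are made up for illustration):

```python
import h5py
import numpy as np

# Groups nest like directories, and any object can carry metadata attributes,
# which is what makes an HDF5 file self-describing.
with h5py.File("mesh.h5", "w") as f:              # hypothetical filename
    grp = f.create_group("geometry/nodes")        # nested groups, like folders
    dset = grp.create_dataset("coords", data=np.zeros((10, 3)))
    dset.attrs["units"] = "mm"                    # metadata stored alongside data
```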

How do I check my HDF5 data?

To view the metadata of a data object, right-click on the object and then select 'Show Properties'. A window will open and display metadata information such as name, type, attributes, data type, and data space.

Are HDF5 files compressed?

The HDF5 file format and library provide flexibility to use a variety of data compression filters on individual datasets in an HDF5 file. Compressed data is stored in chunks and automatically uncompressed by the library and filter plugin when a chunk is accessed.
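A short sketch of per-dataset compression in h5py (dataset name and sizes are arbitrary); note that compression requires chunked storage:

```python
import h5py
import numpy as np

with h5py.File("compressed.h5", "w") as f:        # hypothetical filename
    f.create_dataset(
        "elements",
        data=np.arange(100_000),
        chunks=(10_000,),        # filters operate on chunks, so chunking is required
        compression="gzip",      # each chunk is compressed with the gzip filter
        compression_opts=4,      # gzip level (1-9)
    )
# Reads are transparent: the library decompresses chunks as they are accessed.
```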


1 Answer

Removing entire nodes (groups or datasets) from an HDF5 file should be no problem.
However, if you want to reclaim the space, you have to run the h5repack tool.

From the hdf5 docs:

5.5.2. Deleting a Dataset from a File and Reclaiming Space

HDF5 does not at this time provide an easy mechanism to remove a dataset from a file or to reclaim the storage space occupied by a deleted object.

Removing a dataset and reclaiming the space it used can be done with the H5Ldelete function and the h5repack utility program. With the H5Ldelete function, links to a dataset can be removed from the file structure. After all the links have been removed, the dataset becomes inaccessible to any application and is effectively removed from the file. The way to recover the space occupied by an unlinked dataset is to write all of the objects of the file into a new file. Any unlinked object is inaccessible to the application and will not be included in the new file. Writing objects to a new file can be done with a custom program or with the h5repack utility program.

Alternatively, you can also have a look at PyTables's ptrepack tool. PyTables should be able to read h5py-created HDF5 files, and ptrepack is similar to h5repack.

If you want to remove records from a dataset, then you probably have to retrieve the records you want to keep, create a new dataset, and remove the old one.
PyTables supports removing rows, though it's not recommended.

Answered Oct 22 '22 by Ümit