
Searching a HDF5 dataset

Tags:

search

hdf5

I'm currently exploring HDF5. I've read the interesting comments in the thread "Evaluating HDF5", and I understand that HDF5 is a solution of choice for storing data, but how do you query it? For example, say I have a big file containing some identifiers: is there a way to quickly find out whether a given identifier is present in the file?

Pierre asked Nov 06 '09 10:11


2 Answers

I think the answer is "not directly".

Here are some of the ways I think you could achieve the functionality.

Use groups:

A hierarchy of groups could be used in the form of a Radix Tree to store the data. This probably doesn't scale too well though.
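As a rough sketch of the group-based layout, each character of an identifier could become one level of the group hierarchy, so that a lookup is a walk down a path. The helpers below only build that path and its prefixes. The names IdentifierToGroupPath and PathPrefixes are hypothetical, and the one-character-per-level split is an assumption (a real radix tree would merge single-child chains). Each prefix could then be tested in turn with a link-existence call such as H5Lexists, which, as far as I know, only checks the final element of a path and expects the intermediate groups to exist.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: map an identifier to a nested group path,
// one character per level, e.g. "abc" -> "/a/b/c".
std::string IdentifierToGroupPath(const std::string& rIdentifier)
{
    // '/' separates HDF5 path elements, so it cannot appear in an identifier here.
    assert(rIdentifier.find('/') == std::string::npos);
    std::string path;
    for (char c : rIdentifier)
    {
        path += '/';
        path += c;
    }
    return path;
}

// Hypothetical helper: list every prefix of a path, shortest first,
// e.g. "/a/b/c" -> {"/a", "/a/b", "/a/b/c"}.
std::vector<std::string> PathPrefixes(const std::string& rPath)
{
    std::vector<std::string> prefixes;
    for (std::size_t i = 1; i < rPath.size(); ++i)
    {
        if (rPath[i] == '/')
        {
            prefixes.push_back(rPath.substr(0, i));
        }
    }
    if (rPath.size() > 1)
    {
        prefixes.push_back(rPath); // the full path is the last prefix
    }
    return prefixes;
}
```

An identifier is then present only if every prefix returned by PathPrefixes(IdentifierToGroupPath(id)) exists as a link in the file. The depth of the hierarchy grows with identifier length, which is one reason this may not scale well.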

Use index datasets:

HDF5 has a reference type that could be used to link from separate index tables to a main table. After writing the main data, additional datasets, each sorted on a different key and holding references into the main table, can serve as indexes. For example:

MainDataset (sorted on identifier)
0: { A, "C", 2 }
1: { B, "B", 1 }
2: { C, "A", 3 }

StringIndex
0: { "A", Reference ("MainDataset", 2) }
1: { "B", Reference ("MainDataset", 1) }
2: { "C", Reference ("MainDataset", 0) }

IntIndex
0: { 1, Reference ("MainDataset", 1) }
1: { 2, Reference ("MainDataset", 0) }
2: { 3, Reference ("MainDataset", 2) }

To use the above, a binary search has to be written to look up a key in the index tables.
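The lookup could be sketched as follows, assuming the index table has already been read into memory. IndexEntry and FindRow are hypothetical names, and a plain row number stands in for the HDF5 reference, so this shows only the search, not the file access.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One row of a hypothetical index table: a key plus the row number in the
// main dataset (standing in for an HDF5 reference).
struct IndexEntry
{
    std::string key;
    std::size_t main_row;
};

// Binary search over an index sorted by key.
// Returns true and sets rRow if the key is present; returns false otherwise.
bool FindRow(const std::vector<IndexEntry>& rIndex,
             const std::string& rKey,
             std::size_t& rRow)
{
    // Precondition: the index must already be sorted by key.
    assert(std::is_sorted(rIndex.begin(), rIndex.end(),
                          [](const IndexEntry& a, const IndexEntry& b)
                          { return a.key < b.key; }));

    // lower_bound finds the first entry whose key is not less than rKey.
    std::vector<IndexEntry>::const_iterator it =
        std::lower_bound(rIndex.begin(), rIndex.end(), rKey,
                         [](const IndexEntry& entry, const std::string& key)
                         { return entry.key < key; });

    if (it != rIndex.end() && it->key == rKey)
    {
        rRow = it->main_row;
        return true;
    }
    return false;
}
```

With the StringIndex from the example, looking up "B" yields main row 1, and looking up a key that is not in the table returns false. For an index too large to load, the same search could read one entry per probe from the dataset instead.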

In-memory index:

Depending on the size of the dataset, it may be just as easy to use an in-memory index that is read from and written to its own dataset using something like Boost.Serialization.
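A minimal sketch of such an in-memory index, assuming identifiers are unique strings and using only the standard library. IdentifierIndex, BuildIndex, Flatten and Restore are hypothetical names. The flattened records are the form that would be handed to a serialization library or written to a separate dataset; that step is left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

typedef std::unordered_map<std::string, std::size_t> IdentifierIndex;
typedef std::vector<std::pair<std::string, std::size_t> > IndexRecords;

// Build the index from the identifier column of the main dataset:
// identifier -> row number.
IdentifierIndex BuildIndex(const std::vector<std::string>& rIdentifiers)
{
    IdentifierIndex index;
    for (std::size_t row = 0; row < rIdentifiers.size(); ++row)
    {
        index[rIdentifiers[row]] = row; // a repeated identifier keeps its last row
    }
    return index;
}

// Flatten the index to (identifier, row) records ready to be stored.
IndexRecords Flatten(const IdentifierIndex& rIndex)
{
    return IndexRecords(rIndex.begin(), rIndex.end());
}

// Rebuild the index from records read back from storage.
IdentifierIndex Restore(const IndexRecords& rRecords)
{
    IdentifierIndex index(rRecords.begin(), rRecords.end());
    assert(index.size() == rRecords.size()); // assumes identifiers are unique
    return index;
}
```

Once restored, a presence test is a single hash lookup, for example index.count("B") > 0, at the cost of holding every identifier in memory.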

HDF5-FastQuery:

This paper (and also this page) describe the use of bitmap indices to perform complex queries over an HDF dataset. I have not tried this.
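To give an idea of what a bitmap index is, here is a toy sketch. It is not the HDF5-FastQuery API: the names Bitmap, BuildBitmapIndex, RowsInRange and And are hypothetical, the columns are plain integer vectors, and the bitmaps are uncompressed, whereas practical implementations typically bin values and compress the bitmaps.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

typedef std::vector<bool> Bitmap;

// Build one bitmap per distinct value: bit i is set when rColumn[i] == value.
std::map<int, Bitmap> BuildBitmapIndex(const std::vector<int>& rColumn)
{
    std::map<int, Bitmap> index;
    for (std::size_t row = 0; row < rColumn.size(); ++row)
    {
        Bitmap& bitmap = index[rColumn[row]];
        bitmap.resize(rColumn.size(), false); // no-op after the first visit
        bitmap[row] = true;
    }
    return index;
}

// Rows whose value lies in [low, high]: the OR of the matching per-value bitmaps.
Bitmap RowsInRange(const std::map<int, Bitmap>& rIndex,
                   int low, int high, std::size_t numRows)
{
    Bitmap result(numRows, false);
    for (std::map<int, Bitmap>::const_iterator it = rIndex.lower_bound(low);
         it != rIndex.end() && it->first <= high; ++it)
    {
        assert(it->second.size() == numRows);
        for (std::size_t row = 0; row < numRows; ++row)
        {
            if (it->second[row])
            {
                result[row] = true;
            }
        }
    }
    return result;
}

// Rows that satisfy both conditions: the AND of two bitmaps.
Bitmap And(const Bitmap& rA, const Bitmap& rB)
{
    assert(rA.size() == rB.size());
    Bitmap result(rA.size(), false);
    for (std::size_t row = 0; row < rA.size(); ++row)
    {
        result[row] = rA[row] && rB[row];
    }
    return result;
}
```

A compound query such as "3 <= temperature <= 5 and 2 <= pressure <= 7" then becomes And(RowsInRange(temperatureIndex, 3, 5, n), RowsInRange(pressureIndex, 2, 7, n)), touching only the index bitmaps and not the data itself.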

Richard Corden answered Sep 20 '22 16:09


H5Lexists was introduced for this in HDF5 1.8.0:

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5L.html#Link-Exists

You can also iterate over the things that are in an HDF5 file with H5Literate:

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5L.html#Link-Iterate

With earlier versions you can check manually by trying to open the dataset. We use code like this to cope with any version of HDF5:

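bool DoesDatasetExist(const std::string& rDatasetName)
{
    // mFileId is the identifier of the already-open HDF5 file
    // (presumably a member of the enclosing class).
#if H5_VERS_MAJOR > 1 || (H5_VERS_MAJOR == 1 && H5_VERS_MINOR >= 8)
    // This is a nice method for testing existence, introduced in HDF5 1.8.0.
    // H5Lexists returns a positive value if the link exists, 0 if it does not,
    // and a negative value on failure.
    htri_t dataset_status = H5Lexists(mFileId, rDatasetName.c_str(), H5P_DEFAULT);
    return (dataset_status > 0);
#else
    bool result = false;
    // This is not a nice way of doing it because the error stack produces a load of 'HDF failed' output.
    // The "TRY" macros are a convenient way to temporarily turn the error stack off.
    H5E_BEGIN_TRY
    {
        // Pre-1.8 signature of H5Dopen (two arguments); a negative return value means failure.
        hid_t dataset_id = H5Dopen(mFileId, rDatasetName.c_str());
        if (dataset_id >= 0)
        {
            H5Dclose(dataset_id);
            result = true;
        }
    }
    H5E_END_TRY;
    return result;
#endif
}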
mirams answered Sep 23 '22 16:09