I'm currently exploring HDF5. I've read the interesting comments in the thread "Evaluating HDF5", and I understand that HDF5 is a solution of choice for storing data, but how do you query it? For example, say I have a big file containing some identifiers: is there a way to quickly find out whether a given identifier is present in the file?
I think the answer is "not directly".
Here are some of the ways I think you could achieve the functionality.
Use groups:
A hierarchy of groups could be used in the form of a Radix Tree to store the data. This probably doesn't scale too well though.
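As a rough illustration of the group-based idea, the sketch below turns an identifier into the chain of group paths that a radix-style layout might use, with one character per level. The helper name RadixGroupPaths and the one-character-per-level layout are assumptions made for this example, and it assumes identifiers never contain '/'.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of HDF5): returns every prefix path for an
// identifier, one group level per character.
// For example "AB1" gives "/A", "/A/B", "/A/B/1".
std::vector<std::string> RadixGroupPaths(const std::string& rIdentifier)
{
    std::vector<std::string> paths;
    std::string current;
    for (char c : rIdentifier)
    {
        current += '/';
        current += c;
        paths.push_back(current);  // path of the group at this depth
    }
    return paths;
}
```

Each returned path could then be tested in order with an existence check such as H5Lexists, stopping at the first level that is missing.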
Use index datasets:
HDF5 has a reference type that can be used to link from separate index tables to a main table. After writing the main data, you can write additional datasets, each sorted on a different key and holding references into the main dataset. For example:
MainDataset (sorted on identifier)
0: { A, "C", 2 }
1: { B, "B", 1 }
2: { C, "A", 3 }
StringIndex
0: { "A", Reference ("MainDataset", 2) }
1: { "B", Reference ("MainDataset", 1) }
2: { "C", Reference ("MainDataset", 0) }
IntIndex
0: { 1, Reference ("MainDataset", 1) }
1: { 2, Reference ("MainDataset", 0) }
2: { 3, Reference ("MainDataset", 2) }
To use these index tables, you will have to write a binary search that looks up the key in the relevant index table.
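A minimal sketch of that lookup, assuming the StringIndex table above has already been read into memory as a vector sorted by key. The IndexEntry struct and the LookupRow function are hypothetical names, and a plain row number stands in for the HDF5 reference.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Assumed in-memory form of one StringIndex row: the key, plus the row in
// MainDataset that the HDF5 reference would point to.
struct IndexEntry
{
    std::string key;
    std::size_t main_row;
};

// Binary search over an index sorted by key.
// Returns true and sets rRow if the key is present, false otherwise.
bool LookupRow(const std::vector<IndexEntry>& rIndex,
               const std::string& rKey,
               std::size_t& rRow)
{
    auto it = std::lower_bound(
        rIndex.begin(), rIndex.end(), rKey,
        [](const IndexEntry& rEntry, const std::string& rValue)
        { return rEntry.key < rValue; });
    if (it == rIndex.end() || it->key != rKey)
    {
        return false;  // key not in the index
    }
    rRow = it->main_row;
    return true;
}
```

The same function would work for IntIndex if the key type were changed to an integer. The lookup takes O(log n) comparisons, but only if the index really is sorted on the key being searched.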
In-memory index:
Depending on the size of the dataset, it may be just as easy to use an in-memory index that is read from and written to its own dataset using something like Boost.Serialization.
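One way this could look, as a sketch only: a hash map from identifier to row number, built from identifiers that are assumed to have been read from the file already. The function name BuildIndex is hypothetical, and the step that serializes the map back into its own dataset is left out.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Builds an identifier -> row number map from identifiers that were
// (by assumption) already read from the main dataset, in row order.
std::unordered_map<std::string, std::size_t>
BuildIndex(const std::vector<std::string>& rIdentifiers)
{
    std::unordered_map<std::string, std::size_t> index;
    index.reserve(rIdentifiers.size());
    for (std::size_t row = 0; row < rIdentifiers.size(); ++row)
    {
        // emplace does not overwrite, so a repeated identifier keeps its first row
        index.emplace(rIdentifiers[row], row);
    }
    return index;
}
```

Checking whether an identifier is present is then a single call such as index.count("B"), at the cost of holding every identifier in memory.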
HDF5-FastQuery:
This paper (and also this page) describes the use of bitmap indices to perform complex queries over an HDF5 dataset. I have not tried this.
H5Lexists was introduced for this in HDF5 1.8.0:
http://www.hdfgroup.org/HDF5/doc/RM/RM_H5L.html#Link-Exists
You can also iterate over the things that are in an HDF5 file with H5Literate:
http://www.hdfgroup.org/HDF5/doc/RM/RM_H5L.html#Link-Iterate
With earlier versions you can check manually by trying to open the dataset. We use code like this to deal with any version of HDF5:
bool DoesDatasetExist(const std::string& rDatasetName)
{
#if H5_VERS_MAJOR > 1 || (H5_VERS_MAJOR == 1 && H5_VERS_MINOR >= 8)
    // H5Lexists, introduced in HDF5 1.8.0, tests whether a link with this name exists.
    // It returns a positive value if it does, zero if not, and a negative value on error.
    htri_t dataset_status = H5Lexists(mFileId, rDatasetName.c_str(), H5P_DEFAULT);
    return (dataset_status > 0);
#else
    bool result = false;
    // This is not a nice way of doing it, because a failed open makes the error stack
    // produce a load of 'HDF failed' output.
    // The "TRY" macros are a convenient way to temporarily turn the error stack off.
    H5E_BEGIN_TRY
    {
        hid_t dataset_id = H5Dopen(mFileId, rDatasetName.c_str());
        if (dataset_id >= 0) // valid identifiers are non-negative
        {
            H5Dclose(dataset_id);
            result = true;
        }
    }
    H5E_END_TRY;
    return result;
#endif
}