I'd like to append a string to a one-dimensional HDF5 dataset. The following code appends doubles to the "doubles" dataset in test-doubles.h5 without problems, but when appending strings it segfaults in the dataset.write(str, string_type, mspace, fspace)
call:
#include <iostream>
#include <string>
#include "H5Cpp.h"
using namespace std;
const int RANK = 1;
H5::StrType string_type(H5::PredType::C_S1, H5T_VARIABLE);
void append_double(H5::DataSet &dataset, double value) {
    // dataspace
    hsize_t dims[RANK] = { 1 };
    hsize_t maxdims[RANK] = { H5S_UNLIMITED };
    H5::DataSpace mspace(RANK, dims, maxdims);
    H5::DataSpace space = dataset.getSpace();
    const hsize_t actual_dim = space.getSimpleExtentNpoints();
    // extend the dataset
    hsize_t new_size[RANK];
    new_size[0] = actual_dim + 1;
    dataset.extend(new_size);
    // select hyperslab.
    H5::DataSpace fspace = dataset.getSpace();
    hsize_t offset[RANK] = { actual_dim };
    hsize_t dims1[RANK] = { 1 };
    fspace.selectHyperslab(H5S_SELECT_SET, dims1, offset);
    dataset.write(&value, H5::PredType::NATIVE_DOUBLE, mspace, fspace);
}
void append_string(H5::DataSet &dataset, string value) {
    // dataspace
    hsize_t dims[RANK] = { 1 };
    hsize_t maxdims[RANK] = { H5S_UNLIMITED };
    H5::DataSpace mspace(RANK, dims, maxdims);
    H5::DataSpace space = dataset.getSpace();
    const hsize_t actual_dim = space.getSimpleExtentNpoints();
    // extend the dataset
    hsize_t new_size[RANK];
    new_size[0] = actual_dim + 1;
    dataset.extend(new_size);
    // select hyperslab.
    H5::DataSpace fspace = dataset.getSpace();
    hsize_t offset[RANK] = { actual_dim };
    hsize_t dims1[RANK] = { 1 };
    fspace.selectHyperslab(H5S_SELECT_SET, dims1, offset);
    const char *str = value.c_str();
    dataset.write(str, string_type, mspace, fspace);
}
int main(int argc, char *argv[]) {
    cout << "start" << endl;
    {
        H5::H5File h5_file("test-doubles.h5", H5F_ACC_TRUNC);
        // create data space with unlimited dimensions for doubles
        hsize_t doubles_dims[RANK] = { 0 };
        hsize_t doubles_maxdims[RANK] = { H5S_UNLIMITED };
        H5::DataSpace doubles_fspace(RANK, doubles_dims, doubles_maxdims);
        // enable chunking for this dataset
        H5::DSetCreatPropList cparms;
        hsize_t chunk_dims[RANK] = { 1 };
        cparms.setChunk(RANK, chunk_dims);
        // create dataset for doubles:
        H5::DataSet d_dataset = h5_file.createDataSet("doubles",
            H5::PredType::NATIVE_DOUBLE, doubles_fspace, cparms);
        // append values to this dataset:
        append_double(d_dataset, 1.0);
        append_double(d_dataset, 2.0);
        append_double(d_dataset, 3.0);
        cout << "doubles written." << endl;
    }
    {
        H5::H5File h5_file("test-strings.h5", H5F_ACC_TRUNC);
        // create data space with unlimited dimensions for strings
        hsize_t str_dims[RANK] = { 0 };
        hsize_t str_maxdims[RANK] = { H5S_UNLIMITED };
        H5::DataSpace str_fspace(RANK, str_dims, str_maxdims);
        // enable chunking for this dataset
        H5::DSetCreatPropList str_cparms;
        hsize_t str_chunk_dims[RANK] = { 1 };
        str_cparms.setChunk(RANK, str_chunk_dims);
        // create dataset for strings:
        H5::DataSet str_dataset = h5_file.createDataSet("strings", string_type, str_fspace, str_cparms);
        // append strings to this dataset:
        append_string(str_dataset, "test1");
        append_string(str_dataset, "test2");
        append_string(str_dataset, "test3");
        cout << "strings written." << endl;
    }
    cout << "all done." << endl;
    return 0;
}
Thanks a lot for your help!
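It all works if you replace
dataset.write(str, string_type, mspace, fspace);
with
dataset.write(&str, string_type, mspace, fspace);
For a variable-length string type, HDF5 expects the memory buffer to hold one char * pointer per element, so you have to pass the address of the pointer rather than the pointer to the characters themselves.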