
Writing & Appending arrays of float to the only dataset in hdf5 file in C++

I am processing a number of files. Processing each file outputs several thousand arrays of floats, and I want to store the data from all files in one huge dataset in a single HDF5 file for further processing.

Currently I am confused about how to append my data to the HDF5 file (see the comment in the code above). In the 2 for loops above, as you can see, I want to append one 1-dimensional array of floats to the HDF5 file at a time, not the whole thing at once. My data is in the terabyte range, so we can only append the data to the file.

There are several questions:

  1. How to append the data in this case? What kind of function must I use?
  2. Right now, I have fdim[0] = 928347543. I have tried putting the infinity flag of HDF5 in, but the runtime execution complains. Is there a way to do this? I don't want to calculate the amount of data that I have each time; is there a way to simply keep on adding data, without caring about the value of fdim?

Or is this not possible?

EDIT:

I've been following Simon's suggestion, and currently here is my updated code:

hid_t desFi5;
hid_t fid1;
hid_t propList;
hsize_t fdim[2];

desFi5 = H5Fcreate(saveFilePath, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

fdim[0] = 3;
fdim[1] = 1;//H5S_UNLIMITED;

fid1 = H5Screate_simple(2, fdim, NULL);

cout << "----------------------------------Space done\n";

propList = H5Pcreate( H5P_DATASET_CREATE);

H5Pset_layout( propList, H5D_CHUNKED );

int ndims = 2;
hsize_t chunk_dims[2];
chunk_dims[0] = 3;
chunk_dims[1] = 1;

H5Pset_chunk( propList, ndims, chunk_dims );

cout << "----------------------------------Property done\n";

hid_t dataset1 = H5Dcreate( desFi5, "des", H5T_NATIVE_FLOAT, fid1, H5P_DEFAULT, propList, H5P_DEFAULT);

cout << "----------------------------------Dataset done\n";

bufi = new float*[1];
bufi[0] = new float[3];
bufi[0][0] = 0;
bufi[0][1] = 1;
bufi[0][2] = 2;

//hyperslab
hsize_t start[2] = {0,0};
hsize_t stride[2] = {1,1};
hsize_t count[2] = {1,1};
hsize_t block[2] = {1,3};

H5Sselect_hyperslab( fid1, H5S_SELECT_OR, start, stride, count, block);     
cout << "----------------------------------hyperslab done\n";   

H5Dwrite(dataset1, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, *bufi);

fdim[0] = 3;
fdim[1] = H5S_UNLIMITED;    // COMPLAINS HERE
H5Dset_extent( dataset1, fdim );

cout << "----------------------------------extent done\n";

//hyperslab2
hsize_t start2[2] = {1,0};
hsize_t stride2[2] = {1,1};
hsize_t count2[2] = {1,1};
hsize_t block2[2] = {1,3};

H5Sselect_hyperslab( fid1, H5S_SELECT_OR, start2, stride2, count2, block2);     
cout << "----------------------------------hyperslab2 done\n";  

H5Dwrite(dataset1, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, *bufi);

cout << "----------------------------------H5Dwrite done\n";        
H5Dclose(dataset1);
cout << "----------------------------------dataset closed\n";   
H5Pclose( propList );   
cout << "----------------------------------property list closed\n"; 
H5Sclose(fid1); 
cout << "----------------------------------dataspace fid1 closed\n";    
H5Fclose(desFi5);       
cout << "----------------------------------desFi5 closed\n";    

My current output is:

bash-3.2$ ./hdf5AppendTest.out
----------------------------------Space done
----------------------------------Property done
----------------------------------Dataset done
----------------------------------hyperslab done
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
  #000: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5D.c line 1103 in H5Dset_extent(): unable to set extend dataset
    major: Dataset
    minor: Unable to initialize object
  #001: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5Dint.c line 2179 in H5D__set_extent(): unable to modify size of data space
    major: Dataset
    minor: Unable to initialize object
  #002: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5S.c line 1874 in H5S_set_extent(): dimension cannot exceed the existing maximal size (new: 18446744073709551615 max: 1)
    major: Dataspace
    minor: Bad value
----------------------------------extent done
----------------------------------hyperslab2 done
----------------------------------H5Dwrite done
----------------------------------dataset closed
----------------------------------property list closed
----------------------------------dataspace fid1 closed
----------------------------------desFi5 closed

Currently, I see that passing H5S_UNLIMITED to H5Dset_extent still causes a problem at runtime (the problematic call is marked with // COMPLAINS HERE in the code above). I already have a chunked dataset as specified by Simon, so what's the problem here?

On the other hand, without H5Dset_extent, I can write a test array of [0, 1, 2] just fine, but how can we make the code above output the test array to the file like this:

[0, 1, 2]
[0, 1, 2]
[0, 1, 2]
[0, 1, 2]
...
...

Recall: this is just a test array. The real data is bigger, and I cannot hold the whole thing in RAM, so I must write the data part by part, one piece at a time.

EDIT 2:

I've followed more of Simon's suggestion. Here is the critical part:

hsize_t n = 3, p = 1;
float *bufi_data = new float[n * p];
float ** bufi = new float*[n];
for (hsize_t i = 0; i < n; ++i){
    bufi[i] = &bufi_data[i * n];
}

bufi[0][0] = 0.1;
bufi[0][1] = 0.2;
bufi[0][2] = 0.3;

//hyperslab
hsize_t start[2] = {0,0};
hsize_t count[2] = {3,1};

H5Sselect_hyperslab( fid1, H5S_SELECT_SET, start, NULL, count, NULL);
cout << "----------------------------------hyperslab done\n";   

H5Dwrite(dataset1, H5T_NATIVE_FLOAT, H5S_ALL, fid1, H5P_DEFAULT, *bufi);

bufi[0][0] = 0.4;
bufi[0][1] = 0.5;
bufi[0][2] = 0.6;

hsize_t fdimNew[2];
fdimNew[0] = 3;
fdimNew[1] = 2;
H5Dset_extent( dataset1, fdimNew );

cout << "----------------------------------extent done\n";

//hyperslab2
hsize_t start2[2] = {0,0}; //PROBLEM
hsize_t count2[2] = {3,1};

H5Sselect_hyperslab( fid1, H5S_SELECT_SET, start2, NULL, count2, NULL);     
cout << "----------------------------------hyperslab2 done\n";  

H5Dwrite(dataset1, H5T_NATIVE_FLOAT, H5S_ALL, fid1, H5P_DEFAULT, *bufi);

From the above, I got the following output for hdf5:

0.4 0.5 0.6
  0   0   0

After further experiments with start2 and count2, I see that these variables only affect the starting index and the increment for bufi. They do not move the write position in my dataset at all.

Recall: the final result must be:

0.1 0.2 0.3
0.4 0.5 0.6

Also, it must be *bufi instead of bufi for H5Dwrite, Simon, because bufi gives me completely random numbers.

UPDATE 3:

For the selection part suggested by Simon:

hsize_t start[2] = {0, 0};
hsize_t count[2] = {1, 3};

hsize_t start[2] = {1, 0};
hsize_t count[2] = {1, 3};

These will give out the following error:

HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
  #000: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5Dio.c line 245 in H5Dwrite(): file selection+offset not within extent
    major: Dataspace
    minor: Out of range

count[2] should be {3,1}, rather than {1,3}, I suppose? And for start[2], if I don't set it as {0,0}, it will always yell out the error above.

Are you sure this is correct?

asked Mar 13 '13 07:03 by Karl



1 Answer

How to append the data in this case? What kind of function must I use?

You must use hyperslabs. That's what you need to write only part of a dataset. The function to do that is H5Sselect_hyperslab. Use it on fid1 and use fid1 as your file dataspace in your H5Dwrite call.

I have tried putting the infinity flag of HDF5 in, but the runtime execution complains.

You need to create a chunked dataset in order to be able to set its maximum size to infinity. Create a dataset creation property list and use H5Pset_layout to make it chunked. Use H5Pset_chunk to set the chunk size. Then create your dataset using this property list.
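
The error in the question's log (new: 18446744073709551615 max: 1) follows from how the unlimited flag is represented: H5S_UNLIMITED is the largest hsize_t value, and it is meant for the maximum dimensions given when the dataspace is created, not for the new size passed to H5Dset_extent. The sketch below mimics that size check in plain C++ so that it compiles without HDF5. The names dim_t, kUnlimited and fits_max are hypothetical stand-ins for this illustration, not library names.

```cpp
#include <cstdint>

// Stand-in for hsize_t so the sketch compiles without the HDF5 headers.
using dim_t = std::uint64_t;

// Same bit pattern as the "new" value printed in the error log.
const dim_t kUnlimited = ~dim_t(0);

// True if a requested dimension size is allowed under a maximum size.
bool fits_max(dim_t requested, dim_t max_size)
{
    if (max_size == kUnlimited) {
        return true;  // an unlimited maximum accepts any real size
    }
    return requested <= max_size;
}
```

With a maximum of 1, as in the question, fits_max(kUnlimited, 1) is false, which corresponds to the "dimension cannot exceed the existing maximal size" message. With an unlimited maximum, any concrete size such as 2 or 5 passes.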

I don't want to calculate the amount of data that I have each time; is there a way to simply keep on adding data, without caring about the value of fdim?

You can do two things:

  1. Precompute the final size so you can create a dataset big enough. It looks like that's what you are doing.

  2. Extend your dataset as you go using H5Dset_extent. For this you need to set the maximum dimensions to infinity so you need a chunked dataset (see above).

In both cases, you need to select a hyperslab on the file dataspace in your H5Dwrite call (see above).

Walkthrough working code

#include <iostream>
#include <hdf5.h>

// Constants
const char saveFilePath[] = "test.h5";
const hsize_t ndims = 2;
const hsize_t ncols = 3;

int main()
{

First, create an HDF5 file.

    hid_t file = H5Fcreate(saveFilePath, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    std::cout << "- File created" << std::endl;

Then create a 2D dataspace. The size of the first dimension is unlimited. We set it initially to 0 to show how you can extend the dataset at each step. You could also set it to the size of the first buffer you are going to write for instance. The size of the second dimension is fixed.

    hsize_t dims[ndims] = {0, ncols};
    hsize_t max_dims[ndims] = {H5S_UNLIMITED, ncols};
    hid_t file_space = H5Screate_simple(ndims, dims, max_dims);
    std::cout << "- Dataspace created" << std::endl;

Then create a dataset creation property list. The layout of the dataset has to be chunked when using unlimited dimensions. The choice of the chunk size affects performance, both in time and disk space. If the chunks are very small, you will have a lot of overhead. If they are too large, you might allocate space that you don't need and your files might end up being too large. This is a toy example so we will choose chunks of two lines.

    hid_t plist = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_layout(plist, H5D_CHUNKED);
    hsize_t chunk_dims[ndims] = {2, ncols};
    H5Pset_chunk(plist, ndims, chunk_dims);
    std::cout << "- Property list created" << std::endl;
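
As a rough illustration of the chunk size trade-off described above, the number of lines per chunk can be derived from a target chunk size in bytes. The helper name rows_per_chunk and the 1 MiB target used in the note below are assumptions for this sketch, not values taken from HDF5 or from this answer. Tune the target to your own access pattern.

```cpp
#include <cstddef>

// Hypothetical helper: how many rows of `ncols` floats fit in a chunk of
// roughly `target_bytes` bytes. Always returns at least 1, because chunk
// dimensions must be positive.
std::size_t rows_per_chunk(std::size_t ncols, std::size_t target_bytes)
{
    const std::size_t row_bytes = ncols * sizeof(float);
    if (row_bytes == 0) {
        return 1;  // degenerate input, avoid dividing by zero
    }
    const std::size_t rows = target_bytes / row_bytes;
    return rows > 0 ? rows : 1;
}
```

With ncols = 3, 4-byte floats and a target of 1048576 bytes, this gives 87381 lines, which would be the value for chunk_dims[0].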

Create the dataset.

    hid_t dset = H5Dcreate(file, "dset1", H5T_NATIVE_FLOAT, file_space, H5P_DEFAULT, plist, H5P_DEFAULT);
    std::cout << "- Dataset 'dset1' created" << std::endl;

Close resources. The dataset is now created, so we don't need the property list anymore. We don't need the file dataspace either: once the dataset is extended, this handle becomes stale because it still holds the previous extent. So we will have to grab the updated file dataspace anyway.

    H5Pclose(plist);
    H5Sclose(file_space);

We will now append two buffers to the end of the dataset. The first one will be two lines long. The second one will be three lines long.

First buffer

We create a 2D buffer (contiguous in memory, row-major order). We allocate enough memory to store 3 lines, so we can reuse the buffer. Let us create an array of pointers so we can use the b[i][j] notation instead of buffer[i * ncols + j]. This is purely cosmetic.

    hsize_t nlines = 3;
    float *buffer = new float[nlines * ncols];
    float **b = new float*[nlines];
    for (hsize_t i = 0; i < nlines; ++i){
        b[i] = &buffer[i * ncols];
    }
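
An alternative sketch, not part of the walkthrough itself: a std::vector<float> plus a small index helper gives the same contiguous row-major layout without manual new and delete[]. The struct name RowBuffer and its members are hypothetical. The pointer returned by data() would be passed to H5Dwrite in place of buffer.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical wrapper around a contiguous, row-major 2D float buffer.
struct RowBuffer {
    std::size_t nrows;
    std::size_t ncols;
    std::vector<float> storage;

    RowBuffer(std::size_t rows, std::size_t cols)
        : nrows(rows), ncols(cols), storage(rows * cols, 0.0f) {}

    // Element (i, j) lives at offset i * ncols + j, as in the text above.
    float& at(std::size_t i, std::size_t j) { return storage[i * ncols + j]; }

    // Pointer to the first element, suitable as a raw write buffer.
    float* data() { return storage.data(); }
};
```

For example, after RowBuffer rb(3, 3); rb.at(1, 2) = 0.6f; the value sits at rb.data()[5], the same position that b[1][2] addresses in the pointer-array version.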

Initial values in buffer to be written in the dataset:

    b[0][0] = 0.1;
    b[0][1] = 0.2;
    b[0][2] = 0.3;
    b[1][0] = 0.4;
    b[1][1] = 0.5;
    b[1][2] = 0.6;

We create a memory dataspace to indicate the size of our buffer in memory. Remember the first buffer is only two lines long.

    dims[0] = 2;
    dims[1] = ncols;
    hid_t mem_space = H5Screate_simple(ndims, dims, NULL);
    std::cout << "- Memory dataspace created" << std::endl;

We now need to extend the dataset. We set the initial size of the dataset to 0x3, so we need to extend it first. Note that we extend the dataset itself, not its dataspace. Remember the first buffer is only two lines long.

    dims[0] = 2;
    dims[1] = ncols;
    H5Dset_extent(dset, dims);
    std::cout << "- Dataset extended" << std::endl;

Select a hyperslab on the file dataspace of the dataset.

    file_space = H5Dget_space(dset);
    hsize_t start[2] = {0, 0};
    hsize_t count[2] = {2, ncols};
    H5Sselect_hyperslab(file_space, H5S_SELECT_SET, start, NULL, count, NULL);
    std::cout << "- First hyperslab selected" << std::endl;

Write buffer to dataset. mem_space and file_space should now have the same number of elements selected. Note that buffer and &b[0][0] are equivalent.

    H5Dwrite(dset, H5T_NATIVE_FLOAT, mem_space, file_space, H5P_DEFAULT, buffer);
    std::cout << "- First buffer written" << std::endl;

We can now close the file dataspace. We could close the memory dataspace now and create a new one for the second buffer, but we will simply update its size.

    H5Sclose(file_space);

Second buffer

New values in buffer to be appended to the dataset:

    b[0][0] = 1.1;
    b[0][1] = 1.2;
    b[0][2] = 1.3;
    b[1][0] = 1.4;
    b[1][1] = 1.5;
    b[1][2] = 1.6;
    b[2][0] = 1.7;
    b[2][1] = 1.8;
    b[2][2] = 1.9;

Resize the memory dataspace to indicate the new size of our buffer. The second buffer is three lines long.

    dims[0] = 3;
    dims[1] = ncols;
    H5Sset_extent_simple(mem_space, ndims, dims, NULL);
    std::cout << "- Memory dataspace resized" << std::endl;

Extend dataset. Note that in this simple example, we know that 2 + 3 = 5. In general, you could read the current extent from the file dataspace and add the desired number of lines to it.

    dims[0] = 5;
    dims[1] = ncols;
    H5Dset_extent(dset, dims);
    std::cout << "- Dataset extended" << std::endl;

Select hyperslab on file dataset. Again in this simple example, we know that 0 + 2 = 2. In general, you could read the current extent from the file dataspace. The second buffer is three lines long.

    file_space = H5Dget_space(dset);
    start[0] = 2;
    start[1] = 0;
    count[0] = 3;
    count[1] = ncols;
    H5Sselect_hyperslab(file_space, H5S_SELECT_SET, start, NULL, count, NULL);
    std::cout << "- Second hyperslab selected" << std::endl;
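
The arithmetic behind the last two steps (2 + 3 = 5 for the new extent, 0 + 2 = 2 for the start offset) can be written down once for any append. The sketch below is plain C++ with dim_t standing in for hsize_t so that it compiles without HDF5. AppendPlan and plan_append are hypothetical names. In real code, current_rows would come from H5Sget_simple_extent_dims applied to the dataspace returned by H5Dget_space.

```cpp
#include <cstdint>

// Stand-in for hsize_t so the sketch compiles without the HDF5 headers.
using dim_t = std::uint64_t;

// Hypothetical result type: the three arrays needed for one append.
struct AppendPlan {
    dim_t new_dims[2];  // new size, as passed to H5Dset_extent
    dim_t start[2];     // hyperslab offset in the file dataspace
    dim_t count[2];     // hyperslab size in the file dataspace
};

// Plan an append of `nrows` lines of `ncols` values each to a dataset
// that currently holds `current_rows` lines.
AppendPlan plan_append(dim_t current_rows, dim_t nrows, dim_t ncols)
{
    AppendPlan p;
    p.new_dims[0] = current_rows + nrows;
    p.new_dims[1] = ncols;
    p.start[0] = current_rows;  // first free line
    p.start[1] = 0;
    p.count[0] = nrows;
    p.count[1] = ncols;
    return p;
}
```

For the second buffer in this walkthrough, plan_append(2, 3, 3) yields new_dims = {5, 3}, start = {2, 0} and count = {3, 3}, matching the values set by hand above.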

Append buffer to dataset

    H5Dwrite(dset, H5T_NATIVE_FLOAT, mem_space, file_space, H5P_DEFAULT, buffer);
    std::cout << "- Second buffer written" << std::endl;

The end: let's close all the resources:

    delete[] b;
    delete[] buffer;
    H5Sclose(file_space);
    H5Sclose(mem_space);
    H5Dclose(dset);
    H5Fclose(file);
    std::cout << "- Resources released" << std::endl;
}
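
The walkthrough ignores return values to stay short. In the question's log, "extent done" was printed even though H5Dset_extent had failed, which is what happens when statuses are not checked. HDF5 C functions report failure with a negative return value, for both herr_t status codes and hid_t identifiers, so a single small wrapper can cover them. The helper name check and the choice to throw std::runtime_error are assumptions made for this sketch.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: pass through a non-negative HDF5 return value,
// throw if it is negative. `what` names the call for the error message.
template <typename T>
T check(T value, const char* what)
{
    if (value < 0) {
        throw std::runtime_error(std::string("HDF5 call failed: ") + what);
    }
    return value;
}
```

It would be used as hid_t file = check(H5Fcreate(saveFilePath, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT), "H5Fcreate"); and likewise around H5Dset_extent and H5Dwrite, so that a failed step stops the program instead of being followed by further writes.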

NB: I removed the previous updates because the answer was too long. If you are interested, browse the history.

answered Sep 23 '22 14:09 by Simon