Fastest way to write HDF5 files with Python?

Given a large (10s of GB) CSV file of mixed text/numbers, what is the fastest way to create an HDF5 file with the same content, while keeping the memory usage reasonable?

I'd like to use the h5py module if possible.

In the toy example below, I've found an incredibly slow way and an incredibly fast way to write data to HDF5. Would it be best practice to write to HDF5 in chunks of 10,000 rows or so? Or is there a better way to write a massive amount of data to such a file?

    import h5py

    n = 10000000
    f = h5py.File('foo.h5', 'w')
    dset = f.create_dataset('int', (n,), 'i')

    # this is terribly slow
    for i in range(n):
        dset[i] = i

    # instantaneous
    dset[...] = 42

    f.close()
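For comparison, a middle ground between the two extremes above is to assign one block of rows per call. This is a minimal sketch, assuming NumPy is available; the function name `write_in_blocks` and the block size of 10,000 are illustrative choices, not measured recommendations.

```python
import h5py
import numpy as np


def write_in_blocks(path, n, block_size=10_000):
    """Fill a 1-D int32 dataset of length n, one slice assignment per block."""
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("int", (n,), dtype="i4")
        for start in range(0, n, block_size):
            stop = min(start + block_size, n)
            # One slice assignment writes the whole block in a single call,
            # instead of one HDF5 write per element.
            dset[start:stop] = np.arange(start, stop, dtype="i4")
```

With a real CSV, each block would be the next batch of parsed rows rather than a generated range, so only one block needs to be held in memory at a time.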

265

asked Mar 29 '11 01:03

Nicholas Palko

1 Answer

I would avoid chunking the data and would instead store it as a series of single-array datasets (along the lines of what Benjamin is suggesting). I just finished loading the output of an enterprise app I've been working on into HDF5, and was able to pack about 4.5 billion compound-datatype values into 450,000 datasets, each containing an array of 10,000 values. Writes and reads now seem fairly instantaneous, but were painfully slow when I initially tried to chunk the data.

Just a thought!

Update:

These are a couple of snippets lifted from my actual code (I'm coding in C rather than Python, but you should get the idea of what I'm doing) and modified for clarity. I'm just writing long unsigned integers into arrays (10,000 values per array) and reading them back when I need an actual value.

This is my typical writer code. In this case, I'm simply writing a sequence of long unsigned integers into a series of arrays and loading each array into HDF5 as it is created.

    // Requires <hdf5.h> and <glib.h> (for g_strdup_printf and g_free).

    // Our dummy data: a rolling count of long unsigned integers
    long unsigned int k = 0UL;
    // We'll use this to store our dummy data, 10,000 at a time
    long unsigned int kValues[NUMPERDATASET];
    // Create the SS data file.
    hid_t ssdb = H5Fcreate(SSHDF, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    // NUMPERDATASET = 10,000, so we get a 1 x 10,000 array
    hsize_t dsDim[1] = {NUMPERDATASET};
    // Create the data space.
    hid_t dSpace = H5Screate_simple(1, dsDim, NULL);
    // NUMDATASETS = MAXSSVALUE / NUMPERDATASET, where MAXSSVALUE = 4,500,000,000
    for (unsigned long int i = 0UL; i < NUMDATASETS; i++) {
        for (unsigned long int j = 0UL; j < NUMPERDATASET; j++) {
            kValues[j] = k;
            k += 1UL;
        }
        // Create the data set, named after its index.
        gchar *name = g_strdup_printf("%lu", i);
        hid_t dssSet = H5Dcreate2(ssdb, name, H5T_NATIVE_ULONG, dSpace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        g_free(name);  // g_strdup_printf allocates, so free the name
        // Write data to the data set.
        H5Dwrite(dssSet, H5T_NATIVE_ULONG, H5S_ALL, H5S_ALL, H5P_DEFAULT, kValues);
        // Close the data set.
        H5Dclose(dssSet);
    }
    // Release the data space.
    H5Sclose(dSpace);
    // Close the data file.
    H5Fclose(ssdb);
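A rough h5py counterpart of the writer might look like the following. It is a sketch under assumptions, not a port of the production code: the function name `write_sequence` is made up for illustration, NumPy is assumed, and `uint64` stands in for `H5T_NATIVE_ULONG`, whose width depends on the platform.

```python
import h5py
import numpy as np

NUM_PER_DATASET = 10_000  # values per dataset, as in the C code


def write_sequence(path, total, num_per_dataset=NUM_PER_DATASET):
    """Store 0..total-1 as datasets named "0", "1", ... of fixed length."""
    if total % num_per_dataset:
        raise ValueError("total must be a multiple of num_per_dataset")
    with h5py.File(path, "w") as f:
        for i in range(total // num_per_dataset):
            start = i * num_per_dataset
            values = np.arange(start, start + num_per_dataset, dtype=np.uint64)
            # Passing data= creates the dataset and writes it in one step.
            # No chunks, compression, or maxshape are requested, so h5py
            # uses its default contiguous (unchunked) layout.
            f.create_dataset(str(i), data=values)
```

Each dataset is written with a single call, which mirrors the one `H5Dwrite` per 10,000-value array in the C loop.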

This is a slightly modified version of my reader code. There are more elegant ways of doing this (e.g., I could use hyperslabs to get the value), but this was the cleanest solution with respect to my fairly disciplined Agile/BDD development process.

    unsigned long int getValueByIndex(unsigned long int nnValue) {
        // NUMPERDATASET = 10,000
        unsigned long int ssValue[NUMPERDATASET];
        // MAXSSVALUE = 4,500,000,000; i takes the smaller value of MAXSSVALUE-1 or nnValue
        // to avoid an index out of range error
        unsigned long int i = MIN(MAXSSVALUE-1, nnValue);
        // Open the data file in read-only mode.
        hid_t db = H5Fopen(_indexFilePath, H5F_ACC_RDONLY, H5P_DEFAULT);
        // Open the data set. Each dataset consists of an array of 10,000
        // unsigned long int and is named after the integer division of i
        // by the number per data set.
        gchar *name = g_strdup_printf("%lu", i / NUMPERDATASET);
        hid_t dSet = H5Dopen(db, name, H5P_DEFAULT);
        g_free(name);  // g_strdup_printf allocates, so free the name
        // Read the data set array.
        H5Dread(dSet, H5T_NATIVE_ULONG, H5S_ALL, H5S_ALL, H5P_DEFAULT, ssValue);
        // Close the data set.
        H5Dclose(dSet);
        // Close the data file.
        H5Fclose(db);
        // Return the indexed value by using the modulus of i divided by the number per dataset
        return ssValue[i % NUMPERDATASET];
    }
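The same lookup can be sketched in h5py. The function name `get_value_by_index` is hypothetical, and the file is assumed to contain datasets named "0", "1", ... of equal length, as produced by the writer code. Unlike the C reader, this version does not clamp the index and reads a single element rather than the whole array.

```python
import h5py

NUM_PER_DATASET = 10_000  # must match the value used when writing


def get_value_by_index(path, index, num_per_dataset=NUM_PER_DATASET):
    """Return the value stored at a global index."""
    # The quotient selects the dataset, the remainder the element inside it.
    dataset_index, offset = divmod(index, num_per_dataset)
    with h5py.File(path, "r") as f:
        # Indexing the dataset object reads only the requested element.
        return int(f[str(dataset_index)][offset])
```

An index past the last dataset raises `KeyError` here, so a caller that wants the C behaviour would clamp the index before calling.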

The main takeaway is the inner loop in the writing code, plus the integer division and mod operations that give the index of the dataset array and the index of the desired value within that array. Let me know if this is clear enough for you to put together something similar or better in h5py. In C, this is dead simple and gives me significantly better read/write times than a chunked-dataset solution. Plus, since I can't use compression with compound datasets anyway, the apparent upside of chunking is a moot point, so all my compounds are stored the same way.

190

answered Sep 20 '22 05:09

Marc
