Incremental writes to hdf5 with h5py

Tags: python, hdf5, h5py

I have a question about how best to write to HDF5 files with Python / h5py.

I have data like:

-----------------------------------------
| timepoint | voltage1 | voltage2 | ...
-----------------------------------------
| 178       | 10       | 12       | ...
-----------------------------------------
| 179       | 12       | 11       | ...
-----------------------------------------
| 185       | 9        | 12       | ...
-----------------------------------------
| 187       | 15       | 12       | ...
                  ...

with about 10^4 columns and about 10^7 rows. (That's about 10^11 (100 billion) elements, or ~100 GB with 1-byte ints.)

With this data, typical use is pretty much write once, read many times, and the typical read case would be to grab column 1 and another column (say 254), load both columns into memory, and do some fancy statistics.

I think a good HDF5 structure would thus be to have each column in the table above be an HDF5 group, resulting in 10^4 groups. That way we won't need to read all the data into memory, yes? The HDF5 structure isn't defined yet, though, so it can be anything.
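For concreteness, a minimal sketch of the read pattern I have in mind, assuming each column ends up as its own 1-D dataset (the file and dataset names here are just placeholders):

    import h5py
    import numpy as np

    # Grab two columns only; nothing else needs to be read from disk.
    with h5py.File('voltages.h5', 'r') as f:    # placeholder file name
        timepoint = f['timepoint'][:]           # column 1, loaded into memory
        voltage254 = f['voltage254'][:]         # one other column
    print(np.corrcoef(timepoint, voltage254))   # some fancy statistics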

Now the question: I receive the data ~10^4 rows at a time (and not exactly the same number of rows each time), and need to write it incrementally to the HDF5 file. How do I write that file?

I'm considering Python and h5py, but could use another tool if recommended. Is chunking the way to go, with e.g.

dset = f.create_dataset("voltage284", (100000,), maxshape=(None,), dtype='i8', chunks=(10000,)) 

and then when another block of 10^4 rows arrives, replace the dataset?

Or is it better to just store each block of 10^4 rows as a separate dataset? Or do I really need to know the final number of rows? (That'll be tricky to get, but maybe possible).
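For comparison, a rough sketch of what the "separate dataset per block" option might look like (all names here are made up); reading a full column back then means concatenating the block datasets in order:

    import h5py
    import numpy as np

    # Hypothetical incoming blocks of one column, not all the same length.
    blocks = [np.random.randint(0, 100, size=n) for n in (10000, 9500, 10200)]

    with h5py.File('/tmp/voltages_blocks.h5', 'w') as f:
        for i, block in enumerate(blocks):
            # One dataset per incoming block, grouped under the column name.
            f.create_dataset('voltage284/block%05d' % i, data=block)

    # Reading the whole column back requires stitching the blocks together.
    with h5py.File('/tmp/voltages_blocks.h5', 'r') as f:
        grp = f['voltage284']
        column = np.concatenate([grp[name][:] for name in sorted(grp)])
    print(column.shape)   # (29700,)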

I can also bail on HDF5 if it's not the right tool for the job, though I think that once the awkward writes are done, it'll be wonderful.

asked Sep 04 '14 by user116293



1 Answer

Per the FAQ, you can expand the dataset using dset.resize. For example,

import os
import h5py
import numpy as np

path = '/tmp/out.h5'
if os.path.exists(path):
    os.remove(path)          # start from a fresh file

with h5py.File(path, "a") as f:
    # Resizable (maxshape=(None,)) chunked dataset, initially 10**5 rows.
    dset = f.create_dataset('voltage284', (10**5,), maxshape=(None,),
                            dtype='i8', chunks=(10**4,))
    dset[:] = np.random.random(dset.shape)
    print(dset.shape)
    # (100000,)

    # Append three more blocks of 10**4 rows each.
    for i in range(3):
        dset.resize(dset.shape[0] + 10**4, axis=0)
        dset[-10**4:] = np.random.random(10**4)
        print(dset.shape)
        # (110000,)
        # (120000,)
        # (130000,)
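Applied to the setup in the question (blocks of roughly 10^4 rows arriving one at a time, final row count unknown), the same resize-then-assign pattern can start from an empty dataset. This is just a sketch, with made-up names ('voltages.h5', 'voltage284') and a hypothetical append_block helper:

    import h5py
    import numpy as np

    def append_block(dset, block):
        # Grow the dataset by len(block) rows, then write the new rows at the end.
        n = dset.shape[0]
        dset.resize(n + len(block), axis=0)
        dset[n:] = block

    with h5py.File('/tmp/voltages.h5', 'w') as f:
        # One resizable, chunked dataset per column; start empty because the
        # final number of rows is not known in advance.
        dset = f.create_dataset('voltage284', (0,), maxshape=(None,),
                                dtype='i8', chunks=(10**4,))
        for _ in range(5):
            block = np.random.randint(0, 100, size=9500)   # blocks need not all be the same size
            append_block(dset, block)
        print(dset.shape)   # (47500,)

Each column would get its own such dataset, so a later read of, say, column 1 plus column 254 only touches the chunks of those two datasets.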
answered Sep 29 '22 by unutbu