
Loading hdf5 files into python xarrays

The Python module xarray has excellent support for loading/mapping netCDF files, even lazily with dask.

The data I have to work with consists of thousands of HDF5 files with lots of groups, datasets and attributes, all created with h5py.

The question is: how can I load (or, even better, lazily map with dask) HDF5 data (datasets, metadata, ...) into an xarray Dataset structure?

Does anybody have experience with this, or has anybody come across a similar issue? Thank you!

fmfreeze asked Feb 11 '19 11:02



1 Answer

One possible solution is to open the HDF5 file with netCDF4 in diskless, non-persistent mode:

import netCDF4
ncf = netCDF4.Dataset(hdf5file, diskless=True, persist=False)

Now you can inspect the file contents including groups.
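As a minimal sketch of that inspection step, the helper below walks a group hierarchy recursively. It only assumes that each group object exposes `.groups` and `.variables` mappings, as netCDF4 `Dataset` and `Group` objects do; the name `walk_groups` is made up for illustration.

```python
def walk_groups(group, path="/"):
    """Yield (path, variable_names) for a group and all of its sub-groups.

    `group` is expected to offer `.variables` and `.groups` mappings,
    which is the interface of netCDF4.Dataset and netCDF4.Group.
    """
    # Report the variables (HDF5 datasets) that live directly in this group.
    yield path, sorted(group.variables)
    # Recurse into every sub-group, extending the path as we go.
    for name, child in group.groups.items():
        child_path = path.rstrip("/") + "/" + name
        yield from walk_groups(child, child_path)


# Possible usage with the object opened above (not executed here):
# for group_path, names in walk_groups(ncf):
#     print(group_path, names)
```

The paths this produces can help decide which groups are worth opening with xarray.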

After that you can use xarray.backends.NetCDF4DataStore to open the wanted HDF5 groups (xarray can only handle one HDF5 group at a time):

import xarray

nch = ncf.groups.get('hdf5-name')
xds = xarray.open_dataset(xarray.backends.NetCDF4DataStore(nch))

This will give you a dataset xds with all attributes and variables (datasets) of the group hdf5-name. Note that you will not get access to sub-groups; you would need to open each sub-group by the same mechanism. If you want to use dask, you need to pass the keyword chunks with the wanted values to xarray.open_dataset.
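The sketch below shows one way to get a dask-backed dataset for a single group. It is an alternative to the diskless approach above, not the same method: it passes the file path and the `group` keyword directly to `xarray.open_dataset`. It assumes xarray, netCDF4 and dask are installed and that the HDF5 file is readable by the netCDF4 library; the function name `open_group_lazily` is hypothetical.

```python
def open_group_lazily(path, group, chunks=None):
    """Open one group of an HDF5/netCDF4 file as a dask-backed Dataset.

    path   -- file name of the HDF5 file
    group  -- group path inside the file, e.g. 'hdf5-name'
    chunks -- dask chunk sizes per dimension; an empty dict is used when
              nothing is given, which still yields dask arrays
    """
    # Imported here so the optional dependencies are only needed on call.
    import xarray

    if chunks is None:
        chunks = {}
    # `group` selects the HDF5 group; `chunks` switches on lazy dask arrays.
    return xarray.open_dataset(path, engine="netcdf4", group=group, chunks=chunks)
```

A sub-group would be addressed by its path, for example `open_group_lazily(hdf5file, 'hdf5-name/sub-name')`, where `sub-name` is a placeholder for a real sub-group name.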

There is no real automatic decoding of the data as there is for NetCDF files. If you have an integer-packed 2D variable (dataset) var with attributes gain and offset, you can add the NetCDF-specific attributes scale_factor and add_offset to the variable:

var = xds['var']
# map the file's own attribute names onto the names xarray's CF decoding looks for
var.attrs['scale_factor'] = var.attrs.get('gain')
var.attrs['add_offset'] = var.attrs.get('offset')
# decode the whole dataset again so the new attributes are applied
ds = xarray.decode_cf(xds)

This will decode your variable using the NetCDF (CF) decoding mechanisms.
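For reference, the CF unpacking rule behind scale_factor and add_offset is: unpacked = packed * scale_factor + add_offset. The following is a plain-Python illustration of that arithmetic only, with made-up values; it is not what xarray runs internally, and the name `unpack` is hypothetical.

```python
def unpack(packed_values, scale_factor=1.0, add_offset=0.0):
    """Apply the CF packing rule: unpacked = packed * scale_factor + add_offset."""
    return [value * scale_factor + add_offset for value in packed_values]


# Made-up packed integers with a gain of 0.5 and an offset of -32.0.
raw = [0, 10, 200]
print(unpack(raw, scale_factor=0.5, add_offset=-32.0))
# prints [-32.0, -27.0, 68.0]
```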

Additionally, you could give the extracted dimensions useful names (you will get something like phony_dim_0, phony_dim_1, ..., phony_dim_N) and assign new (as in the example) or existing variables/coordinates to those dimensions, to benefit from as much of the xarray machinery as possible:

var = xds['var']
var.attrs['scale_factor'] = var.attrs.get('gain')
var.attrs['add_offset'] = var.attrs.get('offset')
# replace the auto-generated phony_dim_* names with meaningful ones
dims = var.dims
xds['var'] = var.rename({dims[0]: 'x', dims[1]: 'y'})
# xvals/yvals (coordinate values) and xattrs/yattrs (attribute dicts) are placeholders you must supply
xds = xds.assign({'x': (['x'], xvals, xattrs)})
xds = xds.assign({'y': (['y'], yvals, yattrs)})
ds = xarray.decode_cf(xds)
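When many variables share the same phony dimensions, it can be handy to build the rename mapping once. The helper below is a small sketch in plain Python; `phony_dim_mapping` is a hypothetical name, and the comment at the end assumes that `Dataset.rename` accepts dimension names as keys.

```python
def phony_dim_mapping(dims, new_names):
    """Pair auto-generated dimension names with meaningful ones, in order."""
    if len(dims) != len(new_names):
        raise ValueError("need exactly one new name per dimension")
    return dict(zip(dims, new_names))


mapping = phony_dim_mapping(("phony_dim_0", "phony_dim_1"), ("x", "y"))
print(mapping)

# Possible usage on the dataset from above (not executed here):
# xds = xds.rename(phony_dim_mapping(xds['var'].dims, ('x', 'y')))
```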

References:

  • netCDF4 Dataset
  • xarray.backends.NetCDF4DataStore
  • xarray.decode_cf
kmuehlbauer answered Oct 26 '22 14:10
