The Python module xarray has excellent support for loading/mapping netCDF files, even lazily with dask.
The data I have to work with consists of thousands of HDF5 files with lots of groups, datasets and attributes, all created with h5py.
The question is: how can I load (or, even better, lazily map with dask) HDF5 data (datasets, metadata, ...) into an xarray dataset structure?
Has anybody had experience with this or come across a similar issue? Thank you!
One possible solution is to open the HDF5 file using netCDF4 in diskless, non-persistent mode:
import netCDF4
ncf = netCDF4.Dataset(hdf5file, diskless=True, persist=False)
Now you can inspect the file contents, including groups.
After that you can use xarray.backends.NetCDF4DataStore to open the wanted HDF5 groups (xarray can only get hold of one HDF5 group at a time):
import xarray
nch = ncf.groups.get('hdf5-name')  # None if the group does not exist
xds = xarray.open_dataset(xarray.backends.NetCDF4DataStore(nch))
This will give you a dataset xds with all attributes and variables (datasets) of the group hdf5-name. Note that you will not get access to sub-groups; you would need to open each sub-group by the same mechanism. If you want to use dask, you would need to add the keyword chunks with the wanted values to the xarray.open_dataset call.
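Since every sub-group has to be opened separately, the mechanism above can be wrapped in a small recursive helper. This is only a sketch: walk_groups and open_all_groups are hypothetical names, not part of xarray or netCDF4, and it assumes ncf is the netCDF4.Dataset opened above. The chunks argument is simply passed through to xarray.open_dataset.

```python
def walk_groups(group, path=""):
    """Yield (path, group) for every sub-group below `group`, depth first.

    Works on any object exposing a `groups` mapping, such as
    netCDF4.Dataset and netCDF4.Group.
    """
    for name, child in group.groups.items():
        child_path = f"{path}/{name}"
        yield child_path, child
        yield from walk_groups(child, child_path)


def open_all_groups(ncf, chunks=None):
    """Open every sub-group of `ncf` as its own xarray.Dataset.

    Returns a dict mapping group path -> dataset; the root group itself
    is not included. A non-None `chunks` asks xarray for dask-backed arrays.
    """
    import xarray  # local import: walk_groups itself does not need xarray

    return {
        path: xarray.open_dataset(
            xarray.backends.NetCDF4DataStore(grp), chunks=chunks
        )
        for path, grp in walk_groups(ncf)
    }
```

Calling open_all_groups(ncf, chunks={}) would then return one lazily loaded dataset per group path, for example under a key such as '/hdf5-name'.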
There is no (real) automatic decoding of the data as there would be for NetCDF files. If you have an integer-compressed 2D variable (dataset) var with some attributes gain and offset, you can add the NetCDF-specific attributes scale_factor and add_offset to the variable:
var = xds['var']
var.attrs['scale_factor'] = var.attrs.get('gain')
var.attrs['add_offset'] = var.attrs.get('offset')
ds = xarray.decode_cf(xds)
This will decode your variable using the netCDF mechanisms.
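To make clear what this decoding amounts to, here is a sketch of the arithmetic behind scale_factor and add_offset. It is not xarray's implementation, the function name unpack and the sample numbers are made up for illustration, and it assumes that gain and offset in your files follow the same linear convention (value = raw * gain + offset). If they do not, the attributes cannot be copied over unchanged.

```python
def unpack(raw_values, scale_factor=1.0, add_offset=0.0):
    """Apply the unpacking rule: unpacked = raw * scale_factor + add_offset."""
    return [raw * scale_factor + add_offset for raw in raw_values]


# Hypothetical packed integers stored with gain=0.5 and offset=-1.0
raw = [0, 10, 200]
decoded = unpack(raw, scale_factor=0.5, add_offset=-1.0)
print(decoded)  # [-1.0, 4.0, 99.0]
```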
Additionally, you could give the extracted dimensions useful names (you will get something like phony_dim_0, phony_dim_1, ..., phony_dim_N) and assign new (as in the example) or existing variables/coordinates to those dimensions to make use of as much of the xarray machinery as possible:
var = xds['var']
var.attrs['scale_factor'] = var.attrs.get('gain')
var.attrs['add_offset'] = var.attrs.get('offset')
dims = var.dims
xds['var'] = var.rename({dims[0]: 'x', dims[1]: 'y'})
# xvals/yvals are the coordinate values, xattrs/yattrs their attribute dicts (placeholders)
xds = xds.assign({'x': (['x'], xvals, xattrs)})
xds = xds.assign({'y': (['y'], yvals, yattrs)})
ds = xarray.decode_cf(xds)
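If many variables need the same treatment, the rename mapping used above can be built by a small helper. The following is a minimal sketch; phony_dim_names is a hypothetical name, and the dimension names are just the ones from the example.

```python
def phony_dim_names(dims, new_names):
    """Map each dimension in `dims` to the name at the same position in `new_names`."""
    if len(dims) != len(new_names):
        raise ValueError(
            f"expected {len(dims)} new names, got {len(new_names)}"
        )
    return dict(zip(dims, new_names))


# For a 2D variable, var.dims would look like this tuple
mapping = phony_dim_names(("phony_dim_0", "phony_dim_1"), ("x", "y"))
print(mapping)  # {'phony_dim_0': 'x', 'phony_dim_1': 'y'}
```

The resulting dict can be passed to var.rename(...) as in the example above. The length check guards against silently dropping a dimension when the number of names does not match.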