So I have 3 netCDF4 files (each approx. 90 MB) that I would like to concatenate using the xarray package. Each file has one variable (dis) at 0.5 degree resolution (lat, lon) for 365 days (time). My aim is to concatenate the three files into a single time series of 1095 days (3 years).
Each file (for years 2007, 2008, 2009) has:
1 variable: dis
3 coordinates: time, lat, lon
...as such:
<xarray.Dataset>
Dimensions: (lat: 360, lon: 720, time: 365)
Coordinates:
* lon (lon) float32 -179.75 -179.25 -178.75 -178.25 -177.75 -177.25 ...
* lat (lat) float32 89.75 89.25 88.75 88.25 87.75 87.25 86.75 86.25 ...
* time (time) datetime64[ns] 2007-01-01 2007-01-02 2007-01-03 ...
Data variables:
dis (time, lat, lon) float64 nan nan nan nan nan nan nan nan nan ...
I import them and use the concat function to concatenate, I think successfully. In this case the code reads the three netCDF filenames out of filestrF:
import xarray as xr

flist1 = [1, 2, 3]
# filestrF holds the netCDF filenames; open each file, then join along time
ds_new = xr.concat([xr.open_dataset(filestrF[0, 1, 1, f]) for f in flist1], dim='time')
The details of the new dataset now show:
Dimensions: (lat: 360, lon: 720, time: 1095)
Seems fine to me. However, when I write this dataset back to netCDF, the file size explodes, with one year of data now seemingly taking around 700 MB.
ds_new.to_netcdf('saved_on_disk1.nc')
I would have expected 3 x 90 MB = 270 MB, since we are only scaling (3x) in one dimension (time); the variable dis and the other dimensions, lat and lon, remain constant in size.
Any ideas for the huge increase in size? I have tested reading in and writing back out a single file without concatenation, and that works with no increase in size.
The netCDF files you started with are compressed, probably using netCDF4's chunk-wise compression feature.
When you read a single dataset and write it back to disk, xarray writes the data back with the same compression settings. But when you combine multiple files, the compression settings are reset. Part of the reason for this is that different files may be compressed on disk in different ways, so it isn't obvious how the combined result should be handled.
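You can see this directly by comparing the encoding attribute before and after concatenation (a minimal sketch; the filenames here are placeholders for your own):

import xarray as xr

# One source file: its variable carries the on-disk compression settings
src = xr.open_dataset('dis_2007.nc')           # placeholder filename
print(src['dis'].encoding)                     # e.g. {'zlib': True, 'complevel': 4, ...}

# The concatenated result starts with an empty encoding,
# so to_netcdf() writes it uncompressed by default
files = ['dis_2007.nc', 'dis_2008.nc', 'dis_2009.nc']
ds_new = xr.concat([xr.open_dataset(f) for f in files], dim='time')
print(ds_new['dis'].encoding)                  # {}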
To save the new netCDF file with compression, use the encoding argument, as described in the xarray docs:
ds_new.to_netcdf('saved_on_disk1.nc', encoding={'dis': {'zlib': True}})
You will probably also want to manually specify the chunksizes argument based on your expected access patterns for the data.
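For example, a sketch that compresses dis and chunks it one time step at a time (the complevel and chunk shape are assumptions to tune for your own access patterns):

encoding = {
    'dis': {
        'zlib': True,                  # enable zlib compression
        'complevel': 4,                # assumed compression level (1-9)
        'chunksizes': (1, 360, 720),   # one full (lat, lon) grid per chunk
    }
}
ds_new.to_netcdf('saved_on_disk1.nc', encoding=encoding)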
If you're curious how these files were compressed originally, you can pull that information out of the encoding attribute, e.g., xr.open_dataset(filestrF[0,1,1,1]).dis.encoding.
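Building on that, one option (a sketch, assuming all three source files share the same settings) is to copy the compression-related keys from a source file into the encoding for the combined write:

# Reuse the original compression settings when writing the combined file.
# The list of keys to keep is an assumption; extend it if your files use others.
src = xr.open_dataset(filestrF[0, 1, 1, 1])
keep = ('zlib', 'complevel', 'shuffle', 'fletcher32', 'chunksizes')
enc = {k: v for k, v in src['dis'].encoding.items() if k in keep}

ds_new.to_netcdf('saved_on_disk1.nc', encoding={'dis': enc})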
Presuming that time is the record dimension, try using NCO's ncrcat to quickly concatenate the three files; it should preserve compression.
ncrcat file1.nc file2.nc file3.nc -O concat.nc