
Python xarray.concat then xarray.to_netcdf generates huge new file size

So I have 3 netcdf4 files (each approx 90 MB), which I would like to concatenate using the package xarray. Each file has one variable (dis) represented at a 0.5 degree resolution (lat, lon) for 365 days (time). My aim is to concatenate the three files such that we have a timeseries of 1095 days (3 years).

Each file (for years 2007, 2008, 2009) has one variable (dis) and three coordinates (time, lat, lon), as shown:

<xarray.Dataset>
Dimensions:  (lat: 360, lon: 720, time: 365)
Coordinates:
  * lon      (lon) float32 -179.75 -179.25 -178.75 -178.25 -177.75 -177.25    ...
  * lat      (lat) float32 89.75 89.25 88.75 88.25 87.75 87.25 86.75 86.25 ...
  * time     (time) datetime64[ns] 2007-01-01 2007-01-02 2007-01-03 ...
Data variables:
    dis      (time, lat, lon) float64 nan nan nan nan nan nan nan nan nan ...

I import them and use the concat function to concatenate them, I think successfully. In this case the code reads the three netCDF filenames from filestrF:

import xarray as xr

flist1 = [1, 2, 3]
ds_new = xr.concat([xr.open_dataset(filestrF[0, 1, 1, f]) for f in flist1], dim='time')

The details of the new dataset now show:

Dimensions:  (lat: 360, lon: 720, time: 1095)

Seems fine to me. However, when I write this dataset back to a netcdf, the filesize has now exploded, with 1 year of data seemingly equivalent to 700 MB.

ds_new.to_netcdf('saved_on_disk1.nc')
  • For 2 concatenated files: ~1.5 GB
  • For 3 concatenated files: ~2.2 GB
  • For 4 concatenated files: ~2.9 GB

I would have expected roughly 3 x 90 MB = 270 MB, since we are scaling (3x) in only one dimension (time). The variable dis and the other dimensions, lat and lon, remain constant in size.

Any ideas about the huge increase in size? I have tested reading files in and writing them back out without concatenation, and the size does not increase.

asked May 19 '16 by dreab

2 Answers

The netCDF files you started with are compressed, probably using netCDF4's chunk-wise compression feature.

When you read a single dataset and write it back to disk, xarray writes the data back with the same compression settings. But when you combine multiple files, the compression settings are reset. Part of the reason is that different files may be compressed on disk in different ways, so it isn't obvious how the combined result should be handled.
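A quick back-of-envelope check (pure arithmetic on the dimensions from the question) shows the reported sizes match uncompressed float64 storage:

```python
# Uncompressed size of the float64 'dis' variable: 8 bytes per value.
lat, lon, days = 360, 720, 365

bytes_per_year = lat * lon * days * 8
print(round(bytes_per_year / 1e9, 2))      # GB per year     → 0.76
print(round(3 * bytes_per_year / 1e9, 2))  # GB for 3 years  → 2.27
```

That is ~760 MB per year and ~2.27 GB for three years, closely matching the ~700 MB/year and ~2.2 GB figures in the question.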

To save the new netCDF file with compression, use the encoding argument, as described in the xarray docs:

ds_new.to_netcdf('saved_on_disk1.nc', encoding={'dis': {'zlib': True}})

You will probably also want to manually specify the chunksizes argument based on your expected access patterns for the data.

If you're curious how these files were compressed originally, you can pull that information out from the encoding attribute, e.g., xr.open_dataset(filestrF[0,1,1,1]).dis.encoding.

answered Sep 29 '22 by shoyer

Presuming that time is the record dimension, try using NCO's ncrcat to quickly concatenate the three files; it should preserve the compression.

ncrcat file1.nc file2.nc file3.nc -O concat.nc

answered Sep 29 '22 by N1B4