So I have 3 netCDF4 files (each approx. 90 MB) that I would like to concatenate using the xarray package. Each file has one variable (dis) at 0.5 degree resolution (lat, lon) for 365 days (time). My aim is to concatenate the three files into a single time series of 1095 days (3 years).
Each file (for years 2007, 2008, 2009) has:
1 variable: dis
3 coordinates: time, lat, lon
...as such:
<xarray.Dataset>
Dimensions: (lat: 360, lon: 720, time: 365)
Coordinates:
* lon (lon) float32 -179.75 -179.25 -178.75 -178.25 -177.75 -177.25 ...
* lat (lat) float32 89.75 89.25 88.75 88.25 87.75 87.25 86.75 86.25 ...
* time (time) datetime64[ns] 2007-01-01 2007-01-02 2007-01-03 ...
Data variables:
dis (time, lat, lon) float64 nan nan nan nan nan nan nan nan nan ...
I import them and use the concat function to concatenate, I think successfully. In this case the code reads the three netCDF filenames out of filestrF:
import xarray as xr

flist1 = [1, 2, 3]
# filestrF holds the netCDF filenames; open each file, then join along time
ds_new = xr.concat([xr.open_dataset(filestrF[0, 1, 1, f]) for f in flist1], dim='time')
The details of the new dataset now show:
Dimensions: (lat: 360, lon: 720, time: 1095)
Seems fine to me. However, when I write this dataset back to netCDF, the file size explodes, with one year of data now seemingly taking around 700 MB.
ds_new.to_netcdf('saved_on_disk1.nc')
I would have expected 3 x 90 MB = 270 MB, since we are only scaling (3x) in one dimension (time); the variable dis and the other dimensions, lat and lon, remain constant in size.
Any ideas for the huge increase in size? I have tested reading in and writing back out a single file without concatenation, and that works with no increase in size.
The netCDF files you started with are compressed, probably using netCDF4's chunk-wise compression feature.
When you read a single dataset and write it back to disk, xarray writes the data back with the same compression settings. But when you combine multiple files, the compression settings are reset. Part of the reason for this is that different files may be compressed on disk in different ways, so it isn't obvious how the combined result should be handled.
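You can see this directly by comparing the encoding attribute before and after concatenation (a minimal sketch; the filenames here are placeholders for your own):

import xarray as xr

# One source file: its variable carries the on-disk compression settings
src = xr.open_dataset('dis_2007.nc')           # placeholder filename
print(src['dis'].encoding)                     # e.g. {'zlib': True, 'complevel': 4, ...}

# The concatenated result starts with an empty encoding,
# so to_netcdf() writes it uncompressed by default
files = ['dis_2007.nc', 'dis_2008.nc', 'dis_2009.nc']
ds_new = xr.concat([xr.open_dataset(f) for f in files], dim='time')
print(ds_new['dis'].encoding)                  # {}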
To save the new netCDF file with compression, use the encoding argument, as described in the xarray docs:
ds_new.to_netcdf('saved_on_disk1.nc', encoding={'dis': {'zlib': True}})
You will probably also want to manually specify the chunksizes argument based on your expected access patterns for the data.
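For example, a sketch that compresses dis and chunks it one time step at a time (the complevel and chunk shape are assumptions to tune for your own access patterns):

encoding = {
    'dis': {
        'zlib': True,                  # enable zlib compression
        'complevel': 4,                # assumed compression level (1-9)
        'chunksizes': (1, 360, 720),   # one full (lat, lon) grid per chunk
    }
}
ds_new.to_netcdf('saved_on_disk1.nc', encoding=encoding)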
If you're curious how these files were compressed originally, you can pull that information out of the encoding attribute, e.g., xr.open_dataset(filestrF[0,1,1,1]).dis.encoding.
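Building on that, one option (a sketch, assuming all three source files share the same settings) is to copy the compression-related keys from a source file into the encoding for the combined write:

# Reuse the original compression settings when writing the combined file.
# The list of keys to keep is an assumption; extend it if your files use others.
src = xr.open_dataset(filestrF[0, 1, 1, 1])
keep = ('zlib', 'complevel', 'shuffle', 'fletcher32', 'chunksizes')
enc = {k: v for k, v in src['dis'].encoding.items() if k in keep}

ds_new.to_netcdf('saved_on_disk1.nc', encoding={'dis': enc})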
Presuming that time is the record dimension, try using NCO's ncrcat to quickly concatenate the three files; it should preserve compression.
ncrcat file1.nc file2.nc file3.nc -O concat.nc