 

xarray writing to netCDF from Pandas - dimension issue

I'm learning how to produce netCDF files from pandas DataFrames using xarray. I've followed several tutorials and SO questions, such as Add 'constant' dimension to xarray Dataset, but I'm still having issues: I can't get Date_Time, lat and lon as dimensions. When I do a nc dump, they are not correct.

My initial approach imports a text file into a pandas DataFrame, converts it to an xarray Dataset, and writes that to netCDF:

import pandas as pd
import xarray

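# Import data from the text file
colnames1 = ['Date','Time','latitude','longitude','Status','depth']
# on_bad_lines='warn' and sep=r'\s+' replace the deprecated
# error_bad_lines=False and delim_whitespace=True arguments
df2 = pd.read_csv('test.txt', header=0, on_bad_lines='warn', names=colnames1, sep=r'\s+')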

# create xarray Dataset from pandas DataFrame
xr = xarray.Dataset.from_dataframe(df2)

# add variable attribute metadata
xr['latitude'].attrs={'units':'degrees', 'long_name':'Latitude'}
xr['longitude'].attrs={'units':'degrees', 'long_name':'Longitude'}
xr['depth'].attrs={'units':'m', 'long_name':'depth'}


# add global attribute metadata
xr.attrs={'Conventions':'CF-1.6', 'title':'Data', 'summary':'Data generated'}
print(xr)
# save to netCDF
xr.to_netcdf('test.nc')

where df2 =

Date            Time  grid_latitude  grid_longitude  Status  depth                                                                   
2017-09-05  13:01:59     -29.034083       31.068567     2.0    0.0   
2017-09-05  13:01:59     -29.039367       31.059150     2.0    0.0   
2017-09-05  13:01:59     -29.036650       31.059200     3.0    0.0   
2017-09-05  13:01:59     -29.036750       31.065417     7.0  100.0   
2017-09-05  13:01:59     -29.039317       31.056050     7.0  100.0   
2017-09-05  13:01:59     -29.034000       31.062367     3.0    0.0   
2017-09-05  13:01:59     -29.036517       31.049900     3.0    0.0   
2017-09-05  13:01:59     -29.031100       31.050000     3.0    0.0 

This works fine but the dimension is not correct (see below):

<xarray.Dataset>
Dimensions:    (index: 8)
Coordinates:
  * index      (index) int64 0 1 2 3 4 5 6 7
Data variables:
    Date       (index) object '2017-09-05' '2017-09-05' '2017-09-05' ...
    Time       (index) object '13:01:59' '13:01:59' '13:01:59' '13:01:59' ...
    latitude   (index) float64 -29.03 -29.04 -29.04 -29.04 -29.04 -29.03 ...
    longitude  (index) float64 31.07 31.06 31.06 31.07 31.06 31.06 31.05 31.05
    Status     (index) float64 2.0 2.0 3.0 7.0 7.0 3.0 3.0 3.0
    depth      (index) float64 0.0 0.0 0.0 100.0 100.0 0.0 0.0 0.0
Attributes:
    title: Data
    summary: Data generated
    Conventions: CF-1.6

If I set Date, or a merged Date_Time column, as the DataFrame index, it is correctly picked up as a dimension:

<xarray.Dataset>
Dimensions:    (Date: 8)
Coordinates:
  * Date       (Date) object '2017-09-05' '2017-09-05' '2017-09-05' ...
Data variables:
    Time       (Date) object '13:01:59' '13:01:59' '13:01:59' '13:01:59' ...
    latitude   (Date) float64 -29.03 -29.04 -29.04 -29.04 -29.04 -29.03 ...
    longitude  (Date) float64 31.07 31.06 31.06 31.07 31.06 31.06 31.05 31.05
    Status     (Date) float64 2.0 2.0 3.0 7.0 7.0 3.0 3.0 3.0
    depth      (Date) float64 0.0 0.0 0.0 100.0 100.0 0.0 0.0 0.0
Attributes:
    title: Data
    summary: Data generated
    Conventions: CF-1.6
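For reference, a merged Date_Time index like the one mentioned above could be built along these lines. This is only a sketch: the two rows are a cut-down stand-in for df2, and the second timestamp is invented so that the index values differ.

```python
import pandas as pd

# Small stand-in for df2; the second Time value is made up for illustration.
df2 = pd.DataFrame({
    'Date': ['2017-09-05', '2017-09-05'],
    'Time': ['13:01:59', '13:02:59'],
    'latitude': [-29.034083, -29.039367],
    'longitude': [31.068567, 31.059150],
    'Status': [2.0, 2.0],
    'depth': [0.0, 0.0],
})

# Join the two string columns and parse them as timestamps, so the
# coordinate holds datetime64 values instead of Python objects.
df2['Date_Time'] = pd.to_datetime(df2['Date'] + ' ' + df2['Time'])

# Use the merged column as the index; it becomes the Dataset's dimension.
ds = df2.drop(columns=['Date', 'Time']).set_index('Date_Time').to_xarray()
print(ds)
```

Parsing to real timestamps also lets xarray encode the time axis with proper units when writing netCDF, rather than storing strings.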

But if I set the DataFrame index to Date_Time, lat and lon together, it reverts to the plain (index) dimension. I would appreciate pointers on getting the dimensions set. With the netCDF module one can create a dimension with syntax like lat = dataset.createDimension('lat', 73). The SO example add dimension to an xarray DataArray doesn't help either. Maybe I'm missing something, or I'm simply still learning. I'd like to get to the point where the nc dump produces something similar to this:

NetCDF dimension information:
        Name: lat
                size: 73
                type: dtype('float32')
                units: u'degrees_north'
                actual_range: array([ 90., -90.], dtype=float32)
                long_name: u'Latitude'
                standard_name: u'latitude'
                axis: u'Y'
        Name: lon
                size: 144
                type: dtype('float32')
                units: u'degrees_east'
                long_name: u'Longitude'
                actual_range: array([   0. ,  357.5], dtype=float32)
                standard_name: u'longitude'
                axis: u'X'
        Name: time
                size: 366
                type: dtype('float64')
                units: u'hours since 1-1-1 00:00:0.0'
                long_name: u'Time'
                actual_range: array([ 17628096.,  17636856.])
                delta_t: u'0000-00-01 00:00:00'
                standard_name: u'time'
                axis: u'T'
                avg_period: u'0000-00-01 00:00:00'
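Metadata like the units, standard_name and axis entries in this listing can be attached to coordinates when the Dataset is built. The sketch below assumes a regular 2.5-degree grid, which I picked only because it reproduces the sizes 73 and 144 shown above; it is not taken from any real file.

```python
import numpy as np
import xarray

# Hypothetical regular grid, chosen to match the sizes in the listing.
lat = np.linspace(90.0, -90.0, 73)
lon = np.arange(0.0, 360.0, 2.5)

# Each coordinate is given as (dimension name, values, attributes).
ds = xarray.Dataset(
    coords={
        'lat': ('lat', lat, {
            'units': 'degrees_north',
            'long_name': 'Latitude',
            'standard_name': 'latitude',
            'axis': 'Y',
        }),
        'lon': ('lon', lon, {
            'units': 'degrees_east',
            'long_name': 'Longitude',
            'standard_name': 'longitude',
            'axis': 'X',
        }),
    }
)
print(ds)
```

Because lat and lon are named after their own dimensions, xarray treats them as dimension coordinates.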

Otherwise, could I convert the DataFrame columns to NumPy arrays and use the netCDF module instead? I did try something like the following, but I doubt it's on the right path. Many thanks in advance.

#add dimensions
#d = {}
#d['time'] = ('time',df2.Time)
#d['latitude'] = ('latitude',df2.latitude)
#d['longitude'] = ('longitude', df2.longitude)
#d['var'] = (['time','latitude','longitude','Depth'], xr)
#xr = xarray.Dataset(d)
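A dictionary-based construction along these lines can work if every variable is given as a (dimension names, values) pair and the data is already arranged on a grid. The following is a minimal sketch under that assumption; the coordinate values and the all-zero depth array are invented purely to show the shapes involved.

```python
import numpy as np
import pandas as pd
import xarray

# Invented coordinate values, for illustration only.
times = pd.to_datetime(['2017-09-05 13:01:59'])
lats = np.array([-29.04, -29.03])
lons = np.array([31.05, 31.06, 31.07])

# The data must already be a 3-D array shaped (time, latitude, longitude).
depth = np.zeros((len(times), len(lats), len(lons)))

ds = xarray.Dataset(
    data_vars={'depth': (('time', 'latitude', 'longitude'), depth)},
    coords={'time': times, 'latitude': lats, 'longitude': lons},
)
print(ds)
```

The attempt above differs in that it passes the whole Dataset as the values of one variable and lists 'Depth' as a dimension, whereas here depth is the variable and only time, latitude and longitude are dimensions.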
Clint asked Sep 28 '17 19:09



1 Answer

This is easiest to achieve by combining Time, grid_latitude and grid_longitude into a pandas.MultiIndex on the DataFrame with set_index() before converting into an xarray Dataset.

For example:

# note that pandas.DataFrame's to_xarray() method is equivalent to
# xarray.Dataset.from_dataframe()
ds = df.set_index(['Time', 'grid_latitude', 'grid_longitude']).to_xarray()
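As a small self-contained illustration of that one-liner, here is a sketch using three rows copied from the question's table (only the depth column is kept to keep it short). It also shows a side effect worth knowing: each index level becomes its own dimension, so scattered points are expanded onto the full Time x latitude x longitude grid and the unobserved cells are filled with NaN.

```python
import pandas as pd

# Three rows taken from the question's sample data.
df = pd.DataFrame({
    'Time': ['13:01:59'] * 3,
    'grid_latitude': [-29.034083, -29.039367, -29.036650],
    'grid_longitude': [31.068567, 31.059150, 31.059200],
    'depth': [0.0, 0.0, 0.0],
})

ds = df.set_index(['Time', 'grid_latitude', 'grid_longitude']).to_xarray()

print(ds['depth'].dims)                  # ('Time', 'grid_latitude', 'grid_longitude')
print(ds['depth'].shape)                 # (1, 3, 3)
print(int(ds['depth'].notnull().sum()))  # 3 observed cells; the other 6 are NaN
```

Note that the combination of index columns has to be unique for this conversion to work, and for large numbers of scattered points the expanded grid can become very large.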
shoyer answered Nov 05 '22 23:11
