I have roughly 20,000 coordinates for which I need to extract data from a number of NetCDF files, each with roughly 30,000 timesteps (future climate scenarios). Using the solution here is not efficient, and the reason is the time spent at each i,j converting "dsloc" to a dataframe (see the code below). An example NetCDF file can be downloaded from here.
import pandas as pd
import xarray as xr
import time

# Generate some coordinates
coords_data = [{'lat': 68.04, 'lon': 15.20, 'stid': 1},
               {'lat': 67.96, 'lon': 14.95, 'stid': 2}]
crd = pd.DataFrame(coords_data)
lat = crd["lat"]
lon = crd["lon"]
stid = crd["stid"]

NC = xr.open_dataset(nc_file)  # nc_file: path to the NetCDF file
point_list = zip(lat, lon, stid)
temp = pd.DataFrame()
start_time = time.time()
for i, j, id in point_list:
    print(i, j)
    dsloc = NC.sel(lat=i, lon=j, method='nearest')
    print("--- %s seconds ---" % (time.time() - start_time))
    DT = dsloc.to_dataframe()
    DT.insert(loc=0, column="station", value=id)
    DT.reset_index(inplace=True)
    temp = pd.concat([temp, DT], sort=True)
    print("--- %s seconds ---" % (time.time() - start_time))
which results in:
68.04 15.2
--- 0.005853414535522461 seconds ---
--- 9.02660846710205 seconds ---
67.96 14.95
--- 9.028568267822266 seconds ---
--- 16.429600715637207 seconds ---
which means each i,j pair takes around 9 seconds to process. Given the large number of coordinates and NetCDF files with many timesteps, I wonder whether there is a pythonic way to optimize this code. I could also use the CDO and NCO operators, but I ran into a similar performance issue with them too.
This is a perfect use case for xarray's advanced indexing using a DataArray index.
# Make the index on your coordinates DataFrame the station ID,
# then convert to a dataset.
# This results in a Dataset with two DataArrays, lat and lon, each
# of which is indexed by a single dimension, stid.
crd_ix = crd.set_index('stid').to_xarray()

# Now select using the arrays, and the data will be re-oriented to
# contain only the desired pixels, indexed by 'stid'. The non-indexing
# coordinates lat and lon will be indexed by stid as well.
NC.sel(lon=crd_ix.lon, lat=crd_ix.lat, method='nearest')
Other dimensions in the data will be ignored, so if your original data has dimensions (lat, lon, z, time), your new data will have dimensions (stid, z, time).
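To make this concrete, here is a self-contained sketch of the whole pipeline on a small synthetic dataset standing in for the real NetCDF file (the variable name "tas" and the grid shape are assumptions for illustration, not taken from the question's data): one vectorized sel call replaces the per-point loop, and a single to_dataframe conversion produces the long-format table indexed by station and time.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the real NetCDF file; "tas" and the
# grid resolution are illustrative assumptions.
lats = np.linspace(67.0, 69.0, 21)
lons = np.linspace(14.0, 16.0, 21)
times = pd.date_range("2030-01-01", periods=5)
data = np.random.rand(len(times), len(lats), len(lons))
NC = xr.Dataset(
    {"tas": (("time", "lat", "lon"), data)},
    coords={"time": times, "lat": lats, "lon": lons},
)

# Station coordinates, as in the question
crd = pd.DataFrame([{"lat": 68.04, "lon": 15.20, "stid": 1},
                    {"lat": 67.96, "lon": 14.95, "stid": 2}])

# Vectorized nearest-neighbour selection: one call for all stations
crd_ix = crd.set_index("stid").to_xarray()
dsloc = NC.sel(lat=crd_ix.lat, lon=crd_ix.lon, method="nearest")

# One conversion for the whole result: rows are (time, stid) pairs,
# so 2 stations x 5 timesteps = 10 rows
DT = dsloc.to_dataframe().reset_index()
print(len(DT))
```

Since the selection and the dataframe conversion each happen once instead of once per point, the per-point cost that dominated the original loop disappears.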