I have a process that grows a NetCDF file fn every 5 minutes using netCDF4.Dataset(fn, mode='a'). I also have a bokeh server visualization of that NetCDF file using an xarray.Dataset (which I want to keep, because it is so convenient).
The problem is that the NetCDF-updating process fails when trying to add new data to fn while it is open in my bokeh server process via

ds = xarray.open_dataset(fn)
If I use the option autoclose

ds = xarray.open_dataset(fn, autoclose=True)

then updating fn from the other process while ds is "open" in the bokeh server app works, but the updates to the bokeh figure, which pull time slices from fn, get very laggy.
My question is: Is there another way to release the lock on the NetCDF file when using an xarray.Dataset?
I would not mind if the shape of the xarray.Dataset were only updated consistently after reloading the whole bokeh server app.
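(For completeness: the only way I see with plain xarray is to load everything into memory and immediately close the file, as in the minimal sketch below, but that only gives a static snapshot and is not practical for a large, continuously growing file.)

import xarray as xr

# Minimal sketch: read everything into memory, then release the file
# so the appending process can acquire it again. The data in ds stays
# usable after close(), but it is only a snapshot, not a live view.
ds = xr.open_dataset(fn)
ds.load()
ds.close()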
Thanks!
Here is a minimal working example:
Put this into a file and let it run:
import time
from datetime import datetime
import numpy as np
import netCDF4
fn = 'my_growing_file.nc'
with netCDF4.Dataset(fn, 'w') as nc_fh:
    # create dimensions
    nc_fh.createDimension('x', 90)
    nc_fh.createDimension('y', 90)
    nc_fh.createDimension('time', None)

    # create variables
    nc_fh.createVariable('x', 'f8', ('x'))
    nc_fh.createVariable('y', 'f8', ('y'))
    nc_fh.createVariable('time', 'f8', ('time'))
    nc_fh.createVariable('rainfall_amount',
                         'i2',
                         ('time', 'y', 'x'),
                         zlib=False,
                         complevel=0,
                         fill_value=-9999,
                         chunksizes=(1, 90, 90))
    nc_fh['rainfall_amount'].scale_factor = 0.1
    nc_fh['rainfall_amount'].add_offset = 0

    nc_fh.set_auto_maskandscale(True)

    # variable attributes
    nc_fh['time'].long_name = 'Time'
    nc_fh['time'].standard_name = 'time'
    nc_fh['time'].units = 'hours since 2000-01-01 00:50:00.0'
    nc_fh['time'].calendar = 'standard'

for i in range(1000):
    with netCDF4.Dataset(fn, 'a') as nc_fh:
        current_length = len(nc_fh['time'])

        print('Appending to NetCDF file {}'.format(fn))
        print(' length of time vector: {}'.format(current_length))

        if current_length > 0:
            last_time_stamp = netCDF4.num2date(
                nc_fh['time'][-1],
                units=nc_fh['time'].units,
                calendar=nc_fh['time'].calendar)
            print(' last time stamp in NetCDF: {}'.format(str(last_time_stamp)))
        else:
            last_time_stamp = '1900-01-01'
            print(' empty file, starting from scratch')

        nc_fh['time'][i] = netCDF4.date2num(
            datetime.utcnow(),
            units=nc_fh['time'].units,
            calendar=nc_fh['time'].calendar)
        nc_fh['rainfall_amount'][i, :, :] = np.random.rand(90, 90)

    print('Sleeping...\n')
    time.sleep(3)
Then, go to e.g. IPython and open the growing file via:

import xarray as xr
ds = xr.open_dataset('my_growing_file.nc')
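For example, you can then pull time slices from the dataset, as the bokeh figure does:

# latest 90x90 rainfall field from the example file above
latest_field = ds['rainfall_amount'].isel(time=-1).values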
This will cause the process that appends to the NetCDF to fail with an output like this:
Appending to NetCDF file my_growing_file.nc
length of time vector: 0
empty file, starting from scratch
Sleeping...
Appending to NetCDF file my_growing_file.nc
length of time vector: 1
last time stamp in NetCDF: 2018-04-12 08:52:39.145999
Sleeping...
Appending to NetCDF file my_growing_file.nc
length of time vector: 2
last time stamp in NetCDF: 2018-04-12 08:52:42.159254
Sleeping...
Appending to NetCDF file my_growing_file.nc
length of time vector: 3
last time stamp in NetCDF: 2018-04-12 08:52:45.169516
Sleeping...
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-17-9950ca2e53a6> in <module>()
37
38 for i in range(1000):
---> 39 with netCDF4.Dataset(fn, 'a') as nc_fh:
40 current_length = len(nc_fh['time'])
41
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__()
netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()
IOError: [Errno -101] NetCDF: HDF error: 'my_growing_file.nc'
If using

ds = xr.open_dataset('my_growing_file.nc', autoclose=True)

there is no error, but access times via xarray of course get slower, which is exactly my problem: my dashboard visualization gets very laggy.
I can understand that this is maybe not the intended use case for xarray and, if required, I will fall back to the lower-level interface provided by netCDF4 (hoping that it supports concurrent file access, at least for reads), but I would like to keep xarray for its convenience.
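For reference, such a fallback could look like the following sketch (read_latest_slice is a hypothetical helper: it opens the file read-only, copies the slice it needs, and closes the file again immediately, so the lock is only held briefly; note that HDF5 makes no hard guarantees about reading while another process writes, so occasional retries might still be needed):

import netCDF4

def read_latest_slice(fn='my_growing_file.nc'):
    # Open read-only, copy the data we need, and close right away
    # so that the appending process can re-acquire the file.
    with netCDF4.Dataset(fn, mode='r') as nc_fh:
        t = netCDF4.num2date(nc_fh['time'][-1],
                             units=nc_fh['time'].units,
                             calendar=nc_fh['time'].calendar)
        latest = nc_fh['rainfall_amount'][-1, :, :]
    return t, latest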
I am answering my own question here because I found a solution, or rather, a workaround for the problem with the NetCDF file lock in Python.
A good solution is to use zarr instead of NetCDF files if you want to continuously grow a dataset in a file while keeping it open, e.g. for a real-time visualization.
Luckily, xarray now also makes it easy to append data to an existing zarr file along a selected dimension using the append_dim keyword argument, thanks to a recently merged PR.
The code using zarr, instead of NetCDF as in my question, looks like this:
import dask.array as da
import xarray as xr
import pandas as pd
import datetime
import time
fn = '/tmp/my_growing_file.zarr'

# Create a dummy dataset and write it to zarr
data = da.random.random(size=(30, 900, 1200), chunks=(10, 900, 1200))
t = pd.date_range(end=datetime.datetime.utcnow(), periods=30, freq='1s')
ds = xr.Dataset(
    data_vars={'foo': (('time', 'y', 'x'), data)},
    coords={'time': t},
)

#ds.to_zarr(fn, mode='w', encoding={'foo': {'dtype': 'int16', 'scale_factor': 0.1, '_FillValue':-9999}})
#ds.to_zarr(fn, mode='w', encoding={'time': {'_FillValue': -9999}})
ds.to_zarr(fn, mode='w')

# Append new data in smaller chunks
for i in range(100):
    print('Sleeping for 10 seconds...')
    time.sleep(10)
    data = 0.01 * i + da.random.random(size=(7, 900, 1200), chunks=(7, 900, 1200))
    t = pd.date_range(end=datetime.datetime.utcnow(), periods=7, freq='1s')
    ds = xr.Dataset(
        data_vars={'foo': (('time', 'y', 'x'), data)},
        coords={'time': t},
    )
    print(f'Appending 7 new time slices with latest time stamp {t[-1]}')
    ds.to_zarr(fn, append_dim='time')
You can then open another Python process, e.g. IPython, and do
ds = xr.open_zarr('/tmp/my_growing_file.zarr/')
over and over again without crashing the writer process.
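A minimal polling reader, e.g. for a dashboard, could look like this sketch (assuming the writer above is running; closing and re-opening the store picks up the newly appended time steps):

import time
import xarray as xr

for _ in range(10):
    ds = xr.open_zarr('/tmp/my_growing_file.zarr')
    print('time steps in store: {}, latest: {}'.format(
        ds.dims['time'], ds['time'].values[-1]))
    ds.close()
    time.sleep(10)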
I used xarray version 0.15.0 and zarr version 2.4.0 for this example.
Some additional notes:
Note that the code in this example deliberately appends in small batches (7 time slices) that do not evenly divide the chunk size of the zarr file (10 time slices per chunk), to see how this affects the chunking. From my tests I can say that the initially chosen chunk size of the zarr file is preserved, which is great!
Also note that the code generates a warning when appending, because the datetime64 data is encoded and stored as an integer by xarray to comply with the CF conventions for NetCDF. This also works for zarr files, but currently the _FillValue does not seem to be set automatically. As long as you do not have NaT values in your time data, this should not matter.
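If the warning bothers you, explicitly setting an encoding for time on the initial write should take care of it (a variant of the commented-out lines in the code above; the fill value -9999 is an arbitrary choice):

ds.to_zarr(fn, mode='w', encoding={'time': {'_FillValue': -9999}})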
Disclaimer: I have not yet tried this with a larger dataset and a long-running process growing the file, so I cannot comment on possible performance degradation or other problems that might occur if the zarr file or its metadata somehow get fragmented by this process.