
Is there a way to release the file lock for an xarray.Dataset?

I have a process that grows a NetCDF file fn every 5 minutes using netCDF4.Dataset(fn, mode='a'). I also have a bokeh server visualization of that NetCDF file using an xarray.Dataset (which I want to keep, because it is so convenient).

The problem is that the NetCDF update process fails when trying to add new data to fn if the file is open in my bokeh server process via

ds = xarray.open_dataset(fn)

If I use the option autoclose

ds = xarray.open_dataset(fn, autoclose=True)

updating fn with the other process while ds is "open" in the bokeh server app works, but the updates to the bokeh figure, which pull time slices from fn, get very laggy.
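
For context, here is a stripped-down sketch of the kind of update callback my bokeh app uses (placeholder names like image_source, not my actual dashboard code); with autoclose=True, every access to ds re-opens the NetCDF file, which is what makes the figure updates laggy:

import xarray as xr
from bokeh.models import ColumnDataSource
from bokeh.plotting import curdoc

fn = 'my_growing_file.nc'
ds = xr.open_dataset(fn, autoclose=True)
image_source = ColumnDataSource(data={'image': []})

def update_figure():
    # with autoclose=True this access re-opens (and closes) the file
    latest = ds['rainfall_amount'].isel(time=-1).values
    image_source.data = {'image': [latest]}

# pull a new time slice every 5 minutes
curdoc().add_periodic_callback(update_figure, 5 * 60 * 1000)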

My question is: Is there another way to release the lock of the NetCDF file when using xarray.Dataset?

I would not mind if the shape of the xarray.Dataset were only updated consistently after reloading the whole bokeh server app.

Thanks!

Here is a minimal working example:

Put this into a file and let it run:

import time
from datetime import datetime

import numpy as np
import netCDF4

fn = 'my_growing_file.nc'

with netCDF4.Dataset(fn, 'w') as nc_fh:
    # create dimensions
    nc_fh.createDimension('x', 90)
    nc_fh.createDimension('y', 90)
    nc_fh.createDimension('time', None)

    # create variables
    nc_fh.createVariable('x', 'f8', ('x',))
    nc_fh.createVariable('y', 'f8', ('y',))
    nc_fh.createVariable('time', 'f8', ('time',))
    nc_fh.createVariable('rainfall_amount',
                         'i2',
                         ('time', 'y', 'x'),
                         zlib=False,
                         complevel=0,
                         fill_value=-9999,
                         chunksizes=(1, 90, 90))
    nc_fh['rainfall_amount'].scale_factor = 0.1
    nc_fh['rainfall_amount'].add_offset = 0

    nc_fh.set_auto_maskandscale(True)

    # variable attributes
    nc_fh['time'].long_name = 'Time'
    nc_fh['time'].standard_name = 'time'
    nc_fh['time'].units = 'hours since 2000-01-01 00:50:00.0'
    nc_fh['time'].calendar = 'standard'

for i in range(1000):
    with netCDF4.Dataset(fn, 'a') as nc_fh:
        current_length = len(nc_fh['time'])

        print('Appending to NetCDF file {}'.format(fn))
        print(' length of time vector: {}'.format(current_length))

        if current_length > 0:
            last_time_stamp = netCDF4.num2date(
                nc_fh['time'][-1],
                units=nc_fh['time'].units,
                calendar=nc_fh['time'].calendar)
            print(' last time stamp in NetCDF: {}'.format(str(last_time_stamp)))
        else:
            last_time_stamp = '1900-01-01'
            print(' empty file, starting from scratch')

        nc_fh['time'][i] = netCDF4.date2num(
            datetime.utcnow(),
            units=nc_fh['time'].units,
            calendar=nc_fh['time'].calendar)
        nc_fh['rainfall_amount'][i, :, :] = np.random.rand(90, 90)

    print('Sleeping...\n')
    time.sleep(3)

Then, go to e.g. IPython and open the growing file via:

import xarray as xr

ds = xr.open_dataset('my_growing_file.nc')

This will cause the process that appends to the NetCDF file to fail with an output like this:

Appending to NetCDF file my_growing_file.nc
 length of time vector: 0
 empty file, starting from scratch
Sleeping...

Appending to NetCDF file my_growing_file.nc
 length of time vector: 1
 last time stamp in NetCDF: 2018-04-12 08:52:39.145999
Sleeping...

Appending to NetCDF file my_growing_file.nc
 length of time vector: 2
 last time stamp in NetCDF: 2018-04-12 08:52:42.159254
Sleeping...

Appending to NetCDF file my_growing_file.nc
 length of time vector: 3
 last time stamp in NetCDF: 2018-04-12 08:52:45.169516
Sleeping...

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-17-9950ca2e53a6> in <module>()
     37 
     38 for i in range(1000):
---> 39     with netCDF4.Dataset(fn, 'a') as nc_fh:
     40         current_length = len(nc_fh['time'])
     41 

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()

IOError: [Errno -101] NetCDF: HDF error: 'my_growing_file.nc'

If using

ds = xr.open_dataset('my_growing_file.nc', autoclose=True)

there is no error, but access times via xarray of course get slower, which is exactly my problem, since it makes my dashboard visualization very laggy.

I can understand that this is maybe not the intended use for xarray and, if required, I will fall back to the lower level interface provided by netCDF4 (hoping that it supports concurrent file access, at least for reads), but I would like to keep xarray for its convenience.
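
For reference, this is the kind of lower-level fallback I have in mind (just a sketch, not what I ended up using): open the file read-only only for the moment the dashboard needs a slice, then close it again so the writer is blocked as briefly as possible. Whether concurrent reads are really safe here still depends on the HDF5 file locking.

import netCDF4

def read_latest_slice(fn='my_growing_file.nc', var='rainfall_amount'):
    # open read-only, grab the newest time slice, close immediately
    with netCDF4.Dataset(fn, mode='r') as nc_fh:
        return nc_fh[var][-1, :, :]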


1 Answer

I am answering my own question here because I found a solution, or rather a way around this problem with the NetCDF file lock in Python.

A good solution is to use zarr instead of NetCDF if you want to continuously grow a dataset on disk while keeping it open for, e.g., a real-time visualization.

Luckily, xarray now also makes it easy to append data to an existing zarr store along a selected dimension using the append_dim keyword argument, thanks to a recently merged PR.

The code for using zarr, instead of NetCDF as in my question, looks like this:


import dask.array as da
import xarray as xr
import pandas as pd
import datetime
import time

fn = '/tmp/my_growing_file.zarr'

# Create a dummy dataset and write it to zarr
data = da.random.random(size=(30, 900, 1200), chunks=(10, 900, 1200))
t = pd.date_range(end=datetime.datetime.utcnow(), periods=30, freq='1s')
ds = xr.Dataset(
    data_vars={'foo': (('time', 'y', 'x'), data)},
    coords={'time': t},
)
#ds.to_zarr(fn, mode='w', encoding={'foo': {'dtype': 'int16', 'scale_factor': 0.1, '_FillValue':-9999}})
#ds.to_zarr(fn, mode='w', encoding={'time': {'_FillValue': -9999}})
ds.to_zarr(fn, mode='w')

# Append new data in smaller chunks
for i in range(100):
    print('Sleeping for 10 seconds...')
    time.sleep(10)

    data = 0.01 * i + da.random.random(size=(7, 900, 1200), chunks=(7, 900, 1200))
    t = pd.date_range(end=datetime.datetime.utcnow(), periods=7, freq='1s')
    ds = xr.Dataset(
        data_vars={'foo': (('time', 'y', 'x'), data)},
        coords={'time': t},
    )
    print(f'Appending 7 new time slices with latest time stamp {t[-1]}')
    ds.to_zarr(fn, append_dim='time')

You can then open another Python process, e.g. IPython and do

import xarray as xr

ds = xr.open_zarr('/tmp/my_growing_file.zarr')

over and over again without crashing the writer process.
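
For the dashboard use case, a reader callback could look roughly like this (a sketch using the variable and path from the example above); since xr.open_zarr only reads the metadata, re-opening the store on every update is cheap:

import xarray as xr

def latest_slice(fn='/tmp/my_growing_file.zarr'):
    # re-open the store (only metadata is read) and load the newest
    # time slice into memory for plotting
    ds = xr.open_zarr(fn)
    return ds['foo'].isel(time=-1).load()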

I used xarray version 0.15.0 and zarr version 2.4.0 for this example.

Some additional notes:

Note that the code in this example deliberately appends in small batches of 7 time slices, which do not evenly divide the chunk size of 10 along time chosen for the zarr store, to see how this affects the chunks. From my tests I can say that the initially chosen chunk size of the zarr store is preserved, which is great! You can verify this with the snippet below.
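
One way to check this (a sketch, not the exact code from my tests; the commented values are what I would expect for the example above, not verbatim output):

ds_check = xr.open_zarr(fn)
# chunk shape stored in the zarr metadata, should still be (10, 900, 1200)
print(ds_check['foo'].encoding['chunks'])
# dask chunking along 'time' after several appends of 7 slices each
print(ds_check['foo'].data.chunks)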

Also note that the code generates a warning when appending, because the datetime64 data is encoded and stored as integers by xarray to comply with the CF conventions for NetCDF. This also works for zarr stores, but currently it seems that the _FillValue is not set automatically. As long as you do not have NaT in your time data, this should not matter.
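
If you want to set the _FillValue for time explicitly on the first write, the encoding argument of to_zarr should do it (this is the same idea as the second commented-out to_zarr line in the script above; I have not verified whether it also silences the warning):

ds.to_zarr(fn, mode='w', encoding={'time': {'_FillValue': -9999}})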

Disclaimer: I have not yet tried this with a larger dataset and a long-running process that grows the file, so I cannot comment on possible performance degradation or other problems that might occur if the zarr store or its metadata somehow gets fragmented by this process.
