In my experimentation so far, I've tried:

1. xr.open_dataset with the chunks arg, and it loads the data into memory.
2. Setting up a NetCDF4DataStore and calling ds['field'].values, and it loads the data into memory.
3. Setting up a ScipyDataStore with mmap='r', and ds['field'].values loads the data into memory.

From what I have seen, the design seems to center not around actually applying numpy functions to memory-mapped arrays, but rather around loading small chunks into memory (sometimes using memory-mapping to do so). For example, this comment. And a somewhat related comment here about xarray not being able to determine whether a numpy array is mmapped or not.
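For concreteness, here is roughly how those attempts look in code. The file name, variable name, and chunk sizes are placeholders, and the DataStore constructors have shifted between xarray versions, so treat the exact calls as illustrative rather than canonical:

import xarray as xr

# 1) open_dataset with chunks, then pull values
ds = xr.open_dataset('file.nc', chunks={'dim1': 1000})
vals = ds['field'].values                          # materialises the data in memory

# 2) go through the netCDF4 store directly
store = xr.backends.NetCDF4DataStore.open('file.nc')
vals = xr.open_dataset(store)['field'].values      # also ends up fully in memory

# 3) the scipy store, asking it to memory-map the file
store = xr.backends.ScipyDataStore('file.nc', mmap=True)
vals = xr.open_dataset(store)['field'].values      # still ends up as an in-memory copy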
I'd like to be able to represent and slice data as an xarray.Dataset, and be able to call .values (or .data) to get an ndarray, but have it remain mmapped (for purposes of shared memory and so on).
It would also be nice if chunked dask operations could at least operate on the memory-mapped array until they actually need to mutate something, which seems possible since dask appears to be designed around immutable arrays.
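As a sanity check on that idea, dask can wrap an existing memory-mapped array directly. This is only a sketch (the file name and chunk sizes are made up), but each chunk is then read from the memmap lazily, only when a task actually needs it:

import numpy as np
import dask.array as da

mm = np.load('file.npy', mmap_mode='r')         # read-only memory map
darr = da.from_array(mm, chunks=(1000, 1000))   # lazy, chunked wrapper around the memmap
darr[:1000, :1000].sum().compute()              # reads only the first chunk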
I did find a trick with xarray, though, which is to do the following:

import numpy as np
import xarray as xr
data = np.load('file.npy', mmap_mode='r')
ds = xr.Dataset({'foo': (['dim1', 'dim2'], data)})
At this point, things like the following work without loading anything into memory:
np.sum(ds['foo'].values)
np.sum(ds['foo'][::2,:].values)
...xarray apparently doesn't know that the array is mmapped, and can't afford to impose a np.copy for cases like these.
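One way to convince yourself the trick is holding up (this is just a check, not anything xarray documents): look at what actually backs the wrapped variable. On the versions where the trick above works, .data hands back the original memmap and .values gives views of it rather than copies:

print(type(ds['foo'].data))                              # numpy.memmap -- xarray kept the original array
print(np.shares_memory(ds['foo'][::2, :].values, data))  # True: still a view, no copy made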
Is there a "supported" way to do read-only memmapping (or copy-on-write for that matter) in xarray or dask?
xr.open_dataset with chunks= should not immediately load data into memory; it should create a dask.array, which evaluates lazily.
testfile = '/Users/mdurant/data/smith_sandwell_topo_v8_2.nc'
arr = xr.open_dataset(testfile, chunks={'latitude': 6336//11, 'longitude': 10800//15}).ROSE
arr
<xarray.DataArray 'ROSE' (latitude: 6336, longitude: 10800)>
dask.array</Users/mdurant/data/smith_sandwell_topo_v8_2.nc:/ROSE, shape=(6336, 10800), dtype=float64, chunksize=(576, 720)>
Coordinates:
* longitude (longitude) float32 0.0166667 0.05 0.0833333 0.116667 0.15 ...
* latitude (latitude) float32 -72.0009 -71.9905 -71.9802 -71.9699 ...
Attributes:
long_name: Topography and Bathymetry ( 8123m -> -10799m)
units: meters
valid_range: [-32766 32767]
unpacked_missing_value: -32767.0
(note the dask.array in the above)
Many xarray operations on this may be lazy and work chunkwise (and if you slice, only the required chunks will be loaded):
arr.sum()
<xarray.DataArray 'ROSE' ()>
dask.array<sum-aggregate, shape=(), dtype=float64, chunksize=()>
arr.sum().values # evaluates
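To illustrate the slicing point with the same array: a selection that falls entirely within a single chunk only causes that chunk to be read when it is evaluated (the indices below just happen to match the chunk size used above).

subset = arr[:576, :720]   # lies entirely within one (576, 720) chunk
subset.mean().values       # evaluating this reads only that chunk from disk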
This is not the same as memory mapping, however, so I appreciate that it may not answer your question.
With dask's threaded scheduler, in-memory values are available to the other workers, so sharing would be quite efficient. Furthermore, the distributed scheduler is quite good at recognising when results can be reused within a computation graph or between graphs.
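As a rough illustration of that reuse (this example is mine, not from the answer above): if two results are requested in a single compute call, dask merges their graphs, so the chunk reads they have in common are only performed once.

import dask

total, mean = dask.compute(arr.sum(), arr.mean())   # shared chunk reads happen once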