 

Conversion from (xarray) dask.array to numpy array is very slow

I recently started using xarray, and it's a very helpful tool. However, I am having a minor issue which I am sure has an easy fix. From a multidimensional (time, latitude, longitude) dask.array I want to select the time series at a given latitude and longitude. Slicing with .sel works very well and fast, but when I try to get the actual values with np.array it takes a very long time. The following works very well:

%time y = ENS_MEAN.prec.sel(latitude=20, longitude=20)
%time ENS_MEAN.prec.sel(latitude=20, longitude=20)

Output:

CPU times: user 10.7 ms, sys: 908 µs, total: 11.6 ms
Wall time: 10.6 ms
CPU times: user 2.94 ms, sys: 0 ns, total: 2.94 ms
Wall time: 2.83 ms



Out[42]:
<xarray.DataArray 'prec' (time: 29)>
dask.array<getitem..., shape=(29,), dtype=float64, chunksize=(29,)>
Coordinates:
    longitude  float32 20.0
    latitude   float32 20.0
  * time       (time) datetime64[ns] 1982-05-01 1983-05-01 1984-05-01 ...

But when I try to get the actual values as a numpy array (see below), the conversion takes over two minutes. I am wondering if the issue has to do with the chunk size?

%time np.array(y)

CPU times: user 2min 12s, sys: 47.1 s, total: 2min 59s
Wall time: 2min 20s

/home/......./anaconda3/lib/python3.5/site-packages/dask/array/numpy_compat.py:45: RuntimeWarning: invalid value encountered in true_divide
  x = np.divide(x1, x2, out)

Out[41]:
array([-0.00881837, -0.02694129,  0.03033962,  0.01635965, -0.01392146,
       -0.03904842, -0.00269604, -0.00114008,  0.0051511 , -0.02376757,
       -0.01574946, -0.01025411, -0.01544669, -0.02065624, -0.02342096,
       -0.01664323,  0.08460527,  0.04862781, -0.0035033 , -0.00721429,
       -0.00995117,  0.0263697 , -0.00358022,  0.00376811, -0.01527904,
       -0.00548013,  0.03295138, -0.01114444,  0.02648388])

Thanks so much for answering my question.

asked by Shrad

1 Answer

In this case, nothing is actually computed until you call np.array() -- only the abstract computation graph is built up before then.
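Here's a minimal sketch of that laziness, using the names from your question (the file name is a placeholder):

import numpy as np
import xarray as xr

# Opening with chunks gives dask-backed (lazy) variables.
ENS_MEAN = xr.open_dataset('ens_mean.nc', chunks={'time': 29})

# This only builds a dask graph; it returns almost instantly.
y = ENS_MEAN.prec.sel(latitude=20, longitude=20)

# Any of these lines triggers the actual computation (and the disk reads):
arr = np.array(y)    # coerce to a plain numpy array
arr = y.values       # same thing, via the .values attribute
res = y.compute()    # an in-memory xarray.DataArray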

One simple thing that should work is to set a smaller chunk size when you load the data from disk, e.g., ds = xarray.open_dataset(..., chunks={'latitude': 1, 'longitude': 1}). Dask is supposed to optimize indexing operations like this away, but we've encountered some issues with that recently -- see this GitHub issue for discussion.
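For instance, here's a sketch assuming the data lives in a single netCDF file (the file name is hypothetical):

import xarray as xr

# Chunk along the dimensions you index into, so that selecting a single
# (latitude, longitude) point only has to read the chunks it needs.
ds = xr.open_dataset('ens_mean.nc',
                     chunks={'latitude': 1, 'longitude': 1})

# The point selection now maps onto one small chunk instead of pulling
# the whole array through memory, so converting to numpy should be
# much faster.
y = ds.prec.sel(latitude=20, longitude=20)
print(y.values)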

answered by shoyer