I'm used to doing "complex" filtering on pandas DataFrame objects:
import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.random((10000, 2)) * 512, columns=["x", "y"])
data2 = data[np.sqrt((data.x - 200)**2 + (data.y - 200)**2) < 1]
This produces no problems.
But with dask DataFrames I have:
ddata = dask.dataframe.from_pandas(data, 8)
ddata2 = ddata[np.sqrt((ddata.x - 200)**2 + (ddata.y - 200)**2) < 1]
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-13-c2acf73dddf6> in <module>()
----> 1 ddata2 = ddata[np.sqrt((ddata.x - 200)**2 + (ddata.y - 200)**2) < 1]
~/anaconda3/lib/python3.6/site-packages/dask/dataframe/core.py in __getitem__(self, key)
2115 return new_dd_object(merge(self.dask, key.dask, dsk), name,
2116 self, self.divisions)
-> 2117 raise NotImplementedError(key)
2118
2119 def __setitem__(self, key, value):
NotImplementedError: 0 False
Meanwhile a simpler operation:
ddata2 = ddata[ddata.x < 200]
works fine.
I think the issue is that as soon as I do any "complex" math (i.e. the np.sqrt
) the result is no longer a lazy dask DataFrame.
Is there a way around this? Do I have to create a new column that I can then filter on or is there some better way?
If you replace np.sqrt
with da.sqrt
then everything works fine.
import dask.array as da
You may notice that np.sqrt
of a dask series produces a numpy array, so this step in your computation is not lazy, but forces a concrete result. Use the dask equivalent function to maintain laziness and to keep everything dask-compliant.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With