Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Complex filtering in dask DataFrame

Tags:

I'm used to doing "complex" filtering on pandas DataFrame objects:

import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.random((10000, 2)) * 512, columns=["x", "y"])
data2 = data[np.sqrt((data.x - 200)**2 + (data.y - 200)**2) < 1]

This produces no problems.

But with dask DataFrames I have:

ddata = dask.dataframe.from_pandas(data, 8)
ddata2 = ddata[np.sqrt((ddata.x - 200)**2 + (ddata.y - 200)**2) < 1]
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-13-c2acf73dddf6> in <module>()
----> 1 ddata2 = ddata[np.sqrt((ddata.x - 200)**2 + (ddata.y - 200)**2) < 1]

~/anaconda3/lib/python3.6/site-packages/dask/dataframe/core.py in __getitem__(self, key)
   2115             return new_dd_object(merge(self.dask, key.dask, dsk), name,
   2116                                  self, self.divisions)
-> 2117         raise NotImplementedError(key)
   2118 
   2119     def __setitem__(self, key, value):

NotImplementedError: 0       False

Meanwhile a simpler operation:

ddata2 = ddata[ddata.x < 200]

works fine.

I think the issue is that as soon as I do any "complex" math (i.e. the np.sqrt) the result is no longer a lazy dask DataFrame.

Is there a way around this? Do I have to create a new column that I can then filter on or is there some better way?

like image 807
David Hoffman Avatar asked Aug 15 '17 02:08

David Hoffman


1 Answers

If you replace np.sqrt with da.sqrt then everything works fine.

import dask.array as da

You may notice that np.sqrt of a dask series produces a numpy array, so this step in your computation is not lazy, but forces a concrete result. Use the dask equivalent function to maintain laziness and to keep everything dask-compliant.

like image 72
MRocklin Avatar answered Sep 30 '22 13:09

MRocklin