I have a dask Series from which I need to drop both infs and nans. .dropna() only drops the nans. In numpy/pandas, I would do something like result = result[np.isfinite(result)]. What's the recommended equivalent in dask-land? Indexing the dask object with a boolean array gives an error. Is there some way to tell dask that inf or -inf should be considered null values, for example?
You should avoid applying NumPy functions directly to dask objects. Doing so can trigger computation, and subsequent dask.dataframe operations may not be able to use the results.
Instead, use the equivalent dask.array function. Here is a minimal example.
In [1]: import numpy as np
...: import pandas as pd
...: import dask.dataframe as dd
...: import dask.array as da
...: df = pd.DataFrame({'x': [0, 1, 2], 'y': [0, np.inf, 5]})
...: df
...:
Out[1]:
   x         y
0  0  0.000000
1  1       inf
2  2  5.000000
In [2]: ddf = dd.from_pandas(df, npartitions=2)
...: ddf[~da.isinf(ddf.y)].compute()
...:
Out[2]:
   x    y
0  0  0.0
2  2  5.0