
How do you drop infs from dask dataframe/series?

Tags:

dask

I have a dask Series from which I need to drop both infs and nans. .dropna() only drops the nans. In numpy/pandas, I would do something like result = result[np.isfinite(result)]. What's the recommended equivalent in dask-land? Indexing the dask object with a boolean array gives an error. Is there some way to tell dask that inf or -inf should be considered null values, for example?

Tim Morton asked Apr 11 '26 21:04


1 Answer

You should avoid calling NumPy functions on dask objects here. They trigger computation, and later dask.dataframe operations may not work reliably with the concrete results they return.

Instead, use the equivalent dask.array function. Here is a minimal example.

In [1]: import numpy as np
   ...: import pandas as pd
   ...: import dask.dataframe as dd
   ...: import dask.array as da
   ...: df = pd.DataFrame({'x': [0, 1, 2], 'y': [0, np.inf, 5]})
   ...: df
   ...: 
Out[1]: 
   x         y
0  0  0.000000
1  1       inf
2  2  5.000000

In [2]: ddf = dd.from_pandas(df, npartitions=2)
   ...: ddf[~da.isinf(ddf.y)].compute()
   ...: 
Out[2]: 
   x    y
0  0  0.0
2  2  5.0
MRocklin answered Apr 19 '26 03:04