I have tried to apply a filter to my dask dataframe to remove columns with too many NAs:

df.dropna(axis=1, how='all', thresh=round(len(df) * .8))

Unfortunately, the dask dropna API differs slightly from the pandas one and accepts neither an axis nor a thresh argument.
One partial workaround is to iterate column by column and remove those that are constant (regardless of whether they are filled with NAs or not, as I do not mind getting rid of constants):

for col in df.columns:
    if len(df[col].unique()) == 1:
        df = df.drop(col, axis=1)
But this does not let me apply a threshold. I could compute the threshold manually by adding:

elif df[col].isnull().sum().compute() / len(df[col]) > 0.8:
    df = df.drop(col, axis=1)

But I'm not sure that calling compute and len once per column is optimal, and I would be curious to know whether there is a better way to go about this?
Dask now has axis, thresh, and subset arguments that may help. The previous answer can be rewritten as:
df.dropna(subset=columns_to_inspect, thresh=threshold_to_drop_na, axis=1)
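For reference, this is the thresh semantics the call relies on, shown in plain pandas (the names columns_to_inspect and threshold_to_drop_na above are placeholders; whether your installed dask version forwards axis=1 is worth checking against its documentation). thresh is the minimum number of non-NA values a column must contain to survive:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": range(10),              # 10 non-NA values -> kept
    "b": [np.nan] * 10,          # 0 non-NA values  -> dropped
    "c": [1.0] + [np.nan] * 9,   # 1 non-NA value   -> dropped
})

# A column needs at least round(10 * 0.8) = 8 non-NA values to be kept,
# matching the thresh=round(len(df) * .8) call from the question.
cleaned = df.dropna(axis=1, thresh=round(len(df) * 0.8))
```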
You're right, there is no way to do this using df.dropna(). I would suggest the following expression instead:
df.loc[:,df.isnull().sum()<THRESHOLD]
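A pandas sketch of how that expression behaves (THRESHOLD here is an illustrative maximum NA count per column; with dask the boolean mask may need to be materialized first, e.g. (df.isnull().sum() < THRESHOLD).compute(), before it can be used as a column selector):

```python
import numpy as np
import pandas as pd

THRESHOLD = 2  # illustrative: tolerate fewer than 2 NAs per column

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0],          # 0 NAs -> kept
    "y": [np.nan, np.nan, np.nan],  # 3 NAs -> dropped
    "z": [1.0, np.nan, 3.0],        # 1 NA  -> kept
})

# df.isnull().sum() counts NAs per column; the resulting boolean
# series keeps only the columns below the threshold.
filtered = df.loc[:, df.isnull().sum() < THRESHOLD]
```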
We had a similar problem and used the following code:

for col in df.columns:
    if df[col].isnull().all().compute():
        df = df.drop(col, axis=1)
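The same effect can be had without a per-column loop by building the mask of all-null columns in one pass. A pandas sketch (with dask, the mask would be df.isnull().all().compute(), a single pass instead of one compute per column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "keep": [1.0, np.nan, 3.0],
    "empty": [np.nan, np.nan, np.nan],
})

# Boolean series: True for columns that are entirely NA.
all_null = df.isnull().all()

# Drop those columns in a single operation.
df = df.loc[:, ~all_null]
```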