
Dask: Drop NAs on columns?

I tried to filter my Dask dataframe to remove columns with too many NAs:

df.dropna(axis=1, how='all', thresh=round(len(df) * .8))

Unfortunately, it seems that the Dask dropna API differs slightly from that of pandas and accepts neither an axis nor a threshold. One partial workaround is to iterate column by column and remove those that are constant (regardless of whether they are filled with NAs or not, as I do not mind getting rid of constants):

    for col in df.columns:
        if len(df[col].unique()) == 1:
            df = df.drop(col, axis=1)

But this does not let me apply a threshold. I could compute the threshold manually by adding:

        elif sum(df[col].isnull().compute()) / len(df[col]) > 0.8:
            df = df.drop(col, axis=1)

But I'm not sure that calling compute and len at this point is optimal, and I would be curious to know whether there is a better way to go about this.

Robert T. Tusk asked Oct 17 '18 08:10


2 Answers

Update 10 Aug 2021:

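Dask's dropna has since gained thresh and subset arguments that may help. Whether axis is also accepted appears to depend on the Dask version, so check the signature of your installed release. Where it is supported, the previous answer can be rewritten as: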

df.dropna(subset=columns_to_inspect, thresh=threshold_to_drop_na, axis=1)

Old answer

You're right, there is no way to do this by using df.dropna().

I would suggest using this expression instead: df.loc[:, df.isnull().sum() < THRESHOLD]
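A minimal sketch of the same idea, expressed as a fraction of missing values rather than a raw count. It assumes df is a Dask DataFrame; the helper names columns_to_keep and drop_sparse_columns are hypothetical, and the 0.8 cutoff is taken from the question. Because the per-column statistic is lazy in Dask, it is computed once up front and the result is used to pick column names:

```python
def columns_to_keep(null_fraction, max_null_fraction=0.8):
    """Return the column names whose share of missing values is acceptable.

    `null_fraction` maps column name -> fraction of nulls (0.0 to 1.0);
    a plain dict and a pandas Series both work, since both expose .items().
    """
    return [col for col, frac in null_fraction.items() if frac <= max_null_fraction]


def drop_sparse_columns(ddf, max_null_fraction=0.8):
    """Drop mostly-missing columns from a Dask DataFrame (hypothetical helper)."""
    # isnull().mean() is lazy; one compute() call evaluates it for every
    # column at once and returns a small pandas Series indexed by column name.
    null_fraction = ddf.isnull().mean().compute()
    return ddf[columns_to_keep(null_fraction, max_null_fraction)]
```

Calling drop_sparse_columns(df) would then return a new lazy dataframe with only the retained columns, so there is a single compute() call rather than one per column.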

Starukhin Yaroslav answered Nov 16 '22 02:11


We had a similar problem and used the following code:

for col in df.columns:
    # Drop the column only if every value in it is missing
    if df[col].isnull().all().compute():
        df = df.drop(col, axis=1)
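
The loop above calls compute() once per column, which can mean rescanning the data many times. A possible variant, sketched here under the assumption that df is a Dask DataFrame, computes the all-null flag for every column in a single call. The helper names all_null_columns and drop_all_null_columns are hypothetical:

```python
def all_null_columns(is_all_null):
    """Return the names of columns flagged as entirely missing.

    `is_all_null` maps column name -> bool; a dict or a pandas Series works.
    """
    return [col for col, flag in is_all_null.items() if flag]


def drop_all_null_columns(ddf):
    """Drop columns that contain only missing values (hypothetical helper)."""
    # One compute() call evaluates the flag for all columns together.
    flags = ddf.isnull().all().compute()
    return ddf.drop(columns=all_null_columns(flags))
```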
skibee answered Nov 16 '22 01:11