Let's say I have the following dataframe:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=[1, np.nan, np.nan, 1, 1, np.nan, 1, 1, 1],
                   columns=['X'],
                   index=['a', 'a', 'a',
                          'b', 'b', 'b',
                          'c', 'c', 'c'])
print(df1)
X
a 1.0
a NaN
a NaN
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
I want to keep only the index labels that have 2 or more non-NaN entries. In this case, the 'a' rows have only one non-NaN value, so I want to drop them and have my result be:
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
What is the best way to do this? Ideally I want something that works with Dask too, although usually if it works with Pandas it also works in Dask.
Let us try filter
out = df1.groupby(level=0).filter(lambda x : x.isna().sum()<=1)
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
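If the group sizes are not known in advance, the condition from the question (at least 2 non-NaN values) can also be tested directly, because count() skips NaN. A minimal sketch, assuming the same df1 as in the question; the name out_count is only illustrative:

```python
import numpy as np
import pandas as pd

# Same data as in the question
df1 = pd.DataFrame(
    {"X": [1, np.nan, np.nan, 1, 1, np.nan, 1, 1, 1]},
    index=list("aaabbbccc"),
)

# count() ignores NaN, so this keeps every label with at least 2 non-NaN values
out_count = df1.groupby(level=0).filter(lambda g: g["X"].count() >= 2)
print(out_count)
```

On this data it should select the same 'b' and 'c' rows as the NaN-based check.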
Or we do isin
# sum(level=0) was removed in pandas 2.0, so group by the index level explicitly
df1[df1.index.isin(df1.isna().groupby(level=0).sum().loc[lambda x : x['X']<=1].index)]
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
As another option, let's try filtering via GroupBy.transform and boolean indexing:
df1[df1['X'].isna().groupby(df1.index).transform('sum') <= 1]
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
Or, almost the same way,
df1[df1.assign(X=df1['X'].isna()).groupby(level=0)['X'].transform('sum') <= 1]
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0
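Note that counting NaNs and counting non-NaN values only agree when every label has the same number of rows. A small sketch of the difference, using a hypothetical frame df2 (not from the question) in which label 'd' has a single row; the names nan_based and count_based are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'a' and 'b' have three rows each, 'd' has only one
df2 = pd.DataFrame(
    {"X": [1, np.nan, np.nan, 1, 1, np.nan, 1]},
    index=list("aaabbbd"),
)

# NaN-based mask: keep labels with at most one NaN ('d' has no NaN, so it stays)
nan_based = df2[df2["X"].isna().groupby(level=0).transform("sum") <= 1]

# Count-based mask: keep labels with at least two non-NaN values ('d' is dropped)
count_based = df2[df2["X"].groupby(level=0).transform("count") >= 2]

print(nan_based.index.tolist())
print(count_based.index.tolist())
```

If labels can have different numbers of rows, the count-based mask is probably the one that matches "2 or more non-NaN entries".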
You might have a good shot at getting this to work with Dask too.
I am new to Dask and have only looked at some examples and the docs, but the following seems to work:
from dask import dataframe as dd
sd = dd.from_pandas(df1, npartitions=3)
# Flag the NaN values in X, group by the index, and sum to get the NaN count per label
s = sd.X.isna().groupby(sd.index).sum().compute()
# Keep the labels whose NaN count is below 2, then select their rows with loc
out_dd = sd.loc[list(s[s<2].index)]
out_dd.head(6, npartitions=-1)
X
b 1.0
b 1.0
b NaN
c 1.0
c 1.0
c 1.0