I have this DataFrame:
Id,ProductId,Product
1,100,a
1,100,x
1,100,NaN
2,150,NaN
3,150,NaN
4,100,a
4,100,x
4,100,NaN
I want to remove some of the rows that contain NaN, but not all of them. The criterion is: remove a NaN row only if its Id already has a non-missing value in the Product column. For example, Id 1 already has values in the Product column and also contains NaN, so that NaN row should be removed. But for Id 2, only NaN exists in the Product column, so I don't want to remove that one. Similarly for Id 3, there are only NaN values in the Product column, so I want to keep that one too.
Final Output would be like this
Id,ProductId,Product
1,100,a
1,100,x
2,150,NaN
3,150,NaN
4,100,a
4,100,x
Please don't use groupby if an alternative exists, because it is slow.
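For a reproducible setup, the sample frame above can be built like this (the empty Product fields parse as NaN automatically):

```python
import io
import pandas as pd

# Build the sample frame from the CSV shown in the question;
# missing Product values become NaN when parsed.
csv = """Id,ProductId,Product
1,100,a
1,100,x
1,100,
2,150,
3,150,
4,100,a
4,100,x
4,100,
"""
df = pd.read_csv(io.StringIO(csv))
print(df)
```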
# Ids that have at least one non-missing Product value
vals = df.loc[df['Product'].notnull(), 'Id'].unique()
# Drop rows whose Id is in vals and whose Product is NaN
df = df[~(df['Id'].isin(vals) & df['Product'].isnull())]
print(df)
Id ProductId Product
0 1 100 a
1 1 100 x
3 2 150 NaN
4 3 150 NaN
5 4 100 a
6 4 100 x
Explanation:
First get all Id values with at least one non-missing Product:
print(df.loc[df['Product'].notnull(), 'Id'].unique())
[1 4]
Then build a boolean mask of rows whose Id is in those groups and whose Product is missing:
print(df['Id'].isin(vals) & df['Product'].isnull())
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 True
dtype: bool
Invert the boolean mask:
print(~(df['Id'].isin(vals) & df['Product'].isnull()))
0 True
1 True
2 False
3 True
4 True
5 True
6 True
7 False
dtype: bool
And finally filter by boolean indexing:
print(df[~(df['Id'].isin(vals) & df['Product'].isnull())])
Id ProductId Product
0 1 100 a
1 1 100 x
3 2 150 NaN
4 3 150 NaN
5 4 100 a
6 4 100 x
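The two steps can also be combined into a single boolean expression without the intermediate `vals` variable; a minimal sketch on the sample data:

```python
import io
import pandas as pd

csv = """Id,ProductId,Product
1,100,a
1,100,x
1,100,
2,150,
3,150,
4,100,a
4,100,x
4,100,
"""
df = pd.read_csv(io.StringIO(csv))

# A row is dropped only when its Product is NaN AND its Id also
# appears among rows that do have a non-missing Product.
mask = df['Product'].isna() & df['Id'].isin(df.loc[df['Product'].notna(), 'Id'])
out = df[~mask]
print(out)
```

This is the same logic, just inlined, so performance should be essentially identical.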
You can group the DataFrame by Id and drop the NaN rows if the group has more than one element (this assumes Id is the index, e.g. after df.set_index('Id')):
>>> df.groupby(level='Id', group_keys=False).apply(lambda x: x.dropna() if len(x) > 1 else x)
ProductId Product
Id
1 100 a
1 100 x
2 150 NaN
3 150 NaN
4 100 a
4 100 x
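If Id is still a regular column rather than the index, the same idea can be sketched by grouping on the column directly:

```python
import io
import pandas as pd

csv = """Id,ProductId,Product
1,100,a
1,100,x
1,100,
2,150,
3,150,
4,100,a
4,100,x
4,100,
"""
df = pd.read_csv(io.StringIO(csv))

# Group by the Id column and drop NaN rows only in groups
# that contain more than one row.
out = df.groupby('Id', group_keys=False).apply(
    lambda g: g.dropna() if len(g) > 1 else g
)
print(out)
```

Note that an apply with a Python lambda runs once per group, so on large frames the vectorized boolean-indexing approach above will generally be faster.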