Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

remove NaNs if only value already exists in corresponding Ids pandas

Tags:

python

pandas

I have this dataframe

 Id,ProductId,Product
  1,100,a
  1,100,x
  1,100,NaN
  2,150,NaN
  3,150,NaN
  4,100,a
  4,100,x
  4,100,NaN

Here I want to remove some of the rows which contains NaN and some I don't want to remove. The removing criteria is as follow. I want to remove only those NaNs rows whose Id already contains the value in Product columns. for example, here Id1 has already value in Product columns and still contains NaN, so I want to remove that row. But for id2, there exists only NaN in Product column. So I don't want to remove that one. Similarly for Id3 also, there is only NaN values in the Product columns and I want to keep it that one too.

Final Output would be like this

Id,ProductId,Product
1,100,a
1,100,x
2,150,NaN
3,150,NaN
4,100,a
4,100,x
like image 469
Ben Avatar asked Jan 01 '23 15:01

Ben


2 Answers

Dont use groupby if exist alternative, because slow.

vals = df.loc[df['Product'].notnull(), 'Id'].unique()
df = df[~(df['Id'].isin(vals) & df['Product'].isnull())]
print (df)
   Id  ProductId Product
0   1        100       a
1   1        100       x
3   2        150     NaN
4   3        150     NaN
5   4        100       a
6   4        100       x

Explanation:

First get all Id with some non missing values:

print (df.loc[df['Product'].notnull(), 'Id'].unique())
[1 4]

Then check these groups with missing values:

print (df['Id'].isin(vals) & df['Product'].isnull())
0    False
1    False
2     True
3    False
4    False
5    False
6    False
7     True
dtype: bool

Invert boolean mask:

print (~(df['Id'].isin(vals) & df['Product'].isnull()))
0     True
1     True
2    False
3     True
4     True
5     True
6     True
7    False
dtype: bool

And last filter by boolean indexing:

print (df[~(df['Id'].isin(vals) & df['Product'].isnull())])
   Id  ProductId Product
0   1        100       a
1   1        100       x
3   2        150     NaN
4   3        150     NaN
5   4        100       a
6   4        100       x
like image 137
jezrael Avatar answered Jan 05 '23 16:01

jezrael


You can group the dataframe by Id and drop the NaN if the group has more than one element:

>> df.groupby(level='Id', group_keys=False
             ).apply(lambda x: x.dropna() if len(x) > 1 else x)

    ProductId Product
Id                   
1         100       a
1         100       x
2         150     NaN
3         150     NaN
4         100       a
4         100       x
like image 39
Mabel Villalba Avatar answered Jan 05 '23 16:01

Mabel Villalba