I have a pandas dataframe like this:
a b c
0 1 1 1
1 1 1 0
2 2 4 1
3 3 5 0
4 3 5 0
where the first 2 columns ('a' and 'b') are IDs while the last one ('c') is a validation (0 = neg, 1 = pos). I do know how to remove duplicates based on the values of the first 2 columns, however in this case I would also like to get rid of inconsistent data i.e. duplicated data validated both as positive and negative. So for example the first 2 rows are duplicated but inconsistent hence I should remove the entire record, while the last 2 rows are both duplicated and consistent so I'd keep one of the records. The expected result sholud be:
a b c
0 2 4 1
1 3 5 0
The real dataframe can have more than two duplicates per group and as you can see also the index has been changed. Thanks.
First filter rows by GroupBy.transform with SeriesGroupBy.nunique for get only unique values groups with boolean indexing and then DataFrame.drop_duplicates:
df = (df[df.groupby(['a','b'])['c'].transform('nunique').eq(1)]
.drop_duplicates(['a','b'])
.reset_index(drop=True))
print (df)
a b c
0 2 4 1
1 3 5 0
Detail:
print (df.groupby(['a','b'])['c'].transform('nunique'))
0 2
1 2
2 1
3 1
4 1
Name: c, dtype: int64
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With