Removing duplicates based on two columns while deleting inconsistent data

Question

I have a pandas dataframe like this:

where the first two columns ('a' and 'b') are IDs and the last one ('c') is a validation flag (0 = neg, 1 = pos). I know how to remove duplicates based on the values of the first two columns, but in this case I would also like to get rid of inconsistent data, i.e. duplicated rows validated as both positive and negative. For example, the first two rows are duplicated but inconsistent, so both should be removed, while the last two rows are duplicated and consistent, so I'd keep one of them. The expected result should be:

   a  b  c
0  2  4  1
1  3  5  0

The real dataframe can have more than two duplicates per group and, as you can see, the index has also been reset. Thanks.

jezrael · Accepted Answer

First filter the rows with GroupBy.transform and SeriesGroupBy.nunique, using boolean indexing to keep only the groups whose 'c' column has a single unique value, and then apply DataFrame.drop_duplicates:

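# Keep only (a, b) groups whose 'c' has a single distinct value,
# then keep one row per group and renumber the index.
df = (df[df.groupby(['a', 'b'])['c'].transform('nunique').eq(1)]
        .drop_duplicates(['a', 'b'])
        .reset_index(drop=True))
print(df)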
   a  b  c
0  2  4  1
1  3  5  0

Detail (run on the original dataframe, before the reassignment above):

print(df.groupby(['a', 'b'])['c'].transform('nunique'))
0    2
1    2
2    1
3    1
4    1
Name: c, dtype: int64
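
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The sample frame is an assumption: the question's original table is not reproduced above, so the values in the first two rows (the 1, 3 pair) are made up to match the description and the outputs shown. The helper name drop_inconsistent_duplicates is also hypothetical, not something from pandas.

```python
import pandas as pd


def drop_inconsistent_duplicates(frame, keys, label):
    """Drop key groups with conflicting labels, then keep one row per group."""
    # Number of distinct label values in each row's key group, aligned to rows.
    distinct = frame.groupby(keys)[label].transform('nunique')
    return (frame[distinct.eq(1)]      # keep rows from consistent groups only
            .drop_duplicates(keys)     # one row per remaining key group
            .reset_index(drop=True))   # renumber the index from 0


# Assumed sample data: rows 0-1 conflict, row 2 is unique, rows 3-4 agree.
sample = pd.DataFrame({'a': [1, 1, 2, 3, 3],
                       'b': [3, 3, 4, 5, 5],
                       'c': [0, 1, 1, 0, 0]})

result = drop_inconsistent_duplicates(sample, ['a', 'b'], 'c')
print(result)
```

Keeping the input under a separate name (sample here) also means the per-row counts from the Detail step can still be inspected after filtering. Note that nunique skips missing values by default, so whether a row with a missing 'c' should count as a conflict is a separate decision this sketch does not make.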

Tags:

python

python-3.x

pandas

Simosini

1 Answer

jezrael
