I am trying to drop duplicates, but based on some conditions. My dataframe looks like this:
idx a b c d e f
1 1 ss1 0 25 A B
2 3 ss7 0 25 A B
3 5 ss5 0 12 C D
4 11 im3 0 12 C D
5 5 ss8 0 50 C K
6 9 im8 0 5 F G
7 8 ix6 0 5 F G
Rows are considered duplicates if the values of columns d, e and f together match other records in the dataframe subset=['d', 'e', 'f']. For example, rows 1 and 2 are duplicates, rows 3 and 4 are duplicates, and rows 6 and 7 are duplicates. The selection of which row to drop is based on column b.
If the value in column b begins with ss for both duplicates (rows 1 and 2), then anyone can be dropped
If one of the duplicates begins with ss and the other begins with a different format (rows 3 and 4), then the one that begins with ss should be kept.
If both duplicates in column b begin with anything other than ss (rows 6 and 7), then anyone can be selected.
Therefore, the expected output should be something like this:
idx a b c d e f
2 3 ss7 0 25 A B
3 5 ss5 0 12 C D
5 5 ss8 0 50 C K
7 8 ix6 0 5 F G
Sort by b key first (everything starts by 'ss' is moved to the end) then drop duplicates from ['d', 'e', 'f'] (keep the last):
out = (df.sort_values('b', key=lambda x: x.str.startswith('ss'))
.drop_duplicates(['d', 'e', 'f'], keep='last').sort_index())
# OR
out = (df.sort_values('b', key=lambda x: x.str.startswith('ss'))
.groupby(['d', 'e', 'f'], as_index=False).nth(-1).sort_index())
Output:
>>> out
idx a b c d e f
1 2 3 ss7 0 25 A B
2 3 5 ss5 0 12 C D
4 5 5 ss8 0 50 C K
6 7 8 ix6 0 5 F G
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With