I have data similar to:
id value duplicate
a 200 yes
a 12 yes
b 42 yes
c 12 no
b 532 yes
b 21 yes
...
To track the duplicates I use df['duplicate'] = df.duplicated('id', keep=False)
However, I would like to keep the ones with the highest value
and either mark or drop the other duplicates. Any suggestions?
Indicate duplicate index values. Duplicated values are indicated as True values in the resulting array. Either all duplicates, all except the first, or all except the last occurrence of duplicates can be indicated. The keep parameter controls which value or values in a set of duplicates are marked.
subset: Only consider certain columns for identifying duplicates; by default, use all of the columns. keep: Determines which duplicates (if any) to keep. first: Drop duplicates except for the first occurrence.
You can use the duplicated() function to find duplicate values in a pandas DataFrame.
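As a rough illustration of how the keep argument changes what duplicated() marks, here is a small sketch built on the sample rows from the question (the variable names first, last and all_dups are made up for this example):

```python
import pandas as pd

# Sample rows taken from the question
df = pd.DataFrame({
    "id": ["a", "a", "b", "c", "b", "b"],
    "value": [200, 12, 42, 12, 532, 21],
})

# keep="first": every repeat of an id after its first occurrence is True
first = df.duplicated("id", keep="first").tolist()

# keep="last": every occurrence of an id before its last one is True
last = df.duplicated("id", keep="last").tolist()

# keep=False: every row whose id appears more than once is True
all_dups = df.duplicated("id", keep=False).tolist()

print(first)     # [False, True, False, False, True, True]
print(last)      # [True, False, True, False, True, False]
print(all_dups)  # [True, True, True, False, True, True]
```

Note that none of these options looks at the value column; which row counts as "first" or "last" depends only on row order.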
Ah I don't know why I didn't think of this first.
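# sort so that the highest value comes first within each id
# (df.sort was removed from pandas; sort_values returns a new DataFrame)
df = df.sort_values(['id', 'value'], ascending=[True, False])
df['is_duplicated'] = df.duplicated('id', keep='first')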
sorry!
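For completeness, a minimal sketch of both options asked about in the question (marking and dropping), using the sample rows from the question. The names best, marked and deduped are made up here, and the groupby/idxmax route is just one possible alternative to sorting, not necessarily what the answer above had in mind:

```python
import pandas as pd

# Sample rows taken from the question
df = pd.DataFrame({
    "id": ["a", "a", "b", "c", "b", "b"],
    "value": [200, 12, 42, 12, 532, 21],
})

# Option 1: mark. idxmax gives the index label of the highest value per id;
# every other row is flagged, and the original row order is left alone.
best = df.groupby("id")["value"].idxmax()
marked = df.copy()
marked["is_duplicated"] = ~marked.index.isin(best)

# Option 2: drop. Sort so the highest value comes first within each id,
# then keep only the first row per id.
deduped = (
    df.sort_values(["id", "value"], ascending=[True, False])
      .drop_duplicates("id", keep="first")
      .reset_index(drop=True)
)

print(marked)
print(deduped)
```

If two rows of the same id tie for the highest value, both approaches keep only one of them (idxmax picks the first occurrence), so a different rule would be needed to keep ties.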