When I drop John
as duplicate specifying 'name' as the column name:
import pandas as pd
data = {'name':['Bill','Steve','John','John','John'], 'age':[21,28,22,30,29]}
df = pd.DataFrame(data)
df = df.drop_duplicates('name')
pandas drops all matching entities leaving the left-most:
age name
0 21 Bill
1 28 Steve
2 22 John
Instead I would like to keep the row where John's age is the highest (in this example it is the age 30. How to achieve this?
To remove duplicates of only one or a subset of columns, specify subset as the individual column or list of columns that should be unique. To do this conditional on a different column's value, you can sort_values(colname) and specify keep equals either first or last .
Pandas drop_duplicates() Function Syntax keep: allowed values are {'first', 'last', False}, default 'first'. If 'first', duplicate rows except the first one is deleted. If 'last', duplicate rows except the last one is deleted. If False, all the duplicate rows are deleted.
Try this:
In [75]: df
Out[75]:
age name
0 21 Bill
1 28 Steve
2 22 John
3 30 John
4 29 John
In [76]: df.sort_values('age').drop_duplicates('name', keep='last')
Out[76]:
age name
0 21 Bill
1 28 Steve
3 30 John
or this depending on your goals:
In [77]: df.drop_duplicates('name', keep='last')
Out[77]:
age name
0 21 Bill
1 28 Steve
4 29 John
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With