Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python pandas: Remove duplicates by columns A, keeping the row with the highest value in column B

People also ask

How do I remove duplicate rows and keep the highest value only?

1. If you want to remove all duplicates but leave the highest ones, you can apply this formula =MAX(IF($A$2:$A$12=D2,$B$2:$B$12)), remember to press Shift + Ctrl + Enter keys. 2. In the above formulas, A2:A12 is the original list you need to remove duplicates from.

How do you delete rows based on duplicates in one column in Python?

To remove duplicates of only one or a subset of columns, specify subset as the individual column or list of columns that should be unique. To do this conditional on a different column's value, you can sort_values(colname) and specify keep equals either first or last .

How do I delete duplicate rows based on multiple columns in pandas?

Delete Duplicate Rows based on Specific Columns To delete duplicate rows on the basis of multiple columns, specify all column names as a list. You can set 'keep=False' in the drop_duplicates() function to remove all the duplicate rows.

How do you drop duplicate rows in pandas based on a column?

Pandas drop_duplicates function has an argument to specify which columns we need to use to identify duplicates. For example, to remove duplicate rows using the column 'continent', we can use the argument “subset” and specify the column name we want to identify duplicate.


This takes the last. Not the maximum though:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

You can do also something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10

The top answer is doing too much work and looks to be very slow for larger data sets. apply is slow and should be avoided if possible. ix is deprecated and should be avoided as well.

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

Or simply group by all the other columns and take the max of the column you need. df.groupby('A', as_index=False).max()


Simplest solution:

To drop duplicates based on one column:

df = df.drop_duplicates('column_name', keep='last')

To drop duplicates based on multiple columns:

df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')

I would sort the dataframe first with Column B descending, then drop duplicates for Column A and keep first

df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")

without any groupby


Try this:

df.groupby(['A']).max()

I think in your case you don't really need a groupby. I would sort by descending order your B column, then drop duplicates at column A and if you want you can also have a new nice and clean index like that:

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)