python pandas: Remove duplicates by columns A, keeping the row with the highest value in column B

People also ask

How do you delete rows based on duplicates in one column in Python?

To remove duplicates based on only one column or a subset of columns, pass subset as the individual column or the list of columns that should be unique. To make the choice of surviving row depend on a different column's value, call sort_values(colname) first and then pass keep='first' or keep='last' to drop_duplicates().
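
A minimal sketch of that sort-then-drop pattern. The sample frame and the column names A and B are assumptions made for illustration; they are not taken from the text above.

```python
import pandas as pd

# Hypothetical sample data: column A has duplicates, column B holds the values to compare.
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [20, 10, 30, 40, 10]})

# Sort ascending by B so that, within each A, the largest B comes last.
by_b = df.sort_values("B")

# keep="last" retains the row with the largest B per A; keep="first" the smallest.
highest = by_b.drop_duplicates(subset="A", keep="last").sort_index()
lowest = by_b.drop_duplicates(subset="A", keep="first").sort_index()
```

Here highest keeps the rows with B equal to 20, 40 and 10, while lowest keeps 10, 30 and 10. The trailing sort_index() only restores the original row order.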

How do I delete duplicate rows based on multiple columns in pandas?

To delete duplicate rows based on multiple columns, pass all the column names as a list. Setting keep=False in drop_duplicates() removes every row that has a duplicate, instead of keeping one row from each group.
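
As a rough illustration of the difference between keeping one row per key and keep=False, here is a small sketch. The frame and its column names (first, second, value) are hypothetical.

```python
import pandas as pd

# Hypothetical frame; the column names are illustrative only.
df = pd.DataFrame({
    "first": ["a", "a", "b", "c"],
    "second": [1, 1, 2, 3],
    "value": [10, 20, 30, 40],
})

# Rows count as duplicates only when both listed columns match.
one_per_key = df.drop_duplicates(subset=["first", "second"], keep="first")

# keep=False drops every row whose key appears more than once.
unique_keys_only = df.drop_duplicates(subset=["first", "second"], keep=False)
```

With this data, one_per_key keeps the rows with value 10, 30 and 40, whereas unique_keys_only keeps only 30 and 40 because both ("a", 1) rows are discarded.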

How do you drop duplicate rows in pandas based on a column?

The pandas drop_duplicates function has an argument that specifies which columns are used to identify duplicates. For example, to remove duplicate rows based on the column 'continent', pass that column name as the subset argument.

This takes the last row, not the maximum, though:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

You can also do something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10
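
A variant of the same idea that skips apply by calling idxmax on the grouped column directly. This is a sketch, not the answer's own code; the sample frame is an assumption chosen to be consistent with the outputs shown above, and it relies on the index labels being unique.

```python
import pandas as pd

# Assumed sample data, consistent with the outputs shown above.
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})

# idxmax returns, for each group in A, the index label of the row with the largest B.
winners = df.groupby("A")["B"].idxmax()

# Select those rows; the original index labels are preserved.
result = df.loc[winners]
```

Unlike the apply version, the result keeps the original index (1, 3, 4) rather than being indexed by A.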

The top answer does too much work and is likely to be very slow on larger data sets. apply is slow and should be avoided where possible. ix is deprecated (it was removed in pandas 1.0) and should be avoided as well.

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

Or simply group by all the other columns and take the max of the column you need:

df.groupby('A', as_index=False).max()
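
One thing worth checking before relying on max(): it is computed column by column, so with more than two columns the result may combine values from different rows. The sketch below uses a hypothetical extra column C to show the difference from the sort-and-drop approach.

```python
import pandas as pd

# Hypothetical frame with an extra column C alongside A and B.
df = pd.DataFrame({
    "A": [1, 1],
    "B": [10, 20],
    "C": [99, 5],
})

# max() is computed per column, so B and C can come from different rows.
per_column_max = df.groupby("A", as_index=False).max()

# Sorting and dropping duplicates keeps one whole original row instead.
whole_row = df.sort_values("B", ascending=False).drop_duplicates("A")
```

Here per_column_max pairs B = 20 with C = 99, a combination that never occurs in the data, while whole_row keeps the original row with B = 20 and C = 5. With only columns A and B the two approaches give the same values.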

Simplest solution:

To drop duplicates based on one column:

df = df.drop_duplicates('column_name', keep='last')

To drop duplicates based on multiple columns:

df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')

I would first sort the dataframe by column B in descending order, then drop duplicates on column A and keep the first row:

df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")

This works without any groupby.

Try this:

df.groupby(['A']).max()

I think in your case you don't really need a groupby. I would sort column B in descending order, then drop duplicates on column A. If you want, you can also get a clean new index like this:

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)

Tags:

python

pandas

duplicates
