
Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C

I have a pandas dataframe which contains duplicate values according to two columns (A and B):

A  B  C
1  2  1
1  2  4
2  7  1
3  4  0
3  4  8

I want to remove the duplicates, keeping the row with the max value in column C. This would lead to:

A  B  C
1  2  4
2  7  1
3  4  8

I cannot figure out how to do that. Should I use drop_duplicates(), or something else?

asked Aug 19 '15 by Elsalex



2 Answers

You can do it using groupby:

c_maxes = df.groupby(['A', 'B']).C.transform(max)
df = df.loc[df.C == c_maxes]

c_maxes is a Series of the maximum value of C in each group, but broadcast to the same length and with the same index as df. If you haven't used .transform before, printing c_maxes is a good way to see how it works.
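To make this concrete, here is a minimal runnable sketch of the transform approach, using the sample data from the question (the DataFrame construction is added here for completeness):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8]})

# per-row maximum of C within each (A, B) group,
# aligned with df's original index
c_maxes = df.groupby(['A', 'B']).C.transform('max')

# keep only the rows where C equals the group maximum
result = df.loc[df.C == c_maxes]
print(result)
```

This keeps exactly the three rows from the expected output, with the original index preserved.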

Another approach using drop_duplicates would be

df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)

I'm not sure which is more efficient, but I'd guess the first approach, since it doesn't involve sorting.

EDIT: From pandas 0.18 onwards the second solution would be

df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last') 

or, alternatively,

df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B']) 
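As a quick sanity check, the sort-then-drop variant can be run on the question's sample data (again, the DataFrame construction is added here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8]})

# sort so the largest C in each (A, B) group comes last,
# then keep only that last row of each duplicate group
result = df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

# restore the original row order for readability
print(result.sort_index())
```

Note that drop_duplicates keeps whichever duplicate it encounters last in the sorted frame, which is why the sort by C is essential here.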

In any case, the groupby solution seems to be significantly faster:

%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.transform(max) == df.C]
10 loops, best of 3: 25.7 ms per loop

%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop
answered Oct 03 '22 by JoeCondron


You can do this simply by using the pandas drop_duplicates function, sorting by C first so that the row kept by keep='last' is the one with the maximum value:

df.sort_values('C').drop_duplicates(['A', 'B'], keep='last')
answered Oct 03 '22 by Sudharsan