
Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C

I have a pandas dataframe which contains duplicate values according to two columns (A and B):

A  B  C
1  2  1
1  2  4
2  7  1
3  4  0
3  4  8

I want to remove the duplicates, keeping the row with the max value in column C. This would lead to:

A  B  C
1  2  4
2  7  1
3  4  8

I cannot figure out how to do that. Should I use drop_duplicates(), or something else?

asked Aug 19 '15 by Elsalex



2 Answers

You can do it using groupby:

c_maxes = df.groupby(['A', 'B']).C.transform(max)
df = df.loc[df.C == c_maxes]

c_maxes is a Series of the maximum value of C in each group, but broadcast to the same length and with the same index as df. If you haven't used .transform before, printing c_maxes is a good way to see how it works.
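To make this concrete, here is a minimal runnable sketch of the transform approach, using the sample data from the question (the DataFrame construction is added here for completeness):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8]})

# per-row maximum of C within each (A, B) group,
# aligned with df's original index
c_maxes = df.groupby(['A', 'B']).C.transform('max')

# keep only the rows where C equals the group maximum
result = df.loc[df.C == c_maxes]
print(result)
```

This keeps exactly the three rows from the expected output, with the original index preserved.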

Another approach using drop_duplicates would be

df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)

I'm not sure which is more efficient, but I'd guess the first approach, since it doesn't involve sorting.

EDIT: From pandas 0.18 onwards the second solution would be

df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last') 

or, alternatively,

df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B']) 
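As a quick sanity check, the sort-then-drop variant can be run on the question's sample data (again, the DataFrame construction is added here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8]})

# sort so the largest C in each (A, B) group comes last,
# then keep only that last row of each duplicate group
result = df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

# restore the original row order for readability
print(result.sort_index())
```

Note that drop_duplicates keeps whichever duplicate it encounters last in the sorted frame, which is why the sort by C is essential here.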

In any case, the groupby solution seems to be significantly faster:

%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.transform(max) == df.C]
10 loops, best of 3: 25.7 ms per loop

%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop
answered Oct 03 '22 by JoeCondron


You can do this simply by using the pandas drop_duplicates function, sorting by C first so that the row kept by keep='last' is the one with the maximum value:

df.sort_values('C').drop_duplicates(['A', 'B'], keep='last')
answered Oct 03 '22 by Sudharsan