I have a pandas dataframe which contains duplicates values according to two columns (A and B):
A B C 1 2 1 1 2 4 2 7 1 3 4 0 3 4 8
I want to remove duplicates keeping the row with max value in column C. This would lead to:
A B C 1 2 4 2 7 1 3 4 8
I cannot figure out how to do that. Should I use drop_duplicates()
, something else?
By using pandas. DataFrame. drop_duplicates() method you can drop/remove/delete duplicate rows from DataFrame. Using this method you can drop duplicate rows on selected multiple columns or all columns.
(1) Select Fruit column (which you will remove duplicates rows by), and then click the Primary Key button; (2) Select the Amount column (Which you will keep highest values in), and then click Calculate > Max. (3) Specify combination rules for other columns as you need. 3.
To drop duplicate columns from pandas DataFrame use df. T. drop_duplicates(). T , this removes all columns that have the same data regardless of column names.
1) Select a cell in the range => Data tab => Data Tools ribbon => click on the Remove Duplicates command button. 2) 'Remove Duplicates' dialog box appears. All the columns are by default selected.
You can do it using group by:
c_maxes = df.groupby(['A', 'B']).C.transform(max) df = df.loc[df.C == c_maxes]
c_maxes
is a Series
of the maximum values of C
in each group but which is of the same length and with the same index as df
. If you haven't used .transform
then printing c_maxes
might be a good idea to see how it works.
Another approach using drop_duplicates
would be
df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)
Not sure which is more efficient but I guess the first approach as it doesn't involve sorting.
EDIT: From pandas 0.18
up the second solution would be
df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
or, alternatively,
df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])
In any case, the groupby
solution seems to be significantly more performing:
%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.max == df.C] 10 loops, best of 3: 25.7 ms per loop %timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last') 10 loops, best of 3: 101 ms per loop
You can do this simply by using pandas drop duplicates function
df.drop_duplicates(['A','B'],keep= 'last')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With