Filtering duplicates from pandas dataframe with preference based on additional column

Question

I would like to filter rows containing a duplicate in column X from a dataframe. However, if there are duplicates for a value in X, I would like to give preference to one of them based on the value of another column Y. For example:

import pandas as pd
print pd.__version__
x = pd.DataFrame([
    ['best', 'a', 'x'],
    ['worst', 'b', 'y'],
    ['best', 'c', 'x'],
    ['worst','d', 'y'],
    ['best','d', 'y'],
    ['worst','d', 'y'],
    ['best','d', 'z'],
    ['best','d', 'z'],
], columns=['a', 'b', 'c'])
print x
x.drop_duplicates(cols='c', inplace=True)
print x

       a  b  c
0   best  a  x
1  worst  b  y
2   best  c  x
3  worst  d  y
4   best  d  y
5  worst  d  y
6   best  d  z
7   best  d  z

       a  b  c
0   best  a  x
1  worst  b  y
6   best  d  z

I would like to give precedence to the duplicate with column a equal to best. Which would give the result:

       a  b  c
0   best  a  x
4   best  d  y
6   best  d  z

Any idea what is the correct way to do this in pandas? Is there a more general way than just sorting the rows such that removing all but the first occurrence of the duplicate does what you want?

Andy Reagan · Accepted Answer

I think a more straightforward way is to first sort the DataFrame, then drop duplicates keeping the first entry. This is pretty robust (here, 'a' was a string with two values but you could apply a function that makes an integer column from the string if there were more string values to sort).

x = x.sort_values(['a']).drop_duplicates(cols='c')

Paul H · Answer

I think using two groupby statements will get you what you want. With slightly modified input data:

x = pd.DataFrame([
    ['best', 'a', 'x'],
    ['worst', 'b', 'y'],
    ['best', 'c', 'x'],
    ['worst','d', 'y'],
    ['worst','d', 'y'],
    ['best','d', 'y'],
    ['best','d', 'z'],
    ['best','d', 'z'],
], columns=['a', 'b', 'c'])

x.groupby(by=['c']) \
 .filter(lambda g: g['a'] == 'best') \
 .groupby(by=['b'], as_index=False) \
 .first() \
 .sort(axis=1)  # the columns get out of order in the second groupby

Which returns:

   b     a  c
0  a  best  x
1  c  best  x
2  d  best  z

It's still not 100% clear where this needs to go with your ambiguous example input/output. But I think we're getting close.

Filtering duplicates from pandas dataframe with preference based on additional column

Tags:

python

pandas

dylkot

2 Answers

Andy Reagan

Paul H

Recent Activity

Donate For Us

Filtering duplicates from pandas dataframe with preference based on additional column

Tags:

python

pandas

dylkot

2 Answers

Andy Reagan

Paul H

Related questions

Recent Activity

Donate For Us