Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filtering duplicates from pandas dataframe with preference based on additional column

Tags:

python

pandas

I would like to filter rows containing a duplicate in column X from a dataframe. However, if there are duplicates for a value in X, I would like to give preference to one of them based on the value of another column Y. For example:

import pandas as pd
print pd.__version__
x = pd.DataFrame([
    ['best', 'a', 'x'],
    ['worst', 'b', 'y'],
    ['best', 'c', 'x'],
    ['worst','d', 'y'],
    ['best','d', 'y'],
    ['worst','d', 'y'],
    ['best','d', 'z'],
    ['best','d', 'z'],
], columns=['a', 'b', 'c'])
print x
x.drop_duplicates(cols='c', inplace=True)
print x

       a  b  c
0   best  a  x
1  worst  b  y
2   best  c  x
3  worst  d  y
4   best  d  y
5  worst  d  y
6   best  d  z
7   best  d  z

       a  b  c
0   best  a  x
1  worst  b  y
6   best  d  z

I would like to give precedence to the duplicate with column a equal to best. Which would give the result:

       a  b  c
0   best  a  x
4   best  d  y
6   best  d  z

Any idea what is the correct way to do this in pandas? Is there a more general way than just sorting the rows such that removing all but the first occurrence of the duplicate does what you want?

like image 654
dylkot Avatar asked Mar 17 '26 15:03

dylkot


2 Answers

I think a more straightforward way is to first sort the DataFrame, then drop duplicates keeping the first entry. This is pretty robust (here, 'a' was a string with two values but you could apply a function that makes an integer column from the string if there were more string values to sort).

x = x.sort_values(['a']).drop_duplicates(cols='c')
like image 69
Andy Reagan Avatar answered Mar 19 '26 04:03

Andy Reagan


I think using two groupby statements will get you what you want. With slightly modified input data:

x = pd.DataFrame([
    ['best', 'a', 'x'],
    ['worst', 'b', 'y'],
    ['best', 'c', 'x'],
    ['worst','d', 'y'],
    ['worst','d', 'y'],
    ['best','d', 'y'],
    ['best','d', 'z'],
    ['best','d', 'z'],
], columns=['a', 'b', 'c'])

x.groupby(by=['c']) \
 .filter(lambda g: g['a'] == 'best') \
 .groupby(by=['b'], as_index=False) \
 .first() \
 .sort(axis=1)  # the columns get out of order in the second groupby

Which returns:

   b     a  c
0  a  best  x
1  c  best  x
2  d  best  z

It's still not 100% clear where this needs to go with your ambiguous example input/output. But I think we're getting close.

like image 41
Paul H Avatar answered Mar 19 '26 04:03

Paul H



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!