(pandas) Drop duplicates based on subset where order doesn't matter

Tags: pandas

What is the proper way to go from this df:

>>> df=pd.DataFrame({'a':['jeff','bob','jill'], 'b':['bob','jeff','mike']})
>>> df
      a     b
0  jeff   bob
1   bob  jeff
2  jill  mike

To this:

>>> df2
      a     b
0  jeff   bob
2  jill  mike

where you're dropping a duplicate row based on the items in 'a' and 'b', without regard to their specific column.

I can hack together a solution using a lambda expression to create a mask and then drop duplicates based on the mask column, but I'm thinking there has to be a simpler way than this:

>>> df['c'] = df[['a', 'b']].apply(lambda x: ''.join(
...     sorted((x[0], x[1]), key=lambda x: x[0]) +
...     sorted((x[0], x[1]), key=lambda x: x[1])), axis=1)
>>> df.drop_duplicates(subset='c', keep='first', inplace=True)
>>> df = df.iloc[:,:-1]
asked Jun 28 '17 by yobogoya

2 Answers

I think you can sort each row independently and then use duplicated to see which ones to drop.

dupes = df.apply(lambda x: x.sort_values().values, axis=1).duplicated()
df[~dupes]
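As a self-contained sketch of the same idea (with one variation of mine: building a tuple per row instead of an array, so the result stays hashable even on newer pandas, where axis=1 list-like results come back as a Series):

```python
import pandas as pd

df = pd.DataFrame({'a': ['jeff', 'bob', 'jill'], 'b': ['bob', 'jeff', 'mike']})

# Sort each row's values and wrap them in a (hashable) tuple,
# then flag rows whose sorted contents have already been seen.
dupes = df.apply(lambda row: tuple(sorted(row)), axis=1).duplicated()
print(df[~dupes])  # keeps rows 0 and 2
```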

A faster way to get dupes, thanks to @DSM:

dupes = df.T.apply(sorted).T.duplicated()
answered Nov 08 '22 by Ted Petrou


I think the simplest is to use apply with axis=1 to sort each row, then call DataFrame.duplicated:

df = df[~df.apply(sorted, 1).duplicated()]
print (df)
      a     b
0  jeff   bob
2  jill  mike

A bit more complicated, but very fast: use numpy.sort with the DataFrame constructor:

import numpy as np

df1 = pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns)
df = df[~df1.duplicated()]
print (df)
      a     b
0  jeff   bob
2  jill  mike
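For what it's worth, the np.sort route also generalizes to more than two columns; a small sketch with a made-up third column (my own example, not from the question):

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({'a': ['jeff', 'bob'],
                    'b': ['bob', 'jeff'],
                    'c': ['ann', 'ann']})

# Sorting each row's values makes ('jeff','bob','ann') and
# ('bob','jeff','ann') identical, so the second row is flagged.
key = pd.DataFrame(np.sort(df3.values, axis=1),
                   index=df3.index, columns=df3.columns)
print(df3[~key.duplicated()])  # keeps row 0 only
```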

Timings:

np.random.seed(123)
N = 10000
df = pd.DataFrame({'A': np.random.randint(100,size=N).astype(str),
                   'B': np.random.randint(100,size=N).astype(str)})
#print (df)

In [63]: %timeit (df[~pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns).duplicated()])
100 loops, best of 3: 3.25 ms per loop

In [64]: %timeit (df[~df.apply(sorted, 1).duplicated()])
1 loop, best of 3: 1.09 s per loop

#Ted Petrou solution1
In [65]: %timeit (df[~df.apply(lambda x: x.sort_values().values, axis=1).duplicated()])
1 loop, best of 3: 2.89 s per loop

#Ted Petrou solution2
In [66]: %timeit (df[~df.T.apply(sorted).T.duplicated()])
1 loop, best of 3: 1.56 s per loop
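As a sanity check (my own addition), the fast np.sort route and a Python-level per-row sort keep exactly the same rows on the benchmark data above:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
df = pd.DataFrame({'A': np.random.randint(100, size=N).astype(str),
                   'B': np.random.randint(100, size=N).astype(str)})

# Fast route: vectorized per-row sort, then duplicated() on the result.
fast = df[~pd.DataFrame(np.sort(df.values, axis=1),
                        index=df.index, columns=df.columns).duplicated()]

# Python-level route: tuple of sorted values per row.
slow = df[~df.apply(lambda r: tuple(sorted(r)), axis=1).duplicated()]

assert fast.equals(slow)
```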
answered Nov 08 '22 by jezrael