I am trying to efficiently remove duplicates in Pandas where two rows count as duplicates when their values in two columns are inverted. For example, in this data frame:
import pandas as pd

df = pd.DataFrame({'p1': ['a', 'b', 'a', 'a', 'b', 'd', 'c'],
                   'p2': ['b', 'a', 'c', 'd', 'c', 'a', 'b'],
                   'value': [1, 1, 2, 3, 5, 3, 5]},
                  columns=['p1', 'p2', 'value'])
print(df)
p1 p2 value
0 a b 1
1 b a 1
2 a c 2
3 a d 3
4 b c 5
5 d a 3
6 c b 5
I would want to remove rows 1, 5 and 6, leaving me with just:
p1 p2 value
0 a b 1
2 a c 2
3 a d 3
4 b c 5
Thanks in advance for ideas on how to do this.
Reorder the p1 and p2 values within each row so that every pair appears in a canonical (sorted) order:
mask = df['p1'] < df['p2']
df['first'] = df['p1'].where(mask, df['p2'])    # the smaller of the two values
df['second'] = df['p2'].where(mask, df['p1'])   # the larger of the two values
yields
In [149]: df
Out[149]:
p1 p2 value first second
0 a b 1 a b
1 b a 1 a b
2 a c 2 a c
3 a d 3 a d
4 b c 5 b c
5 d a 3 a d
6 c b 5 b c
Then you can drop_duplicates:
df = df.drop_duplicates(subset=['value', 'first', 'second'])
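By default drop_duplicates keeps the first occurrence of each ['value', 'first', 'second'] combination; if you want to keep the later row instead, or drop every duplicated pair entirely, the keep parameter controls that, e.g.:
df.drop_duplicates(subset=['value', 'first', 'second'], keep='last')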
Putting it all together:
import pandas as pd

df = pd.DataFrame({'p1': ['a', 'b', 'a', 'a', 'b', 'd', 'c'],
                   'p2': ['b', 'a', 'c', 'd', 'c', 'a', 'b'],
                   'value': [1, 1, 2, 3, 5, 3, 5]},
                  columns=['p1', 'p2', 'value'])

mask = df['p1'] < df['p2']
df['first'] = df['p1'].where(mask, df['p2'])    # the smaller of the two values
df['second'] = df['p2'].where(mask, df['p1'])   # the larger of the two values

df = df.drop_duplicates(subset=['value', 'first', 'second'])
df = df[['p1', 'p2', 'value']]   # drop the helper columns
yields
In [151]: df
Out[151]:
p1 p2 value
0 a b 1
2 a c 2
3 a d 3
4 b c 5
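For completeness, here is an equivalent variation of the same idea (my own sketch, not part of the original answer) that builds the canonical pair by sorting the two columns row-wise with NumPy; it assumes the p1/p2 values are mutually comparable, as strings are here:
import numpy as np
import pandas as pd

df = pd.DataFrame({'p1': ['a', 'b', 'a', 'a', 'b', 'd', 'c'],
                   'p2': ['b', 'a', 'c', 'd', 'c', 'a', 'b'],
                   'value': [1, 1, 2, 3, 5, 3, 5]},
                  columns=['p1', 'p2', 'value'])

# Sort each (p1, p2) pair within its row so inverted pairs produce identical keys.
pairs = np.sort(df[['p1', 'p2']].values, axis=1)
df['first'] = pairs[:, 0]
df['second'] = pairs[:, 1]

df = df.drop_duplicates(subset=['value', 'first', 'second'])[['p1', 'p2', 'value']]
print(df)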