Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas drop duplicates if reverse is present between two columns

Tags:

python

pandas

I have a dataset with 2 columns like the following...

InteractorA InteractorB
AGAP028204  AGAP005846
AGAP028204  AGAP003428
AGAP028200  AGAP011124
AGAP028200  AGAP004335
AGAP028200  AGAP011356
AGAP028194  AGAP008414

I'm using Pandas and I want to drop rows which are present twice but simply reversed like the following... from this...

InteractorA InteractorB
AGAP002741  AGAP008026
AGAP008026  AGAP002741

To this...

InteractorA InteractorB
AGAP002741  AGAP008026

As they are for all intents and purposes the same thing.

Is there a built in method to handle this?

like image 709
BML91 Avatar asked Jul 10 '14 12:07

BML91


1 Answers

I ended up making a hacky script which iterates over the rows and the necessary pieces of data and checks whether the concatenate appears or if its reverse appears and drops row indexes as appropriate.

import pandas as pd

checklist = []
indexes_to_drop = []

interactions = pd.read_csv('original_interactions.txt', delimiter = '\t')

for index, row in interactions.iterrows():
    check_string = row['InteractorA'] + row['InteractorB']
    check_string_rev = row['InteractorB'] + row['InteractorA']
    if (check_string or check_string_rev) in checklist:
        indexes_to_drop.append(index)
    else:
        pass
    checklist.append(check_string)
    checklist.append(check_string_rev)

no_dups = interactions.drop(interactions.index[indexes_to_drop])

print no_dups.shape

no_dups.to_csv('no_duplicates.txt',sep='\t',index = False)

2017 EDIT: a few years on, with a bit more experience, this is a much more elegant solution for anyone looking for something similar:

In [8]: df
Out[8]:
  InteractorA InteractorB
0  AGAP028204  AGAP005846
1  AGAP028204  AGAP003428
2  AGAP028200  AGAP011124
3  AGAP028200  AGAP004335
4  AGAP028200  AGAP011356
5  AGAP028194  AGAP008414
6  AGAP002741  AGAP008026
7  AGAP008026  AGAP002741

In [18]: df['check_string'] = df.apply(lambda row: ''.join(sorted([row['InteractorA'], row['InteractorB']])), axis=1)

In [19]: df
Out[19]:
  InteractorA InteractorB          check_string
0  AGAP028204  AGAP005846  AGAP005846AGAP028204
1  AGAP028204  AGAP003428  AGAP003428AGAP028204
2  AGAP028200  AGAP011124  AGAP011124AGAP028200
3  AGAP028200  AGAP004335  AGAP004335AGAP028200
4  AGAP028200  AGAP011356  AGAP011356AGAP028200
5  AGAP028194  AGAP008414  AGAP008414AGAP028194
6  AGAP002741  AGAP008026  AGAP002741AGAP008026
7  AGAP008026  AGAP002741  AGAP002741AGAP008026

In [20]: df.drop_duplicates('check_string')
Out[20]:
  InteractorA InteractorB          check_string
0  AGAP028204  AGAP005846  AGAP005846AGAP028204
1  AGAP028204  AGAP003428  AGAP003428AGAP028204
2  AGAP028200  AGAP011124  AGAP011124AGAP028200
3  AGAP028200  AGAP004335  AGAP004335AGAP028200
4  AGAP028200  AGAP011356  AGAP011356AGAP028200
5  AGAP028194  AGAP008414  AGAP008414AGAP028194
6  AGAP002741  AGAP008026  AGAP002741AGAP008026
like image 106
BML91 Avatar answered Oct 01 '22 08:10

BML91