
Find symmetric pairs quickly in numpy

from itertools import product
import pandas as pd

df = pd.DataFrame.from_records(product(range(10), range(10)))
df = df.sample(90)
df.columns = "c1 c2".split()
df = df.sort_values(df.columns.tolist()).reset_index(drop=True)
#     c1  c2
# 0    0   0
# 1    0   1
# 2    0   2
# 3    0   3
# 4    0   4
# ..  ..  ..
# 85   9   4
# 86   9   5
# 87   9   7
# 88   9   8
# 89   9   9
# 
# [90 rows x 2 columns]

How do I quickly find, identify, and remove the last duplicate of all symmetric pairs in this data frame?

For example, '(0, 1)' and '(1, 0)' form a symmetric pair. The latter should be removed.

The solution must be fast, so a numpy-based approach is preferred. Converting to Python objects is not allowed.

The Unfun Cat asked Oct 28 '19 14:10



1 Answer

You can sort the values within each row, then group by the sorted pair:

import numpy as np
# Sort each row so that (1, 0) and (0, 1) share the same key (0, 1)
a = np.sort(df.to_numpy(), axis=1)
df.groupby([a[:,0], a[:,1]], as_index=False, sort=False).first()
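
To see why sorting each row helps, here is a minimal sketch on a toy frame. The data and the names `small` and `keys` are made up for illustration and are not part of the question. Sorting along axis=1 maps both orientations of a pair to the same key, so mirrored rows end up in the same group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: (1, 0) mirrors (0, 1), and (3, 2) mirrors (2, 3).
small = pd.DataFrame({"c1": [0, 1, 2, 3], "c2": [1, 0, 3, 2]})

# Sort within each row so both orientations of a pair share one key.
keys = np.sort(small.to_numpy(), axis=1)
print(keys.tolist())
```

Rows 0 and 1 get the same key, as do rows 2 and 3, which is what lets groupby or drop_duplicates treat them as one pair.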

Option 2: If there are many distinct (c1, c2) pairs, groupby can be slow. In that case, assign the sorted values as temporary columns and filter with drop_duplicates:

a = np.sort(df.to_numpy(), axis=1)

(df.assign(one=a[:,0], two=a[:,1])   # temporary key columns; the names are arbitrary
   .drop_duplicates(['one','two'])   # keeps the first row of each symmetric pair
   .reindex(df.columns, axis=1)      # restore the original columns only
)
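
The question also asks to find and identify the mirrored rows, not only remove them. One way to sketch this, assuming the frame has no exact duplicate rows (true for the sampled frame in the question), is to build a boolean mask from the sorted keys with `duplicated()`. The names `small`, `is_mirror`, `removed`, and `kept` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; (1, 0) is the later mirror of (0, 1).
small = pd.DataFrame({"c1": [0, 0, 1, 2, 2], "c2": [0, 1, 0, 3, 2]})

# Row-wise sorted keys, as in the answer above.
keys = np.sort(small.to_numpy(), axis=1)

# duplicated() flags every occurrence after the first (keep="first" is the default).
is_mirror = pd.DataFrame(keys, index=small.index).duplicated().to_numpy()

removed = small[is_mirror]   # the later member of each symmetric pair
kept = small[~is_mirror]     # everything else, original order preserved
print(removed)
print(kept)
```

Self-symmetric rows such as (0, 0) and (2, 2) are kept, because their key only occurs once. If the frame could contain exact duplicate rows, this mask would flag those as well.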
Quang Hoang answered Sep 18 '22 13:09