
Find rows that share values

I have a pandas dataframe that looks like this:

df = pd.DataFrame({'name': ['bob', 'tim', 'jane', 'john', 'andy'], 'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'], ['wendys', 'kfc'], ['tacobell', 'innout']]})
-------------------------------
name |         favefood
-------------------------------
bob  | ['kfc', 'mcd', 'wendys']
tim  | ['mcd']
jane | ['mcd', 'popeyes']
john | ['wendys', 'kfc']
andy | ['tacobell', 'innout']

For each person, I want to count how many other people share at least one favefood with them, i.e., how many other people have a non-empty intersection with that person's list.

The resulting dataframe would look like this:

------------------------------
name |         overlap
------------------------------
bob  |            3
tim  |            2
jane |            2
john |            1
andy |            0 

The problem is that I have about 2 million rows of data. The only approach I can think of is a nested for-loop, i.e., for each person, scan the entire dataframe for overlaps, which would be extremely inefficient. Is there any way to do this more efficiently using pandas? Thanks!

Andrew Louis asked Dec 08 '25 08:12


1 Answer

Logic behind it

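# One-hot encode each food, then collapse the exploded rows back to one row per person.
# (sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent.)
s=df['favefood'].explode().str.get_dummies().groupby(level=0).sum()
# s.dot(s.T) counts shared foods for every pair of people; subtract 1 to drop each person's match with themselves.
s.dot(s.T).ne(0).sum(axis=1)-1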
Out[84]: 
0    3
1    2
2    2
3    1
4    0
dtype: int64
df['overlap']=s.dot(s.T).ne(0).sum(axis=1)-1
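To confirm what the matrix product computes, here is a minimal brute-force cross-check with Python sets. It assumes the small example frame from the question (with the name spelled 'tim' as in the table); the name `overlap_check` is only illustrative. This is the nested loop the question wants to avoid, so it is suitable for verifying a sample, not for 2 million rows.

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({
    'name': ['bob', 'tim', 'jane', 'john', 'andy'],
    'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'],
                 ['wendys', 'kfc'], ['tacobell', 'innout']],
})

# Turn each list into a set so intersection tests are cheap.
sets = [set(foods) for foods in df['favefood']]

# For each person, count the other people whose set intersects theirs.
overlap_check = [
    sum(1 for j, other in enumerate(sets) if j != i and mine & other)
    for i, mine in enumerate(sets)
]
```

The resulting list should match the `overlap` column produced by the dot-product approach.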

Method from sklearn

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s=pd.DataFrame(mlb.fit_transform(df['favefood']),columns=mlb.classes_, index=df.index)

s.dot(s.T).ne(0).sum(axis=1)-1

0    3
1    2
2    2
3    1
4    0
dtype: int64
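Both versions above build a dense people-by-people matrix with s.dot(s.T), which for about 2 million rows would have roughly 4 trillion cells. One possible alternative, sketched here under the assumption that no single food is shared by a very large fraction of people, is a self-join on the exploded foods, which only materializes pairs that actually share something. The column names `person` and `person_other` and the variable `overlap_join` are illustrative, not part of the original answer.

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({
    'name': ['bob', 'tim', 'jane', 'john', 'andy'],
    'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'],
                 ['wendys', 'kfc'], ['tacobell', 'innout']],
})

# One row per (person, food). Drop NaN rows produced by empty lists,
# because merge would otherwise match NaN keys with each other.
long = (df['favefood'].explode().dropna()
          .rename_axis('person').reset_index())

# Self-join on food: one row for every pair of people sharing that food.
pairs = long.merge(long, on='favefood', suffixes=('', '_other'))
pairs = pairs[pairs['person'] != pairs['person_other']]

# Count distinct other people per person; people with no match get 0.
overlap_join = (pairs.drop_duplicates(['person', 'person_other'])
                     .groupby('person').size()
                     .reindex(df.index, fill_value=0))
```

The join still grows quadratically within each food, so a food shared by k people contributes about k*k rows; whether this fits in memory depends on how popular the most common foods are in the real data.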
BENY answered Dec 10 '25 22:12


