find rows that share values

Question

I have a pandas dataframe that looks like this:

df = pd.DataFrame({'name': ['bob', 'tim', 'jane', 'john', 'andy'], 'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'], ['wendys', 'kfc'], ['tacobell', 'innout']]})

-------------------------------
name |         favefood
-------------------------------
bob  | ['kfc', 'mcd', 'wendys']
tim  | ['mcd']
jane | ['mcd', 'popeyes']
john | ['wendys', 'kfc']
andy | ['tacobell', 'innout']

For each person, I want to count how many other people have at least one favefood in common with them, i.e. how many other people's favefood lists have a non-empty intersection with their own.

The resulting dataframe would look like this:

------------------------------
name |         overlap
------------------------------
bob  |            3
tim  |            2
jane |            2
john |            1
andy |            0

The problem is that I have about 2 million rows of data. The only way I can think of doing this is a nested for-loop, i.e. for each person, go through the entire dataframe to see what overlaps, which would be extremely inefficient. Is there any way to do this more efficiently using pandas notation? Thanks!
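
For reference, the nested-loop approach described above might look like the following minimal sketch. It assumes the sample dataframe from the question; the names `food_sets` and `overlap` are illustrative. It compares every pair of rows, so it is only practical for small inputs, but it is handy for checking faster approaches against.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['bob', 'tim', 'jane', 'john', 'andy'],
    'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'],
                 ['wendys', 'kfc'], ['tacobell', 'innout']],
})

# Convert each list to a set once so intersections are cheap to test.
food_sets = [set(foods) for foods in df['favefood']]

# For each person, count the other people whose set intersects theirs.
overlap = [
    sum(1 for j, other in enumerate(food_sets) if j != i and mine & other)
    for i, mine in enumerate(food_sets)
]

df['overlap'] = overlap
print(df[['name', 'overlap']])
```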

BENY · Accepted Answer

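Logic behind it: build a 0/1 indicator matrix with one row per person and one column per food, then multiply it by its transpose. Entry (i, j) of the product is the number of foods persons i and j share, so counting the non-zero entries in each row and subtracting 1 (each person matches themselves) gives the overlap.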

s=df['favefood'].explode().str.get_dummies().groupby(level=0).sum()
s.dot(s.T).ne(0).sum(axis=1)-1
Out[84]: 
0    3
1    2
2    2
3    1
4    0
dtype: int64
df['overlap']=s.dot(s.T).ne(0).sum(axis=1)-1
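
To check the intermediate steps on the sample data, a minimal sketch along these lines can help. It assumes a reasonably recent pandas; the names `indicator` and `shared` are illustrative and not part of the original answer, and `pd.get_dummies` is used here in place of `.str.get_dummies()` purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['bob', 'tim', 'jane', 'john', 'andy'],
    'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'],
                 ['wendys', 'kfc'], ['tacobell', 'innout']],
})

# One row per (person, food) pair; the original row label is kept as the index.
exploded = df['favefood'].explode()

# Person-by-food indicator matrix: collapse the exploded rows back per person.
indicator = pd.get_dummies(exploded).astype(int).groupby(level=0).sum()

# shared.loc[i, j] is the number of foods persons i and j have in common.
shared = indicator.dot(indicator.T)
print(shared)

# Non-zero entries per row, minus the person's match with themselves.
overlap = shared.ne(0).sum(axis=1) - 1
print(overlap)
```

The diagonal of `shared` holds each person's own food count, which is why 1 is subtracted at the end. That subtraction assumes every person has at least one food; a row with an empty list would come out as -1.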

Method from sklearn

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s=pd.DataFrame(mlb.fit_transform(df['favefood']),columns=mlb.classes_, index=df.index)

s.dot(s.T).ne(0).sum(axis=1)-1

0    3
1    2
2    2
3    1
4    0
dtype: int64
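
Note that `s.dot(s.T)` builds a dense n-by-n matrix, which will not fit in memory for about 2 million rows. One possible workaround, sketched here under assumptions rather than taken from the original answer, is to do the same product with `scipy.sparse`. The sketch assumes `df` has the default 0..n-1 index, and the names `indicator`, `shared` and `has_food` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame({
    'name': ['bob', 'tim', 'jane', 'john', 'andy'],
    'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'],
                 ['wendys', 'kfc'], ['tacobell', 'innout']],
})

# One entry per (person, food) pair; empty lists explode to NaN and are dropped.
exploded = df['favefood'].explode().dropna()

# Map each food to an integer column id.
food_codes, foods = pd.factorize(exploded)
rows = exploded.index.to_numpy()  # assumes the default RangeIndex

# Sparse person-by-food indicator matrix.
indicator = sparse.csr_matrix(
    (np.ones(len(rows), dtype=np.int32), (rows, food_codes)),
    shape=(len(df), len(foods)),
)

# Sparse person-by-person matrix of shared-food counts.
shared = indicator @ indicator.T

# Stored entries per row = people sharing a food, including the person
# themselves whenever they have at least one food.
has_food = indicator.getnnz(axis=1) > 0
overlap = shared.getnnz(axis=1) - has_food.astype(int)

df['overlap'] = overlap
print(df[['name', 'overlap']])
```

This only helps when the sharing is itself sparse: a food liked by k people still contributes on the order of k² entries to `shared`, so a very popular food can make the product large again.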

Tags:

python

pandas

dataframe

intersection

Andrew Louis
