I have a dataframe:
df = pd.DataFrame({
'exam': [
'French', 'English', 'German', 'Russian', 'Russian',
'German', 'German', 'French', 'English', 'French'
],
'student' : ['john', 'ted', 'jason', 'marc', 'peter', 'bob',
'robert', 'david', 'nik', 'kevin'
]
})
print (df)
exam student
0 French john
1 English ted
2 German jason
3 Russian marc
4 Russian peter
5 German bob
6 German robert
7 French david
8 English nik
9 French kevin
Does anybody know how to create a new dataframe containing two columns "student" and "student shared exam with".
I should get something like:
student shared_exam_with
0 john david
1 john kevin
2 ted nik
3 jason bob
4 jason robert
5 marc peter
6 peter marc
7 bob jason
8 bob robert
9 robert jason
10 robert bob
11 david john
12 david kevin
13 nik ted
14 kevin john
15 kevin david
For ex: John took French..and David and Kevin too!
self merge
df.merge(
df, on='exam',
suffixes=['', '_shared_with']
).query('student != student_shared_with')
exam student student_shared_with
1 French john david
2 French john kevin
3 French david john
5 French david kevin
6 French kevin john
7 French kevin david
10 English ted nik
11 English nik ted
14 German jason bob
15 German jason robert
16 German bob jason
18 German bob robert
19 German robert jason
20 German robert bob
23 Russian marc peter
24 Russian peter marc
self join
d1 = df.set_index('exam')
d1.join(
d1, rsuffix='_shared_with'
).query('student != student_shared_with')
student student_shared_with
exam
English ted nik
English nik ted
French john david
French john kevin
French david john
French david kevin
French kevin john
French kevin david
German jason bob
German jason robert
German bob jason
German bob robert
German robert jason
German robert bob
Russian marc peter
Russian peter marc
itertools.permutations
+ groupby
from itertools import permutations as perm
cols = ['student', 'student_shared_with']
df.groupby('exam').student.apply(
lambda x: pd.DataFrame(list(perm(x, 2)), columns=cols)
).reset_index(drop=True)
student student_shared_with
0 ted nik
1 nik ted
2 john david
3 john kevin
4 david john
5 david kevin
6 kevin john
7 kevin david
8 jason bob
9 jason robert
10 bob jason
11 bob robert
12 robert jason
13 robert bob
14 marc peter
15 peter marc
One way would be:
cross = pd.crosstab(df['student'], df['exam'])
res = cross.dot(cross.T)
res.where(np.triu(res, k=1).astype('bool')).stack()
Out:
student student
bob jason 1.0
robert 1.0
david john 1.0
kevin 1.0
jason robert 1.0
john kevin 1.0
marc peter 1.0
nik ted 1.0
dtype: float64
The dot product generates a binary matrix for the co occurrences. In order not to repeat the same pairs, I filter them with where and stack. The index of the resulting Series is the students that have the same exam.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With