Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Interaction between rows and columns in dataframe

I have a dataframe:

df = pd.DataFrame({
    'exam': [
        'French', 'English', 'German', 'Russian', 'Russian',
        'German', 'German', 'French', 'English', 'French'
    ],

'student' : ['john', 'ted', 'jason', 'marc', 'peter', 'bob',
            'robert', 'david', 'nik', 'kevin'
]
})

print (df)

              exam   student   
    0       French    john     
    1       English   ted        
    2       German    jason         
    3       Russian   marc         
    4       Russian   peter         
    5       German    bob         
    6       German    robert         
    7       French    david         
    8       English   nik          
    9       French    kevin         

Does anybody know how to create a new dataframe containing two columns "student" and "student shared exam with".

I should get something like:

                student   shared_exam_with      
        0       john       david                   
        1       john       kevin            
        2       ted        nik                    
        3       jason      bob                 
        4       jason      robert                   
        5       marc       peter              
        6       peter      marc             
        7       bob        jason                    
        8       bob        robert                    
        9       robert     jason                      
       10       robert     bob                   
       11       david      john             
       12       david      kevin                      
       13       nik        ted                     
       14       kevin      john                     
       15       kevin      david                   

For ex: John took French..and David and Kevin too!

like image 852
Sheron Avatar asked Dec 13 '22 23:12

Sheron


2 Answers

self merge

df.merge(
    df, on='exam',
    suffixes=['', '_shared_with']
).query('student != student_shared_with')

       exam student student_shared_with
1    French    john               david
2    French    john               kevin
3    French   david                john
5    French   david               kevin
6    French   kevin                john
7    French   kevin               david
10  English     ted                 nik
11  English     nik                 ted
14   German   jason                 bob
15   German   jason              robert
16   German     bob               jason
18   German     bob              robert
19   German  robert               jason
20   German  robert                 bob
23  Russian    marc               peter
24  Russian   peter                marc

self join

d1 = df.set_index('exam')
d1.join(
    d1, rsuffix='_shared_with'
).query('student != student_shared_with')

        student student_shared_with
exam                               
English     ted                 nik
English     nik                 ted
French     john               david
French     john               kevin
French    david                john
French    david               kevin
French    kevin                john
French    kevin               david
German    jason                 bob
German    jason              robert
German      bob               jason
German      bob              robert
German   robert               jason
German   robert                 bob
Russian    marc               peter
Russian   peter                marc

itertools.permutations + groupby

from itertools import permutations as perm

cols = ['student', 'student_shared_with']
df.groupby('exam').student.apply(
    lambda x: pd.DataFrame(list(perm(x, 2)), columns=cols)
).reset_index(drop=True)

   student student_shared_with
0      ted                 nik
1      nik                 ted
2     john               david
3     john               kevin
4    david                john
5    david               kevin
6    kevin                john
7    kevin               david
8    jason                 bob
9    jason              robert
10     bob               jason
11     bob              robert
12  robert               jason
13  robert                 bob
14    marc               peter
15   peter                marc
like image 116
piRSquared Avatar answered Dec 27 '22 09:12

piRSquared


One way would be:

cross = pd.crosstab(df['student'], df['exam'])
res = cross.dot(cross.T)
res.where(np.triu(res, k=1).astype('bool')).stack()
Out: 
student  student
bob      jason      1.0
         robert     1.0
david    john       1.0
         kevin      1.0
jason    robert     1.0
john     kevin      1.0
marc     peter      1.0
nik      ted        1.0
dtype: float64

The dot product generates a binary matrix for the co occurrences. In order not to repeat the same pairs, I filter them with where and stack. The index of the resulting Series is the students that have the same exam.

like image 30
ayhan Avatar answered Dec 27 '22 11:12

ayhan