
Pandas Crosstabulation and counting

Tags: python, pandas

I am using pandas in Python. I have a column of strings, each holding one or more comma-separated names, and I would like a cross-tabulation that counts how often each pair of names appears together in the same row.

E.g., I have the following input:

1: Andi
2: Andi, Cindy
3: Thomas, Cindy
4: Cindy, Thomas

And I would like to have the following output, in which the combination of Andi and Thomas does not appear in the data, but Cindy and Thomas appear together twice:

          Andi  Thomas  Cindy
    Andi    1     0      1
    Thomas  0     1      2
    Cindy   1     2      1

Does anybody have an idea how I could handle this? That would be really great!

Many thanks and regards,

Andi

Andi Maier asked Jul 10 '17 16:07




1 Answer

You can generate the dummy columns first:

df['A'].str.get_dummies(', ')
Out: 
   Andi  Cindy  Thomas
0     1      0       0
1     1      1       0
2     0      1       1
3     0      1       1
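
The snippet above assumes a DataFrame `df` with the strings in a column named `A`; the question does not show how the data is stored, so that layout is an assumption. A minimal, self-contained sketch that builds the sample data and produces the same indicator table might look like this:

```python
import pandas as pd

# Sample rows from the question. The DataFrame name `df` and the column
# name 'A' follow the answer's snippet; the real data may be laid out differently.
df = pd.DataFrame({'A': ['Andi', 'Andi, Cindy', 'Thomas, Cindy', 'Cindy, Thomas']})

# Split every string on ', ' and build one 0/1 indicator column per name.
tab = df['A'].str.get_dummies(', ')
print(tab)
```

Note that the separator is the two-character string `', '` (comma plus space). If the real data uses a bare comma or inconsistent spacing, the names would need to be normalised first.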

Then multiply the transpose of that table by the table itself (a dot product):

tab = df['A'].str.get_dummies(', ')

tab.T.dot(tab)
Out: 
        Andi  Cindy  Thomas
Andi       2      1       0
Cindy      1      3       2
Thomas     0      2       2
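
The dot product works because entry (i, j) of the result sums, over all rows, the product of the indicators for name i and name j, and that product is 1 only when both names occur in the same row. As a rough cross-check, the same pair counts can be obtained in plain Python. This is only a sketch on the question's four sample rows, not part of the original answer:

```python
from collections import Counter
from itertools import combinations

# The four sample rows from the question.
rows = ['Andi', 'Andi, Cindy', 'Thomas, Cindy', 'Cindy, Thomas']

pair_counts = Counter()
for row in rows:
    # Sort the names so that (Cindy, Thomas) and (Thomas, Cindy) count as the same pair.
    names = sorted(set(row.split(', ')))
    # Count every unordered pair of names once per row.
    pair_counts.update(combinations(names, 2))

print(pair_counts)
```

The counts for the off-diagonal cells should agree with the matrix above; a pair that never occurs, such as Andi and Thomas, is simply absent from the counter and reads as 0.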

The diagonal entries give the number of rows in which each person occurs, and the off-diagonal entries count the rows in which two people occur together. If you need to set the diagonal to 1, as in the question's expected output, there are several alternatives. One of them is np.fill_diagonal from NumPy.

import numpy as np

co_occurrence = tab.T.dot(tab)
np.fill_diagonal(co_occurrence.values, 1)
co_occurrence
Out: 
        Andi  Cindy  Thomas
Andi       1      1       0
Cindy      1      1       2
Thomas     0      2       1
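
Whether an in-place write through `.values` reaches the DataFrame can depend on the pandas version and its copy-on-write settings, so the following end-to-end sketch fills the diagonal on an explicit copy and rebuilds the frame instead. It also reorders rows and columns to match the layout in the question. The column name `A` and the variable name `order` are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Sample rows from the question; the column name 'A' is assumed.
df = pd.DataFrame({'A': ['Andi', 'Andi, Cindy', 'Thomas, Cindy', 'Cindy, Thomas']})
tab = df['A'].str.get_dummies(', ')

# Take an explicit copy of the counts so the fill never depends on
# whether pandas hands out a writable view of its internal data.
counts = tab.T.dot(tab).to_numpy(copy=True)
np.fill_diagonal(counts, 1)
co_occurrence = pd.DataFrame(counts, index=tab.columns, columns=tab.columns)

# Reorder rows and columns to match the layout shown in the question.
order = ['Andi', 'Thomas', 'Cindy']
co_occurrence = co_occurrence.loc[order, order]
print(co_occurrence)
```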

ayhan answered Sep 30 '22 04:09