
Unstack and return value counts for each variable?

I have a data frame that records the programming-language choices of 19717 respondents, collected through multiple-choice questions. The first column is the respondent's gender and the remaining columns are the languages they picked. The data frame is shown below; each selected response is recorded under the column of the same name, and a language that was not selected shows up as NaN.

ID     Gender              Python    Bash    R    JavaScript    C++
0      Male                Python    nan     nan  JavaScript    nan
1      Female              nan       nan     R    JavaScript    C++
2      Prefer not to say   Python    Bash    nan  nan           nan
3      Male                nan       nan     nan  nan           nan
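
For anyone who wants to reproduce this, the four sample rows above can be rebuilt roughly as follows. This is only a sketch: it assumes the blanks are real NaN values (np.nan) rather than the string "nan", and that ID is an ordinary column rather than the index.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the table above; np.nan marks "not selected".
df = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "Gender": ["Male", "Female", "Prefer not to say", "Male"],
    "Python": ["Python", np.nan, "Python", np.nan],
    "Bash": [np.nan, np.nan, "Bash", np.nan],
    "R": [np.nan, "R", np.nan, np.nan],
    "JavaScript": ["JavaScript", "JavaScript", np.nan, np.nan],
    "C++": [np.nan, "C++", np.nan, np.nan],
})
print(df)
```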

What I want is a table of counts per gender. For example, if 5000 men coded in Python and 3000 women in JavaScript, I should get this:

Gender              Python    Bash    R       JavaScript    C++
Male                5000      1000    800     1500          1000
Female              4000      500     1500    3000          800
Prefer Not To Say   2000      ...     ...     ...           860

I tried stacking the columns and counting the values:

df.iloc[:, [*range(0, 13)]].stack().value_counts()

Male                       16138
Python                     12841
SQL                         6532
R                           4588
Female                      3212
Java                        2267
C++                         2256
Javascript                  2174
Bash                        2037
C                           1672
MATLAB                      1516
Other                       1148
TypeScript                   389
Prefer not to say            318
None                          83
Prefer to self-describe       49
dtype: int64

This counts every value across the whole frame, not per gender, so it is not the table described above. Can this be done in pandas?

shiv_90 asked Nov 25 '19 13:11



4 Answers

Another idea: join each row's non-null values into a single string along axis 1, build indicator columns with str.get_dummies, then group by Gender and sum:

(df.loc[:, 'Python':]
 .apply(lambda x: '|'.join(x.dropna()), axis=1)
 .str.get_dummies('|')
 .groupby(df['Gender']).sum())

[out]

                   Bash  C++  JavaScript  Python  R
Gender                                             
Female                0    1           1       0  1
Male                  0    0           1       1  0
Prefer not to say     1    0           0       1  0
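
To see what the two intermediate steps produce, here is a minimal sketch on a made-up two-column frame (the name langs and its values are hypothetical, chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame, only to show the intermediate values.
langs = pd.DataFrame({
    "Python": ["Python", np.nan, np.nan],
    "R": ["R", "R", np.nan],
})

# Step 1: join the non-missing choices of each row into one string.
# A row with no choices becomes the empty string.
joined = langs.apply(lambda row: "|".join(row.dropna()), axis=1)

# Step 2: split on "|" and build one 0/1 indicator column per distinct token.
dummies = joined.str.get_dummies("|")
print(dummies)
```

The row with no selections ends up as all zeros, so it adds nothing to the per-gender sums.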

Chris Adams answered Oct 11 '22 12:10


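You can set Gender as the index, compare each column with its own name, and sum per gender:

# iloc[:, 1:] drops the ID column; omit it if ID is already the index
s = df.set_index('Gender').iloc[:, 1:]
# sum(level=0) was removed in pandas 2.0; group on the index level instead
s.eq(s.columns).astype(int).groupby(level=0, sort=False).sum()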

Output:

                   Python  Bash  R  JavaScript  C++
Gender                                             
Male                    1     0  0           1    0
Female                  0     0  1           1    1
Prefer not to say       1     1  0           0    0

Quang Hoang answered Oct 11 '22 13:10


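You can melt the frame to long form and use crosstab:

import numpy as np
import pandas as pd

# Long form: one row per (respondent, language) pair
df1 = pd.melt(df, id_vars=['ID', 'Gender'], var_name='Language', value_name='Choice')
# 1 where the cell holds the language's own name, 0 otherwise (including NaN)
df1['Choice'] = np.where(df1['Choice'] == df1['Language'], 1, 0)
final = pd.crosstab(df1['Gender'], df1['Language'], values=df1['Choice'], aggfunc='sum')

print(final)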
Language              Bash  C++  JavaScript  Python  R
Gender                                              
Female                  0    1           1       0  1
Male                    0    0           1       1  0
Prefer not to say       1    0           0       1  0

Umar.H answered Oct 11 '22 14:10


Assuming your nan values are real NaN (i.e. not the string 'nan'), you can take advantage of count, which ignores NaN, to get the desired output:

df_out = df.iloc[:,2:].groupby(df.Gender, sort=False).count()

Out[175]:
                   Python  Bash  R  JavaScript  C++
Gender
Male                    1     0  0           1    0
Female                  0     0  1           1    1
Prefer not to say       1     1  0           0    0
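
If the missing answers turn out to be the literal string "nan" instead, count would include them. One possible workaround, sketched here on a made-up frame (raw and its values are hypothetical), is to convert those strings to real missing values before counting:

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which missing answers arrived as the text "nan".
raw = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Python": ["Python", "nan", "nan"],
    "R": ["nan", "R", "nan"],
})

# Turn the literal "nan" strings into real missing values first.
langs = raw[["Python", "R"]].replace("nan", np.nan)

# count() now skips the converted cells, as in the approach above.
counts = langs.groupby(raw["Gender"], sort=False).count()
print(counts)
```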

Andy L. answered Oct 11 '22 12:10