Unstack and return value counts for each variable?

Tags: python, pandas, dataframe

I have a data frame that records the programming languages chosen by 19717 respondents in a multiple-choice question. The first column is the respondent's gender and the remaining columns are the choices they picked. The data frame is shown below; each selected response is recorded under the column of the same name, and an option that was not selected is stored as NaN.

ID     Gender              Python    Bash    R    JavaScript    C++
0      Male                Python    nan     nan  JavaScript    nan
1      Female              nan       nan     R    JavaScript    C++
2      Prefer not to say   Python    Bash    nan  nan           nan
3      Male                nan       nan     nan  nan           nan

What I want is a table that returns the count based on Gender. Hence if 5000 men coded in Python and 3000 women in JS, then I should get this:

Gender              Python    Bash    R     JavaScript    C++
Male                5000      1000    800   1500          1000
Female              4000      500     1500  3000          800
Prefer Not To Say   2000      ...     ...   ...           860

I have tried some of the options:

df.iloc[:, [*range(0, 13)]].stack().value_counts()

Male                       16138
Python                     12841
SQL                         6532
R                           4588
Female                      3212
Java                        2267
C++                         2256
Javascript                  2174
Bash                        2037
C                           1672
MATLAB                      1516
Other                       1148
TypeScript                   389
Prefer not to say            318
None                          83
Prefer to self-describe       49
dtype: int64

This counts every value across the whole frame rather than per gender, so it is not the table described above. Can this be done in pandas?

asked Nov 25 '19 13:11

shiv_90

4 Answers

Another idea is to join each row's non-null values into one delimited string (apply along axis 1), expand that string with str.get_dummies, then group by Gender and sum:

(df.loc[:, 'Python':]
 .apply(lambda x: '|'.join(x.dropna()), axis=1)
 .str.get_dummies('|')
 .groupby(df['Gender']).sum())

[out]

                   Bash  C++  JavaScript  Python  R
Gender                                             
Female                0    1           1       0  1
Male                  0    0           1       1  0
Prefer not to say     1    0           0       1  0
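
To make the intermediate steps visible, here is a minimal sketch that rebuilds the four sample rows from the question and runs the same chain one stage at a time. The frame construction and the names joined, dummies and counts are assumptions added for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; NaN marks "not selected".
df = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "Gender": ["Male", "Female", "Prefer not to say", "Male"],
    "Python": ["Python", np.nan, "Python", np.nan],
    "Bash": [np.nan, np.nan, "Bash", np.nan],
    "R": [np.nan, "R", np.nan, np.nan],
    "JavaScript": ["JavaScript", "JavaScript", np.nan, np.nan],
    "C++": [np.nan, "C++", np.nan, np.nan],
})

# Step 1: one delimited string per respondent, e.g. "Python|JavaScript".
# A respondent with no selections becomes the empty string.
joined = df.loc[:, "Python":].apply(lambda row: "|".join(row.dropna()), axis=1)

# Step 2: one indicator column per distinct language found in the strings.
dummies = joined.str.get_dummies("|")

# Step 3: add up the indicators within each gender.
counts = dummies.groupby(df["Gender"]).sum()
print(counts)
```

Note that get_dummies orders the columns alphabetically, which is why the output above does not follow the original column order. Reindexing with the original language columns would restore it if that matters.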

answered Oct 11 '22 12:10

Chris Adams

You can set Gender as index and sum:

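s = df.set_index('Gender').iloc[:, 1:]  # drop the ID column
# sum(level=0) was removed in pandas 2.0; group on the index level instead
s.eq(s.columns).astype(int).groupby(level=0, sort=False).sum()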

Output:

                   Python  Bash  R  JavaScript  C++
Gender                                             
Male                    1     0  0           1    0
Female                  0     0  1           1    1
Prefer not to say       1     1  0           0    0
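
To see why the comparison works, it can help to look at the boolean mask before it is summed. The sketch below is a self-contained illustration on the question's sample rows; the frame construction and the names mask and counts are assumptions, and it groups on the index level with groupby, which should run on both older and current pandas releases.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; NaN marks "not selected".
df = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "Gender": ["Male", "Female", "Prefer not to say", "Male"],
    "Python": ["Python", np.nan, "Python", np.nan],
    "Bash": [np.nan, np.nan, "Bash", np.nan],
    "R": [np.nan, "R", np.nan, np.nan],
    "JavaScript": ["JavaScript", "JavaScript", np.nan, np.nan],
    "C++": [np.nan, "C++", np.nan, np.nan],
})

# Gender becomes the index; ID is dropped so only language columns remain.
s = df.set_index("Gender").drop(columns="ID")

# Each cell is compared with its own column name: True where the
# language was selected, False where the cell is NaN.
mask = s.eq(s.columns)
print(mask)

# Summing booleans counts the True cells per gender, in first-seen order.
counts = mask.groupby(level=0, sort=False).sum()
print(counts)
```

Because the mask tests for equality with the column name, a stray value such as a misspelled language would be counted as not selected, whereas a plain not-null test would count it.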

answered Oct 11 '22 13:10

Quang Hoang

You can melt and use crosstab

import numpy as np
import pandas as pd

df1 = pd.melt(df, id_vars=['ID', 'Gender'], var_name='Language', value_name='Choice')
df1['Choice'] = np.where(df1['Choice'] == df1['Language'], 1, 0)
final = pd.crosstab(df1['Gender'], df1['Language'], values=df1['Choice'], aggfunc='sum')

print(final)
Language              Bash  C++  JavaScript  Python  R
Gender                                              
Female                  0    1           1       0  1
Male                    0    0           1       1  0
Prefer not to say       1    0           0       1  0

answered Oct 11 '22 14:10

Umar.H

Assuming your nan values are real NaN (i.e. not the string 'nan'), you can take advantage of count, which ignores NaN, to get the desired output:

df_out = df.iloc[:,2:].groupby(df.Gender, sort=False).count()

Out[175]:
                   Python  Bash  R  JavaScript  C++
Gender
Male                    1     0  0           1    0
Female                  0     0  1           1    1
Prefer not to say       1     1  0           0    0
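
The assumption about NaN matters. The sketch below is a hypothetical illustration in which the sample rows hold the literal string "nan" (as could happen after a bad import): count then treats every cell as filled, and the strings have to be converted to real missing values first. The names raw, clean, naive and counts are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the sample data where "not selected"
# was stored as the string "nan" instead of a real missing value.
raw = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "Gender": ["Male", "Female", "Prefer not to say", "Male"],
    "Python": ["Python", "nan", "Python", "nan"],
    "Bash": ["nan", "nan", "Bash", "nan"],
    "R": ["nan", "R", "nan", "nan"],
    "JavaScript": ["JavaScript", "JavaScript", "nan", "nan"],
    "C++": ["nan", "C++", "nan", "nan"],
})

# count() only skips real missing values, so the strings are all counted.
naive = raw.iloc[:, 2:].groupby(raw["Gender"], sort=False).count()

# Convert the placeholder strings to NaN, then count as in the answer.
clean = raw.replace("nan", np.nan)
counts = clean.iloc[:, 2:].groupby(clean["Gender"], sort=False).count()
print(counts)
```

A quick check such as clean.iloc[:, 2:].isna().any().any() before counting can confirm that the frame really contains missing values rather than placeholder strings.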

answered Oct 11 '22 12:10

Andy L.
