I have a data frame that records responses of 19717 people's choice of programing languages through multiple choice questions. The first column is of course the gender of the respondent while the rest are the choices they picked. The data frame is shown below, with each response being recorded as the same name as column. If no response is selected, then this results in a NaN
.
ID Gender Python Bash R JavaScript C++
0 Male Python nan nan JavaScript nan
1 Female nan nan R JavaScript C++
2 Prefer not to say Python Bash nan nan nan
3 Male nan nan nan nan nan
What I want is a table that returns the count based on Gender
. Hence if 5000 men coded in Python and 3000 women in JS, then I should get this:
Gender Python Bash R JavaScript C++
Male 5000 1000 800 1500 1000
Female 4000 500 1500 3000 800
Prefer Not To Say 2000 ... ... ... 860
I have tried some of the options:
df.iloc[:, [*range(0, 13)]].stack().value_counts()
Male 16138
Python 12841
SQL 6532
R 4588
Female 3212
Java 2267
C++ 2256
Javascript 2174
Bash 2037
C 1672
MATLAB 1516
Other 1148
TypeScript 389
Prefer not to say 318
None 83
Prefer to self-describe 49
dtype: int64
And it's not what is required as described above. Can this be done in pandas?
Unstack is also similar to the stack method, it returns a DataFrame having a new level of column labels. It has 2 parameters which are level and fill_value. The level parameter takes an integer, string, list of these, and the Default value is 1 (1 is the last level).
Pandas provides various built-in methods for reshaping DataFrame. Among them, stack() and unstack() are the 2 most popular methods for restructuring columns and rows (also known as index). stack() : stack the prescribed level(s) from column to row. unstack() : unstack the prescribed level(s) from row to column.
Group by and value_counts Groupby is a very powerful pandas method. You can group by one column and count the values of another column per this column value using value_counts. Using groupby and value_counts we can count the number of activities each person did.
Another idea would be to apply
join
values along axis 1, get_dummies
then groupby
:
(df.loc[:, 'Python':]
.apply(lambda x: '|'.join(x.dropna()), axis=1)
.str.get_dummies('|')
.groupby(df['Gender']).sum())
[out]
Bash C++ JavaScript Python R
Gender
Female 0 1 1 0 1
Male 0 0 1 1 0
Prefer not to say 1 0 0 1 0
You can set Gender
as index and sum:
s = df.set_index('Gender').iloc[:, 1:]
s.eq(s.columns).astype(int).sum(level=0)
Output:
Python Bash R JavaScript C++
Gender
Male 1 0 0 1 0
Female 0 0 1 1 1
Prefer not to say 1 1 0 0 0
You can melt
and use crosstab
df1 = pd.melt(df,id_vars=['ID','Gender'],var_name='Language',value_name='Choice')
df1['Choice'] = np.where(df1['Choice'] == df1['Language'],1,0)
final= pd.crosstab(df1['Gender'],df1['Language'],values=df1['Choice'],aggfunc='sum')
print(final)
Language Bash C++ JavaScript Python R
Gender
Female 0 1 1 0 1
Male 0 0 1 1 0
Prefer not to say 1 0 0 1 0
Assume your nan
is NaN
(i.e. it is not string), we may take advantage of count
because it ignores NaN
to get desired output
df_out = df.iloc[:,2:].groupby(df.Gender, sort=False).count()
Out[175]:
Python Bash R JavaScript C++
Gender
Male 1 0 0 1 0
Female 0 0 1 1 1
Prefer not to say 1 1 0 0 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With