I have a Google Form which I am using to collect survey data (for this question I'll be using an example form) which has questions which can have multiple answers, selected using a set of checkboxes.
When I get the data from the form and import it into pandas I get this:
Timestamp What sweets do you like?
0 23/11/2013 13:22:30 Chocolate, Toffee, Popcorn
1 23/11/2013 13:22:34 Chocolate
2 23/11/2013 13:22:39 Toffee, Popcorn, Fruit
3 23/11/2013 13:22:45 Fudge, Toffee
4 23/11/2013 13:22:48 Popcorn
I'd like to do statistics on the results of the question (how many people liked Chocolate, what proportion of people liked Toffee etc). The problem is, that all of the answers are within one column, so grouping by that column and asking for counts doesn't work.
Is there a simple way within Pandas to convert this sort of data frame into one where there are multiple columns called Chocolate, Toffee, Popcorn, Fudge and Fruit, and each of those is boolean (1 for yes, 0 for no)? I can't think of a sensible way to do this, and I'm not sure whether it would really help (doing the aggregations that I want to do might be harder in that way).
Few days ago I encountered same problem and after some searches, I found str.get_dummies function in pandas documentation. Let's see how it works:
As mentioned in documentation, str.get_dummies split each string in the Series by sep and return a DataFrame of dummy/indicator variables.
Here is the simplified version of above mentioned DataFrame:
In [27]: df
Out[27]:
What sweets do you like?
0 Chocolate, Toffee, Popcorn
1 Chocolate
2 Toffee, Popcorn, Fruit
3 Fudge, Toffee
4 Popcorn
The only argument that we need to specify in str.get_dummies is the sep, which in our case is comma:
In [28]: df['What sweets do you like?'].str.get_dummies(sep=', ')
Out[28]:
Chocolate Fruit Fudge Popcorn Toffee
0 1 0 0 1 1
1 1 0 0 0 0
2 0 1 0 1 1
3 0 0 1 0 1
4 0 0 0 1 0
Note that there is a space after comma in sep argument because space itself is a character, if we don't include it in the sep, the result will be something like below which is obviously wrong:
In [29]: df['What sweets do you like?'].str.get_dummies(sep=',')
Out[29]:
Fruit Popcorn Toffee Chocolate Fudge Popcorn Toffee
0 0 1 1 1 0 0 0
1 0 0 0 1 0 0 0
2 1 1 0 0 0 0 1
3 0 0 1 0 1 0 0
4 0 0 0 0 0 1 0
As a rule of thumb, always pay attention to write the separator exactly!
Read in as a fixed width table, droping the first column
In [30]: df = pd.read_fwf(StringIO(data),widths=[3,20,27]).drop(['Unnamed: 0'],axis=1)
In [31]: df
Out[31]:
Timestamp What sweets do you like0
0 23/11/2013 13:22:34 Chocolate
1 23/11/2013 13:22:39 Toffee, Popcorn, Fruit
2 23/11/2013 13:22:45 Fudge, Toffee
3 23/11/2013 13:22:48 Popcorn
Make the timestamp into a proper datetime64 dtype (not necessary for this exercise), but almost always what you want.
In [32]: df['Timestamp'] = pd.to_datetime(df['Timestamp'])
New column names
In [33]: df.columns = ['date','sweets']
In [34]: df
Out[34]:
date sweets
0 2013-11-23 13:22:34 Chocolate
1 2013-11-23 13:22:39 Toffee, Popcorn, Fruit
2 2013-11-23 13:22:45 Fudge, Toffee
3 2013-11-23 13:22:48 Popcorn
In [35]: df.dtypes
Out[35]:
date datetime64[ns]
sweets object
dtype: object
Split the sweet column from a string into a list
In [37]: df['sweets'].str.split(',\s*')
Out[37]:
0 [Chocolate]
1 [Toffee, Popcorn, Fruit]
2 [Fudge, Toffee]
3 [Popcorn]
Name: sweets, dtype: object
The key step, this creates a dummy matrix for where the values exist
In [38]: df['sweets'].str.split(',\s*').apply(lambda x: pd.Series(1,index=x))
Out[38]:
Chocolate Fruit Fudge Popcorn Toffee
0 1 NaN NaN NaN NaN
1 NaN 1 NaN 1 1
2 NaN NaN 1 NaN 1
3 NaN NaN NaN 1 NaN
Final result where we fill the nans to 0, then astype to bool to make True/False. Then concatate it to the original frame
In [40]: pd.concat([df,df['sweets'].str.split(',\s*').apply(lambda x: pd.Series(1,index=x)).fillna(0).astype(bool)],axis=1)
Out[40]:
date sweets Chocolate Fruit Fudge Popcorn Toffee
0 2013-11-23 13:22:34 Chocolate True False False False False
1 2013-11-23 13:22:39 Toffee, Popcorn, Fruit False True False True True
2 2013-11-23 13:22:45 Fudge, Toffee False False True False True
3 2013-11-23 13:22:48 Popcorn False False False True False
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With