Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Process multiple-answer questionnaire (from Google Forms) results with pandas

I have a Google Form which I am using to collect survey data (for this question I'll be using an example form) which has questions which can have multiple answers, selected using a set of checkboxes.

When I get the data from the form and import it into pandas I get this:

             Timestamp    What sweets do you like?
0  23/11/2013 13:22:30  Chocolate, Toffee, Popcorn
1  23/11/2013 13:22:34                   Chocolate
2  23/11/2013 13:22:39      Toffee, Popcorn, Fruit
3  23/11/2013 13:22:45               Fudge, Toffee
4  23/11/2013 13:22:48                     Popcorn

I'd like to do statistics on the results of the question (how many people liked Chocolate, what proportion of people liked Toffee etc). The problem is, that all of the answers are within one column, so grouping by that column and asking for counts doesn't work.

Is there a simple way within Pandas to convert this sort of data frame into one where there are multiple columns called Chocolate, Toffee, Popcorn, Fudge and Fruit, and each of those is boolean (1 for yes, 0 for no)? I can't think of a sensible way to do this, and I'm not sure whether it would really help (doing the aggregations that I want to do might be harder in that way).

like image 921
robintw Avatar asked Mar 19 '26 17:03

robintw


2 Answers

Few days ago I encountered same problem and after some searches, I found str.get_dummies function in pandas documentation. Let's see how it works:

pandas.Series.str.get_dummies

As mentioned in documentation, str.get_dummies split each string in the Series by sep and return a DataFrame of dummy/indicator variables.

Here is the simplified version of above mentioned DataFrame:

In [27]: df
Out[27]: 
     What sweets do you like?
0  Chocolate, Toffee, Popcorn
1                   Chocolate
2      Toffee, Popcorn, Fruit
3               Fudge, Toffee
4                     Popcorn

The only argument that we need to specify in str.get_dummies is the sep, which in our case is comma:

In [28]: df['What sweets do you like?'].str.get_dummies(sep=', ')
Out[28]: 
   Chocolate  Fruit  Fudge  Popcorn  Toffee
0          1      0      0        1       1
1          1      0      0        0       0
2          0      1      0        1       1
3          0      0      1        0       1
4          0      0      0        1       0

Attention:

Note that there is a space after comma in sep argument because space itself is a character, if we don't include it in the sep, the result will be something like below which is obviously wrong:

In [29]: df['What sweets do you like?'].str.get_dummies(sep=',')
Out[29]: 
    Fruit   Popcorn   Toffee  Chocolate  Fudge  Popcorn  Toffee
0       0         1        1          1      0        0       0
1       0         0        0          1      0        0       0
2       1         1        0          0      0        0       1
3       0         0        1          0      1        0       0
4       0         0        0          0      0        1       0

As a rule of thumb, always pay attention to write the separator exactly!

like image 161
Mohammad Reza Malekpour Avatar answered Mar 22 '26 05:03

Mohammad Reza Malekpour


Read in as a fixed width table, droping the first column

In [30]: df = pd.read_fwf(StringIO(data),widths=[3,20,27]).drop(['Unnamed: 0'],axis=1)

In [31]: df
Out[31]: 
             Timestamp What sweets do you like0
0  23/11/2013 13:22:34                Chocolate
1  23/11/2013 13:22:39   Toffee, Popcorn, Fruit
2  23/11/2013 13:22:45            Fudge, Toffee
3  23/11/2013 13:22:48                  Popcorn

Make the timestamp into a proper datetime64 dtype (not necessary for this exercise), but almost always what you want.

In [32]: df['Timestamp'] = pd.to_datetime(df['Timestamp'])

New column names

In [33]: df.columns = ['date','sweets']

In [34]: df
Out[34]: 
                 date                  sweets
0 2013-11-23 13:22:34               Chocolate
1 2013-11-23 13:22:39  Toffee, Popcorn, Fruit
2 2013-11-23 13:22:45           Fudge, Toffee
3 2013-11-23 13:22:48                 Popcorn

In [35]: df.dtypes
Out[35]: 
date      datetime64[ns]
sweets            object
dtype: object

Split the sweet column from a string into a list

In [37]: df['sweets'].str.split(',\s*')
Out[37]: 
0                 [Chocolate]
1    [Toffee, Popcorn, Fruit]
2             [Fudge, Toffee]
3                   [Popcorn]
Name: sweets, dtype: object

The key step, this creates a dummy matrix for where the values exist

In [38]: df['sweets'].str.split(',\s*').apply(lambda x: pd.Series(1,index=x))
Out[38]: 
   Chocolate  Fruit  Fudge  Popcorn  Toffee
0          1    NaN    NaN      NaN     NaN
1        NaN      1    NaN        1       1
2        NaN    NaN      1      NaN       1
3        NaN    NaN    NaN        1     NaN

Final result where we fill the nans to 0, then astype to bool to make True/False. Then concatate it to the original frame

In [40]: pd.concat([df,df['sweets'].str.split(',\s*').apply(lambda x: pd.Series(1,index=x)).fillna(0).astype(bool)],axis=1)
Out[40]: 
                 date                  sweets Chocolate  Fruit  Fudge Popcorn Toffee
0 2013-11-23 13:22:34               Chocolate      True  False  False   False  False
1 2013-11-23 13:22:39  Toffee, Popcorn, Fruit     False   True  False    True   True
2 2013-11-23 13:22:45           Fudge, Toffee     False  False   True   False   True
3 2013-11-23 13:22:48                 Popcorn     False  False  False    True  False
like image 27
Jeff Avatar answered Mar 22 '26 07:03

Jeff



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!