 

Converting pandas column of comma-separated strings into dummy variables

In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column however has multiple values separated by commas:

0    'a'
1    'a,b,c'
2    'a,b,d'
3    'd'
4    'c,d'

Ultimately, I want one binary column for each possible discrete value; in other words, the final column count should equal the number of unique values in the original column. I imagine I'd have to use split() to get each separate value, but I'm not sure what to do afterwards. Any hint is much appreciated!

Edit: Additional twist. Column has null values. And in response to comment, the following is the desired output. Thanks!

   a  b  c  d
0  1  0  0  0
1  1  1  1  0
2  1  1  0  1
3  0  0  0  1
4  0  0  1  1
breakbotz asked Oct 21 '17 19:10



2 Answers

Use str.get_dummies

df['col'].str.get_dummies(sep=',')

    a   b   c   d
0   1   0   0   0
1   1   1   1   0
2   1   1   0   1
3   0   0   0   1
4   0   0   1   1
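
On the null values mentioned in the question's edit: as far as I know, str.get_dummies leaves a missing entry as a row of all zeros, so no separate fillna step should be needed. A minimal sketch, assuming the nulls are NaN and using made-up sample data (the question's column with one value replaced by NaN):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the question's column with one missing value.
s = pd.Series(['a', 'a,b,c', np.nan, 'd', 'c,d'], name='col')

# Split each string on ',' and build one 0/1 indicator column per token.
dummies = s.str.get_dummies(sep=',')

# The row for the missing value (index 2) contains only zeros.
print(dummies)
```

If you would rather flag missing rows explicitly, one option is to fill them with a placeholder token before encoding, which then gets its own column.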

Edit: Updating the answer to address some questions.

Qn 1: Why does the Series method str.get_dummies not accept the argument prefix=... while pandas.get_dummies() does?

Series.str.get_dummies is a Series-level method (as the name suggests!). It one-hot encodes the values in a single Series (or DataFrame column), so there is no need for a prefix. pandas.get_dummies, on the other hand, can one-hot encode multiple columns, in which case the prefix parameter identifies the original column.

If you want to apply a prefix to the output of str.get_dummies, you can always use DataFrame.add_prefix:

df['col'].str.get_dummies(sep=',').add_prefix('col_')
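
To make the difference concrete, here is a small sketch comparing the two. The frame and the column name 'kind' are made up for illustration; only 'col' follows the question.

```python
import pandas as pd

# Hypothetical frame: 'kind' holds single values, 'col' holds comma-separated values.
df = pd.DataFrame({'kind': ['x', 'y', 'x'], 'col': ['a', 'a,b', 'b']})

# pandas.get_dummies on a DataFrame prefixes each dummy with its source column name.
single = pd.get_dummies(df[['kind']])
print(single.columns.tolist())

# str.get_dummies returns bare token names; add_prefix supplies the identifier.
multi = df['col'].str.get_dummies(sep=',').add_prefix('col_')
print(multi.columns.tolist())
```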

Qn 2: If you have more than one column to begin with, how do you merge the dummies back into the original frame? You can use pd.concat to combine the one-hot encoded columns with the rest of the columns in the dataframe, then drop the original column.

import pandas as pd
df = pd.DataFrame({'other':['x','y','x','x','q'],'col':['a','a,b,c','a,b,d','d','c,d']})
# axis=1 places the dummies side by side with the existing columns
df = pd.concat([df, df['col'].str.get_dummies(sep=',')], axis=1).drop(columns='col')

  other a   b   c   d
0   x   1   0   0   0
1   y   1   1   1   0
2   x   1   1   0   1
3   x   0   0   0   1
4   q   0   0   1   1
Vaishali answered Oct 19 '22 20:10


The str.get_dummies method does not accept a prefix parameter, but you can rename the columns of the returned dummy DataFrame:

df['col'].str.get_dummies(sep=',').rename(lambda x: 'col_' + x, axis='columns')
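
For a plain prefix, rename with a function should give the same result as add_prefix; rename is mainly handy when you need something more general than a prefix. A quick check on the question's sample data (the variable names here are just for illustration):

```python
import pandas as pd

# The question's sample column.
s = pd.Series(['a', 'a,b,c', 'a,b,d', 'd', 'c,d'], name='col')
dummies = s.str.get_dummies(sep=',')

# Two ways to prefix the dummy column names.
via_rename = dummies.rename(lambda x: 'col_' + x, axis='columns')
via_prefix = dummies.add_prefix('col_')

# Same labels and same values either way.
print(via_rename.equals(via_prefix))
```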
micmia answered Oct 19 '22 21:10