Create dummies from column with multiple values in pandas

I am looking for a Pythonic way to handle the following problem.

The pandas.get_dummies() function works well for creating dummies from a categorical column of a DataFrame. For example, if the column has values in ['A', 'B'], get_dummies() creates 2 dummy variables and assigns 0 or 1 accordingly.
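As a minimal sketch of that baseline behaviour (the three-row frame is made up for illustration, and `dtype=int` is passed on the assumption that 0/1 integers are wanted, since recent pandas versions return booleans by default):

```python
import pandas as pd

# Hypothetical toy data: one categorical column with values 'A' and 'B'.
df = pd.DataFrame({"label": ["A", "B", "A"]})

# One indicator column per distinct value; dtype=int forces 0/1 output.
dummies = pd.get_dummies(df["label"], dtype=int)
print(dummies)
```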

Now, I need to handle this situation. A single column, let's call it 'label', has values like ['A', 'B', 'C', 'D', 'A*C', 'C*D']. get_dummies() creates 6 dummies, but I only want 4 of them, so that a row could have multiple 1s.
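The situation can be reproduced with a short sketch. The Series below simply reuses the example values from the paragraph above; the variable names are illustrative only:

```python
import pandas as pd

# The example values from the question, including the combined labels.
labels = pd.Series(["A", "B", "C", "D", "A*C", "C*D"], name="label")

# get_dummies treats 'A*C' and 'C*D' as categories of their own,
# so it produces one column per distinct string.
naive = pd.get_dummies(labels, dtype=int)
print(naive.columns.tolist())
print(naive.shape[1])  # number of dummy columns
```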

Is there a way to handle this in a pythonic way? I could only think of some step-by-step algorithm to get it, but that would not include get_dummies(). Thanks

Edited, hope it is more clear!

mkln asked Sep 19 '13 08:09



1 Answer

I know it's been a while since this question was asked, but there is (at least now there is) a one-liner that is supported by the documentation:

    In [4]: df
    Out[4]:
           label
    0  (a, c, e)
    1     (a, d)
    2       (b,)
    3     (d, e)

    In [5]: df['label'].str.join(sep='*').str.get_dummies(sep='*')
    Out[5]:
       a  b  c  d  e
    0  1  0  1  0  1
    1  1  0  0  1  0
    2  0  1  0  0  0
    3  0  0  0  1  1
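In the question the labels are already '*'-delimited strings rather than tuples, so the str.join step is probably not needed there. A minimal sketch under that assumption, using the example values from the question (the frame and variable names are illustrative):

```python
import pandas as pd

# Labels as described in the question: plain strings, combined with '*'.
df = pd.DataFrame({"label": ["A", "B", "C", "D", "A*C", "C*D"]})

# Series.str.get_dummies splits each string on the separator and
# creates one indicator column per distinct piece.
dummies = df["label"].str.get_dummies(sep="*")
print(dummies)

# Optionally attach the indicators to the original frame.
result = pd.concat([df, dummies], axis=1)
```

This should give four columns (A, B, C, D), with the 'A*C' and 'C*D' rows each carrying two 1s.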
offbyone answered Oct 02 '22 19:10