Converting pandas column of comma-separated strings into dummy variables

In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column however has multiple values separated by commas:

0    'a'
1    'a,b,c'
2    'a,b,d'
3    'd'
4    'c,d'

Ultimately, I'd want to have a binary column for each possible discrete value; in other words, the final column count equals the number of unique values in the original column. I imagine I'd have to use split() to get each separate value, but I'm not sure what to do afterwards. Any hints are much appreciated!

Edit: An additional twist: the column has null values. In response to a comment, the following is the desired output. Thanks!

   a  b  c  d
0  1  0  0  0
1  1  1  1  0
2  1  1  0  1
3  0  0  0  1
4  0  0  1  1

asked Oct 21 '17 19:10

breakbotz

2 Answers

Use str.get_dummies:

df['col'].str.get_dummies(sep=',')

    a   b   c   d
0   1   0   0   0
1   1   1   1   0
2   1   1   0   1
3   0   0   0   1
4   0   0   1   1
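The question's edit mentions null values, which the example above does not show. As a minimal sketch (the sample Series and the `is_missing` column name are assumptions for illustration, not part of the original post), a missing entry comes out of str.get_dummies as a row of zeros:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the question's column with one missing value added.
col = pd.Series(["a", "a,b,c", np.nan, "d", "c,d"])

# Missing entries produce a row of zeros rather than raising an error.
dummies = col.str.get_dummies(sep=",")
print(dummies)

# Optional: keep a separate indicator so a missing row can be told apart
# from a row that simply has no values (the column name is an assumption).
dummies["is_missing"] = col.isna().astype(int)
print(dummies)
```

Whether an all-zero row is acceptable for nulls depends on the downstream use; the indicator column is just one way to keep that information.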

Edit: Updating the answer to address some questions.

Qn 1: Why does the Series method str.get_dummies not accept the prefix=... argument while pandas.get_dummies() does?

Series.str.get_dummies is a Series-level method (as the name suggests!). It one-hot encodes the values of a single Series (or DataFrame column), so there is no need for a prefix. pandas.get_dummies, on the other hand, can one-hot encode multiple columns, in which case the prefix parameter identifies the original column.
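To make the contrast concrete, here is a small sketch with a made-up two-column frame (the `color` and `size` columns are assumptions for illustration). By default pandas.get_dummies prefixes each dummy with its source column name, while str.get_dummies returns the bare values:

```python
import pandas as pd

# Hypothetical frame with two categorical columns.
frame = pd.DataFrame({"color": ["red", "blue"], "size": ["S", "M"]})

# pandas.get_dummies encodes both columns and prefixes each dummy
# with the name of the column it came from.
encoded = pd.get_dummies(frame)
print(list(encoded.columns))  # ['color_blue', 'color_red', 'size_M', 'size_S']

# Series.str.get_dummies works on one Series, so the labels are the raw values.
tags = pd.Series(["a,b", "c"]).str.get_dummies(sep=",")
print(list(tags.columns))  # ['a', 'b', 'c']
```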

If you want to apply a prefix to the output of str.get_dummies, you can always use DataFrame.add_prefix:

df['col'].str.get_dummies(sep=',').add_prefix('col_')

Qn 2: If you have more than one column to begin with, how do you merge the dummies back into the original frame?

You can use pd.concat to merge the one-hot encoded columns with the rest of the columns in the DataFrame, then drop the original column.

import pandas as pd

df = pd.DataFrame({'other': ['x', 'y', 'x', 'x', 'q'], 'col': ['a', 'a,b,c', 'a,b,d', 'd', 'c,d']})
df = pd.concat([df, df['col'].str.get_dummies(sep=',')], axis=1).drop(columns='col')

  other  a  b  c  d
0     x  1  0  0  0
1     y  1  1  1  0
2     x  1  1  0  1
3     x  0  0  0  1
4     q  0  0  1  1
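An equivalent way to get the same result, sketched here as an alternative rather than as the answer's own method, is to pop the column out of the frame and join the dummies back on the index. The variable names are chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "other": ["x", "y", "x", "x", "q"],
    "col": ["a", "a,b,c", "a,b,d", "d", "c,d"],
})

# pop() removes 'col' from df in place and returns it as a Series.
raw = df.pop("col")

# join() aligns the dummy columns with the remaining columns by index.
result = df.join(raw.str.get_dummies(sep=","))
print(result)
```

Because pop() mutates the frame, this variant avoids the separate drop step; if the original frame must stay untouched, work on a copy first.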

answered Oct 19 '22 20:10

Vaishali

The str.get_dummies method does not accept a prefix parameter, but you can rename the columns of the returned dummy DataFrame:

data['col'].str.get_dummies(sep=',').rename(lambda x: 'col_' + x, axis='columns')

answered Oct 19 '22 21:10

micmia


Tags: python, split, pandas, dummy-variable
