Convert pandas DataFrame column of comma separated strings to one-hot encoded

I have a large DataFrame ('data') with a single column. Each row holds a string of comma-separated categories. I would like to one-hot encode this data.

For example,

data = {"mesh": ["A, B, C", "C,B", ""]}

From this I would like to get a dataframe consisting of:

index      A       B       C
0          1       1       1
1          0       1       1
2          0       0       0

How can I do this?

scutnex asked Oct 21 '17 15:10


1 Answer

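Note that this is not strictly one-hot encoding (OHE): a row can contain several 1s, or none at all, so the result is a multi-label indicator matrix.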

str.split + stack + get_dummies + sum

import pandas as pd

df = pd.DataFrame(data)
df

      mesh
0  A, B, C
1      C,B
2         

(df.mesh.str.split(r'\s*,\s*', expand=True)
   .stack()
   .str.get_dummies()
   .groupby(level=0)   # sum(level=0) was removed in pandas 2.0
   .sum())

   A  B  C
0  1  1  1
1  0  1  1
2  0  0  0
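
To make the chain above easier to follow, here is a minimal step-by-step sketch of the same idea. The intermediate names (`parts`, `long`, `dummies`, `result`) are chosen here for illustration and are not from the original answer; it assumes a pandas version where `groupby(level=0).sum()` is used in place of the older `sum(level=0)`.

```python
import pandas as pd

df = pd.DataFrame({"mesh": ["A, B, C", "C,B", ""]})

# Step 1: split on commas (allowing surrounding whitespace) into separate columns.
parts = df["mesh"].str.split(r"\s*,\s*", expand=True)

# Step 2: stack into one long Series indexed by (original row, position).
long = parts.stack()

# Step 3: build one indicator column per distinct non-empty label.
# Empty strings (and missing values) produce an all-zero row.
dummies = long.str.get_dummies()

# Step 4: collapse back to one row per original index.
result = dummies.groupby(level=0).sum()
print(result)
```

Because the empty string in row 2 survives the split as a single empty label, that row is still present after the final grouping, filled with zeros.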

apply + value_counts

(df.mesh.str.split(r'\s*,\s*', expand=True)
   .apply(pd.Series.value_counts, axis=1)
   .drop(columns='', errors='ignore')   # drop the '' label from the empty row
   .fillna(0)
   .astype(int))                        # fillna(downcast=...) is deprecated

   A  B  C
0  1  1  1
1  0  1  1
2  0  0  0

pd.crosstab

x = df.mesh.str.split(r'\s*,\s*', expand=True).stack()
# iloc[:, 1:] drops the leading '' column that comes from the empty row
pd.crosstab(x.index.get_level_values(0), x.values).iloc[:, 1:]

col_0  A  B  C
row_0         
0      1  1  1
1      0  1  1
2      0  0  0
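
One further option, which is an addition here and not one of the three approaches listed above: `Series.str.get_dummies` accepts a `sep` argument, so it may be enough to normalise the separators first and let it do the splitting. This is a sketch under the assumption that labels never contain commas themselves; the names `cleaned` and `shortcut` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"mesh": ["A, B, C", "C,B", ""]})

# Normalise "A, B, C" to "A,B,C": trim the ends, then remove
# whitespace around each comma (whitespace inside a label is kept).
cleaned = df["mesh"].str.strip().str.replace(r"\s*,\s*", ",", regex=True)

# get_dummies splits each string on `sep` and ignores empty labels,
# so the empty row becomes a row of zeros.
shortcut = cleaned.str.get_dummies(sep=",")
print(shortcut)
```

This avoids the intermediate `stack` and regrouping, at the cost of an extra string-cleaning pass.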
cs95 answered Oct 18 '22 15:10