Convert pandas DataFrame column of comma separated strings to one-hot encoded

I have a large dataframe ('data') made up of one column. Each row in the column is a string of comma-separated categories. I wish to one-hot encode this data.

For example,

data = {"mesh": ["A, B, C", "C,B", ""]}

From this I would like to get a dataframe consisting of:

index      A       B       C
0          1       1       1
1          0       1       1
2          0       0       0

How can I do this?

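Note that this isn't strictly one-hot encoding: a row can belong to several categories at once, so the result is a multi-hot (indicator) matrix.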

`str.split` + `stack` + `get_dummies` + `sum`

import pandas as pd

df = pd.DataFrame(data)
df

      mesh
0  A, B, C
1      C,B
2         

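# Split on commas (ignoring surrounding whitespace), one column per position
(df.mesh.str.split(r'\s*,\s*', expand=True)
   .stack()                   # long format: one category per row
   .str.get_dummies()         # indicator columns; '' yields all zeros
   .groupby(level=0).sum())   # replaces .sum(level=0), removed in pandas 2.0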

   A  B  C
0  1  1  1
1  0  1  1
2  0  0  0
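
For reference, the same split-then-aggregate idea can be sketched with `Series.explode` in place of `expand=True` + `stack`. This is only an illustration under assumptions: the helper name `multi_hot` is hypothetical (it is not part of pandas), and `explode` requires pandas 0.25 or later.

```python
import pandas as pd


def multi_hot(series: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: turn comma-separated strings into indicator columns."""
    # Split on commas, tolerating surrounding whitespace; each row becomes a list.
    parts = series.str.strip().str.split(r"\s*,\s*")
    # One list element per row, with the original row label repeated.
    exploded = parts.explode()
    # Indicator columns per element, then collapse back to one row per label.
    return exploded.str.get_dummies().groupby(level=0).sum()


df = pd.DataFrame({"mesh": ["A, B, C", "C,B", ""]})
result = multi_hot(df["mesh"])
print(result)
```

If a row may list the same category more than once, swapping `.sum()` for `.max()` keeps every value at 0 or 1.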

`apply` + `value_counts`

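(df.mesh.str.split(r'\s*,\s*', expand=True)
   .apply(pd.Series.value_counts, axis=1)   # count each category per row
   .drop(columns='', errors='ignore')       # discard the empty-string column
   .fillna(0)
   .astype(int))                            # fillna(downcast=...) is deprecated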

   A  B  C
0  1  1  1
1  0  1  1
2  0  0  0

`pd.crosstab`

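x = df.mesh.str.split(r'\s*,\s*', expand=True).stack()
# Rows come from the original index, columns from the category values;
# the empty-string column is dropped afterwards.
pd.crosstab(x.index.get_level_values(0), x.values).drop(columns='', errors='ignore')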

col_0  A  B  C
row_0         
0      1  1  1
1      0  1  1
2      0  0  0
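
As a compact alternative to the three approaches above, `Series.str.get_dummies` accepts a `sep` argument, so it can split and encode in one call once the whitespace around the commas has been normalized. A minimal sketch, assuming the categories themselves never contain commas:

```python
import pandas as pd

df = pd.DataFrame({"mesh": ["A, B, C", "C,B", ""]})

# Normalize "A, B, C" to "A,B,C" so a plain "," can serve as the separator.
normalized = df["mesh"].str.strip().str.replace(r"\s*,\s*", ",", regex=True)

# str.get_dummies splits each string on `sep` and builds 0/1 indicator columns;
# the empty string yields a row of zeros.
encoded = normalized.str.get_dummies(sep=",")
print(encoded)
```

This avoids the intermediate wide or stacked frame, which may matter on a large column, though it is worth timing against the other methods on real data.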

Tags: python, pandas, dataframe

scutnex

cs95
