Converting a Pandas Dataframe column into one hot labels

I have a pandas dataframe similar to this:

  Col1   ABC
0  XYZ    A
1  XYZ    B
2  XYZ    C

By using the pandas get_dummies() function on column ABC, I can get this:

  Col1   A   B   C
0  XYZ   1   0   0
1  XYZ   0   1   0
2  XYZ   0   0   1

What I need instead is something like this, where the ABC column holds a list or array:

  Col1    ABC
0  XYZ    [1,0,0]
1  XYZ    [0,1,0]
2  XYZ    [0,0,1]

I tried using the get_dummies function and then combining all the resulting columns into the single column I wanted. I found a lot of answers explaining how to combine multiple columns as strings, such as "Combine two columns of text in dataframe in pandas/python", but I cannot figure out a way to combine them as a list.

The question "How do I one-hot encode one column of a pandas dataframe?" introduced the idea of using sklearn's OneHotEncoder, but I couldn't get it to work.

One more thing: all the answers I came across required the column names to be typed manually when combining them. Is there a way to use DataFrame.iloc or some other slicing mechanism to combine columns into a list?

Nir_J asked Nov 05 '17 22:11


3 Answers

You can just use tolist():

df['ABC'] = pd.get_dummies(df.ABC).values.tolist()

  Col1        ABC
0  XYZ  [1, 0, 0]
1  XYZ  [0, 1, 0]
2  XYZ  [0, 0, 1]
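
On recent pandas releases (2.0 and later), get_dummies returns boolean columns by default, so the one-liner above would produce lists of True/False rather than 1/0. The following is a minimal sketch, assuming the three-row frame from the question, that passes dtype=int to keep integer values. The `categories` variable is only an illustrative name for remembering which list position belongs to which label.

```python
import pandas as pd

# Frame rebuilt from the question for a self-contained example.
df = pd.DataFrame({"Col1": ["XYZ", "XYZ", "XYZ"], "ABC": ["A", "B", "C"]})

# dtype=int requests 0/1 integers regardless of the pandas default.
dummies = pd.get_dummies(df["ABC"], dtype=int)

# Keep the column order so each list position can be mapped back to a label.
categories = dummies.columns.tolist()

# to_numpy() gives a 2-D array; tolist() turns it into one Python list per row.
df["ABC"] = dummies.to_numpy().tolist()

print(df)
```

Because the columns come straight from get_dummies, no column names need to be typed by hand.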
andrew_reece answered Oct 21 '22 11:10


Here is an example of using sklearn.preprocessing.LabelBinarizer:

In [361]: from sklearn.preprocessing import LabelBinarizer

In [362]: lb = LabelBinarizer()

In [363]: df['new'] = lb.fit_transform(df['ABC']).tolist()

In [364]: df
Out[364]:
  Col1 ABC        new
0  XYZ   A  [1, 0, 0]
1  XYZ   B  [0, 1, 0]
2  XYZ   C  [0, 0, 1]
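
As a sketch of how the fitted binarizer can be used afterwards, assuming the same three-row frame and that scikit-learn is installed: the classes_ attribute records which label each list position stands for, and inverse_transform maps the one-hot rows back to labels. The variable names `encoded`, `labels`, and `decoded` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

df = pd.DataFrame({"Col1": ["XYZ", "XYZ", "XYZ"], "ABC": ["A", "B", "C"]})

lb = LabelBinarizer()
# With three or more classes this is a 2-D integer array, one column per class.
encoded = lb.fit_transform(df["ABC"])

df["new"] = encoded.tolist()

# classes_ holds the label that corresponds to each position in the lists.
labels = lb.classes_.tolist()

# inverse_transform recovers the original labels from the one-hot rows.
decoded = lb.inverse_transform(encoded).tolist()

print(labels, decoded)
```

One caveat: when the column contains exactly two distinct labels, LabelBinarizer returns a single 0/1 column rather than two, so the lists would have length one.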

Pandas alternative:

In [370]: df['new'] = df['ABC'].str.get_dummies().values.tolist()

In [371]: df
Out[371]:
  Col1 ABC        new
0  XYZ   A  [1, 0, 0]
1  XYZ   B  [0, 1, 0]
2  XYZ   C  [0, 0, 1]
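
Note that Series.str.get_dummies differs from pd.get_dummies in that it splits each string on a separator ("|" by default) before encoding, so it can also produce multi-hot lists. A small sketch with hypothetical multi-label data (the `tags` series is not from the question):

```python
import pandas as pd

# Hypothetical data: the first row carries two labels joined by "|".
tags = pd.Series(["A|B", "B", "C"])

# Each string is split on "|" and every distinct token becomes a 0/1 column.
multi_hot = tags.str.get_dummies()

rows = multi_hot.to_numpy().tolist()
print(rows)
```

For single-label columns like ABC in the question, both functions give the same 0/1 pattern.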
MaxU - stop WAR against UA answered Oct 21 '22 13:10


If you have a pd.DataFrame like this:

>>> df
  Col1  A  B  C
0  XYZ  1  0  0
1  XYZ  0  1  0
2  XYZ  0  0  1

You can always do something like this:

>>> df.apply(lambda s: list(s[1:]), axis=1)
0    [1, 0, 0]
1    [0, 1, 0]
2    [0, 0, 1]
dtype: object

Note that this is essentially a for-loop over the rows. Also note that a column cannot have a list dtype; it is stored with object dtype, so operations on that column cannot take advantage of NumPy's speed.
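
A possible way to avoid the row-wise apply, and to select the columns by position rather than by name, is to slice with iloc and convert the block in one step. This is a sketch under the assumption that the one-hot columns are everything after the first column; `onehot` and `combined` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {"Col1": ["XYZ", "XYZ", "XYZ"], "A": [1, 0, 0], "B": [0, 1, 0], "C": [0, 0, 1]}
)

# Select every column after the first by position; no names are typed out.
onehot = df.iloc[:, 1:].to_numpy()  # 2-D integer array of shape (3, 3)

# Build the combined frame: keep Col1 and store one list per row.
combined = df[["Col1"]].copy()
combined["ABC"] = onehot.tolist()

print(combined)
print(combined["ABC"].dtype)  # the list column is stored as object dtype
```

If the lists are only needed as input to a numerical routine, keeping the 2-D `onehot` array alongside the frame preserves the numeric dtype.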

juanpa.arrivillaga answered Oct 21 '22 11:10