Using Scikit-Learn OneHotEncoder with a Pandas DataFrame

I'm trying to replace a column of strings in a Pandas DataFrame with its one-hot encoded equivalent using Scikit-Learn's OneHotEncoder. My code below doesn't work:

from sklearn.preprocessing import OneHotEncoder
# data is a Pandas DataFrame

jobs_encoder = OneHotEncoder()
jobs_encoder.fit(data['Profession'].unique().reshape(1, -1))
data['Profession'] = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))

It produces the following error (strings in the list are omitted):

ValueError                                Traceback (most recent call last)
<ipython-input-91-3a1f568322f5> in <module>()
      3 jobs_encoder = OneHotEncoder()
      4 jobs_encoder.fit(data['Profession'].unique().reshape(1, -1))
----> 5 data['Profession'] = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))

/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in transform(self, X)
    730                                        copy=True)
    731         else:
--> 732             return self._transform_new(X)
    734     def inverse_transform(self, X):

/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform_new(self, X)
    678         """New implementation assuming categorical input"""
    679         # validation of X happens in _check_X called by _transform
--> 680         X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
    682         n_samples, n_features = X_int.shape

/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)
    120                     msg = ("Found unknown categories {0} in column {1}"
    121                            " during transform".format(diff, i))
--> 122                     raise ValueError(msg)
    123                 else:
    124                     # Set the problematic rows to an acceptable value and

ValueError: Found unknown categories ['...', ..., '...'] in column 0 during transform

Here's some sample data:

data['Profession'] =

0         unkn
1         safe
2         rece
3         unkn
4         lead
111988    indu
111989    seni
111990    mess
111991    seni
111992    proj
Name: Profession, Length: 111993, dtype: object

What exactly am I doing wrong?

3 Answers

OneHotEncoder encodes categorical features as a one-hot numeric array. Its transform method returns a sparse matrix if sparse=True (the default; the parameter was renamed sparse_output in scikit-learn 1.2), otherwise it returns a 2-D array.

You can't assign a 2-D array (or sparse matrix) to a single Pandas Series. You must create a Pandas Series (a column in a Pandas DataFrame) for each category.

I would recommend pandas.get_dummies instead:

data = pd.get_dummies(data, prefix=['Profession'], columns=['Profession'], drop_first=True)
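
To make the effect of that call concrete, here is a minimal sketch on a toy DataFrame. The sample values and the extra Age column are made up for illustration and are not the asker's data. It also shows what drop_first=True changes: the first category in sorted order gets no indicator column.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's DataFrame
data = pd.DataFrame({
    "Profession": ["unkn", "safe", "rece", "unkn", "lead"],
    "Age": [31, 45, 28, 52, 39],
})

# One indicator column per category; the original column is removed
encoded = pd.get_dummies(data, columns=["Profession"], prefix="Profession")

# drop_first=True omits the first category in sorted order ("lead" here),
# which avoids perfectly collinear columns in linear models
encoded_reduced = pd.get_dummies(
    data, columns=["Profession"], prefix="Profession", drop_first=True
)

print(list(encoded.columns))
print(list(encoded_reduced.columns))
```

Keep in mind that get_dummies only knows the categories present in the frame it is given, so applying it separately to training and test data can produce different columns.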


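Using Sklearn OneHotEncoder:

The error in the question comes from fitting on reshape(1, -1), which hands the encoder a single row in which every unique value is a separate feature. Fit on a column of shape (n_samples, 1) instead:

jobs_encoder = OneHotEncoder()
transformed = jobs_encoder.fit_transform(data['Profession'].to_numpy().reshape(-1, 1))
# Create a Pandas DataFrame of the one-hot encoded column
# (transform returns a sparse matrix by default, so convert it to a dense array;
# on scikit-learn < 1.0 use get_feature_names() instead of get_feature_names_out())
ohe_df = pd.DataFrame(transformed.toarray(), columns=jobs_encoder.get_feature_names_out(), index=data.index)
# Concatenate with the original data and drop the source column
data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)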
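
As a self-contained illustration of why the shape passed to fit matters, the sketch below compares reshape(1, -1) with reshape(-1, 1) on a few made-up values. The toy data is an assumption. Column names are built from the fitted categories_ attribute so that the example does not depend on which feature-name method a given scikit-learn version offers.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the question's column
data = pd.DataFrame({"Profession": ["unkn", "safe", "rece", "unkn", "lead"]})
column = data["Profession"].to_numpy().reshape(-1, 1)  # shape (5, 1)

unique_values = data["Profession"].unique()
wrong_shape = unique_values.reshape(1, -1)  # (1, 4): one sample, four features
right_shape = unique_values.reshape(-1, 1)  # (4, 1): four samples, one feature

# Fitted on the wrong shape, the encoder learns four features,
# each with exactly one known category
wrong_encoder = OneHotEncoder().fit(wrong_shape)
n_wrong_features = len(wrong_encoder.categories_)

# Transforming a single column with that encoder fails
try:
    wrong_encoder.transform(column)
    raised = False
except ValueError:
    raised = True

# Fitted on the right shape, the encoder learns one feature with four categories
jobs_encoder = OneHotEncoder().fit(right_shape)
transformed = jobs_encoder.transform(column)

ohe_df = pd.DataFrame(
    transformed.toarray(),  # the default output is a sparse matrix
    columns=[f"Profession_{c}" for c in jobs_encoder.categories_[0]],
    index=data.index,
)
print(ohe_df)
```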

Other options: if you are doing hyperparameter tuning with GridSearchCV, it is recommended to use ColumnTransformer and FeatureUnion with Pipeline, or make_column_transformer directly.
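
A rough sketch of the pipeline approach follows. The toy data, the Age column, the labels, and the choice of LogisticRegression are all assumptions made for illustration; any estimator could take its place. It assumes a scikit-learn version in which make_column_transformer takes (transformer, columns) tuples.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data and labels
X = pd.DataFrame({
    "Profession": ["unkn", "safe", "rece", "unkn", "lead", "safe"],
    "Age": [31, 45, 28, 52, 39, 47],
})
y = [0, 1, 0, 0, 1, 1]

# One-hot encode Profession and pass the remaining columns through unchanged
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Profession"]),
    remainder="passthrough",
)

# The encoder is fitted inside the pipeline, so a grid search would refit it
# on each training fold rather than on the full data
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
model.fit(X, y)

# "proj" was never seen during fit; handle_unknown="ignore" encodes it as all zeros
X_new = pd.DataFrame({"Profession": ["proj"], "Age": [40]})
prediction = model.predict(X_new)
print(prediction)
```

The fitted pipeline can be passed to GridSearchCV like any other estimator.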

It turned out that Scikit-Learn's LabelBinarizer gave me better luck in converting the data to one-hot encoded format. With help from Amnie's solution, my final code is as follows:

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

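jobs_encoder = LabelBinarizer()
# The binarizer must be fitted before it can transform, so use fit_transform
transformed = jobs_encoder.fit_transform(data['Profession'])
# Reuse the original index so that concat aligns the rows correctly
ohe_df = pd.DataFrame(transformed, index=data.index)
data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)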
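
For reference, a small runnable sketch of the LabelBinarizer approach on made-up data (the values and the Age column are assumptions). It names the new columns after the fitted classes_ attribute. Note that LabelBinarizer is designed for target labels, and with only two distinct values it returns a single column, so naming columns from classes_ as done here only works with three or more categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy data standing in for the question's DataFrame
data = pd.DataFrame({
    "Profession": ["unkn", "safe", "rece", "unkn", "lead"],
    "Age": [31, 45, 28, 52, 39],
})

jobs_encoder = LabelBinarizer()
transformed = jobs_encoder.fit_transform(data["Profession"])  # dense array of 0/1

# classes_ holds the categories in sorted order, one per output column
ohe_df = pd.DataFrame(transformed, columns=jobs_encoder.classes_, index=data.index)

data = pd.concat([data.drop(columns=["Profession"]), ohe_df], axis=1)
print(data)
```
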
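
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# X_train, X_valid and low_cardinality_cols are assumed to be defined already
# Apply one-hot encoder to each column with categorical data
# (on scikit-learn >= 1.2 pass sparse_output=False instead of sparse=False)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))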

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
numeric_X_train = X_train.drop(low_cardinality_cols, axis=1)
numeric_X_valid = X_valid.drop(low_cardinality_cols, axis=1)

# Add one-hot encoded columns to numerical features
new_X_train = pd.concat([numeric_X_train, OH_cols_train], axis=1)
new_X_valid = pd.concat([numeric_X_valid, OH_cols_valid], axis=1)
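
The same train/validation pattern can be tried end to end with the sketch below. The two small DataFrames, their index values, and the Age column are made up for illustration. To stay independent of the sparse / sparse_output rename, it keeps the default sparse output and converts with toarray(), and it builds string column names from categories_ so the combined frame does not mix integer and string column labels.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and validation frames with non-default indexes
X_train = pd.DataFrame(
    {"Profession": ["unkn", "safe", "rece", "lead"], "Age": [31, 45, 28, 39]},
    index=[10, 11, 12, 13],
)
X_valid = pd.DataFrame(
    {"Profession": ["safe", "proj"], "Age": [50, 23]},  # "proj" is unseen in training
    index=[20, 21],
)
low_cardinality_cols = ["Profession"]

# Fit on the training data only, then reuse the fitted encoder on validation data
OH_encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = OH_encoder.fit_transform(X_train[low_cardinality_cols]).toarray()
valid_encoded = OH_encoder.transform(X_valid[low_cardinality_cols]).toarray()

# Build readable, string-typed column names from the fitted categories
names = [
    f"{col}_{cat}"
    for col, cats in zip(low_cardinality_cols, OH_encoder.categories_)
    for cat in cats
]

# Restore the original indexes so concat aligns rows correctly
OH_cols_train = pd.DataFrame(train_encoded, columns=names, index=X_train.index)
OH_cols_valid = pd.DataFrame(valid_encoded, columns=names, index=X_valid.index)

new_X_train = pd.concat([X_train.drop(columns=low_cardinality_cols), OH_cols_train], axis=1)
new_X_valid = pd.concat([X_valid.drop(columns=low_cardinality_cols), OH_cols_valid], axis=1)
print(new_X_valid)
```

With handle_unknown="ignore", the validation row whose category was never seen during fit gets zeros in every indicator column instead of raising an error.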
