Using Scikit-Learn OneHotEncoder with a Pandas DataFrame

I'm trying to replace a column of strings in a Pandas DataFrame with its one-hot encoded equivalent using Scikit-Learn's OneHotEncoder. My code below doesn't work:

from sklearn.preprocessing import OneHotEncoder
# data is a Pandas DataFrame

jobs_encoder = OneHotEncoder()
jobs_encoder.fit(data['Profession'].unique().reshape(1, -1))
data['Profession'] = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))

It produces the following error (strings in the list are omitted):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-91-3a1f568322f5> in <module>()
      3 jobs_encoder = OneHotEncoder()
      4 jobs_encoder.fit(data['Profession'].unique().reshape(1, -1))
----> 5 data['Profession'] = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))

/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in transform(self, X)
    730                                        copy=True)
    731         else:
--> 732             return self._transform_new(X)
    733 
    734     def inverse_transform(self, X):

/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform_new(self, X)
    678         """New implementation assuming categorical input"""
    679         # validation of X happens in _check_X called by _transform
--> 680         X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
    681 
    682         n_samples, n_features = X_int.shape

/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)
    120                     msg = ("Found unknown categories {0} in column {1}"
    121                            " during transform".format(diff, i))
--> 122                     raise ValueError(msg)
    123                 else:
    124                     # Set the problematic rows to an acceptable value and

ValueError: Found unknown categories ['...', ..., '...'] in column 0 during transform

Here's some sample data:

data['Profession'] =

0         unkn
1         safe
2         rece
3         unkn
4         lead
          ... 
111988    indu
111989    seni
111990    mess
111991    seni
111992    proj
Name: Profession, Length: 111993, dtype: object

What exactly am I doing wrong?

633

asked Sep 25 '19 14:09

dd.

3 Answers

OneHotEncoder encodes categorical features as a one-hot numeric array. Its transform method returns a sparse matrix if sparse=True (the default); otherwise it returns a 2-D array.

You can't assign a 2-D array (or sparse matrix) to a single Pandas Series. You must create a Pandas Series (a column in a Pandas DataFrame) for each category.

I would recommend pandas.get_dummies instead:

import pandas as pd
data = pd.get_dummies(data, prefix=['Profession'], columns=['Profession'], drop_first=True)
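
To make the effect of drop_first concrete, here is a small sketch on a toy frame. The Age column and the five sample rows are made up for illustration; only the Profession column name comes from the question.

```python
import pandas as pd

# Toy data: 'Age' and these rows are illustrative assumptions
data = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38],
    "Profession": ["unkn", "safe", "rece", "unkn", "lead"],
})

# One indicator column per category, minus the first one (alphabetically)
encoded = pd.get_dummies(
    data, prefix=["Profession"], columns=["Profession"], drop_first=True
)
print(list(encoded.columns))
```

With drop_first=True the first category in sorted order ('lead' here) gets no column of its own; a row with all indicators off stands for it. Leave drop_first out if you want one column per category.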

EDIT:

Using Sklearn OneHotEncoder:

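# Fit on a 2-D array of shape (n_samples, 1): one feature with many categories
jobs_encoder = OneHotEncoder()
transformed = jobs_encoder.fit_transform(data['Profession'].to_numpy().reshape(-1, 1))
# Create a Pandas DataFrame of the one-hot encoded column (transform returns a sparse matrix by default)
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
ohe_df = pd.DataFrame(transformed.toarray(), columns=jobs_encoder.get_feature_names(), index=data.index)
# Concatenate with the original data and drop the original column
data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)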
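
As far as I can tell, the ValueError in the question comes from the shape passed to fit: reshape(1, -1) produces a single row with one column per unique profession, so the encoder learns many features that each have exactly one category. Column 0 then only knows the first unique value and rejects everything else during transform. A minimal sketch on made-up rows, comparing the two shapes:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy rows for illustration; only the column name comes from the question
data = pd.DataFrame({"Profession": ["unkn", "safe", "rece", "unkn", "lead"]})
unique_jobs = np.asarray(data["Profession"].unique(), dtype=object)

# Shape (1, n_unique): n_unique features, one known category each
wrong = OneHotEncoder().fit(unique_jobs.reshape(1, -1))
# Shape (n_unique, 1): one feature with n_unique known categories
right = OneHotEncoder().fit(unique_jobs.reshape(-1, 1))

n_features_wrong = len(wrong.categories_)
n_features_right = len(right.categories_)
print(n_features_wrong, n_features_right)

# The correctly fitted encoder can transform the whole column
column = data[["Profession"]].to_numpy(dtype=object)
encoded = right.transform(column).toarray()
print(encoded.shape)
```

Passing the column as a one-column DataFrame, data[['Profession']], gives the (n_samples, 1) shape directly and avoids the reshape.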

Other options: if you are doing hyperparameter tuning with GridSearchCV, it's recommended to use ColumnTransformer and FeatureUnion with Pipeline, or make_column_transformer directly.
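
A rough sketch of what that could look like, assuming a toy dataset: the Age column, the target y, the LogisticRegression model and the C grid are all placeholders chosen for illustration, not part of the question.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy features and target (illustrative assumptions)
X = pd.DataFrame({
    "Profession": ["unkn", "safe", "rece", "unkn", "lead", "safe", "rece", "lead"],
    "Age": [25, 32, 47, 51, 38, 29, 44, 36],
})
y = [0, 1, 0, 1, 0, 1, 0, 1]

# One-hot encode 'Profession', pass the remaining columns through unchanged.
# handle_unknown="ignore" matters here: a CV fold may miss some categories.
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Profession"]),
    remainder="passthrough",
)

# make_pipeline names each step after its lowercased class name
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
search = GridSearchCV(model, {"logisticregression__C": [0.1, 1.0]}, cv=2)
search.fit(X, y)

predictions = search.predict(X)
print(search.best_params_)
```

Because the encoder lives inside the pipeline, it is refit on the training part of every fold, so the validation part never leaks into the learned categories.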

195

answered Oct 14 '22 01:10

Amine Benatmane

It turned out that Scikit-Learn's LabelBinarizer worked better for converting the data to one-hot encoded format. With help from Amine's solution, my final code is as follows:

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

jobs_encoder = LabelBinarizer()
jobs_encoder.fit(data['Profession'])
transformed = jobs_encoder.transform(data['Profession'])
ohe_df = pd.DataFrame(transformed)
data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)
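
A possible refinement, sketched on made-up rows: the code above yields columns named 0, 1, 2, ... and assumes data has a default integer index. Passing the encoder's classes_ as column names and reusing data.index keeps the result readable and aligned. The sample rows and the custom index below are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Toy rows with a non-default index (illustrative assumptions)
data = pd.DataFrame(
    {"Profession": ["unkn", "safe", "rece", "unkn", "lead"]},
    index=[3, 4, 5, 6, 7],
)

jobs_encoder = LabelBinarizer()
transformed = jobs_encoder.fit_transform(data["Profession"].tolist())

# classes_ holds the sorted category labels, one per output column.
# Caveat: with only two classes LabelBinarizer emits a single column,
# so this naming only works for three or more categories.
ohe_df = pd.DataFrame(transformed, columns=jobs_encoder.classes_, index=data.index)
data = pd.concat([data.drop(columns=["Profession"]), ohe_df], axis=1)
print(data)
```

Without index=data.index, pd.concat would align on mismatched labels and introduce NaN rows whenever the original frame does not use a 0..n-1 index.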

answered Oct 13 '22 23:10

dd.

Below is an approach suggested by Kaggle Learn. I don't think there is currently a simpler way to go from an original pandas DataFrame to a one-hot encoded DataFrame.

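import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# X_train, X_valid and low_cardinality_cols are assumed to be defined already
# Apply one-hot encoder to each column with categorical data
# (scikit-learn 1.2+ renames the `sparse` argument to `sparse_output`)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)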
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
numeric_X_train = X_train.drop(low_cardinality_cols, axis=1)
numeric_X_valid = X_valid.drop(low_cardinality_cols, axis=1)

# Add one-hot encoded columns to numerical features
new_X_train = pd.concat([numeric_X_train, OH_cols_train], axis=1)
new_X_valid = pd.concat([numeric_X_valid, OH_cols_valid], axis=1)
print(new_X_train)

answered Oct 14 '22 00:10

Kris Stern

Tags: python, pandas, machine-learning, one-hot-encoding, scikit-learn
