
One-Hot-Encode categorical variables and scale continuous ones simultaneously


I'm confused: if you apply OneHotEncoder first and then StandardScaler, the scaler will also scale the columns that OneHotEncoder just produced. Is there a way to perform encoding and scaling at the same time, each on its own columns, and then concatenate the results?
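
The concern can be illustrated with a minimal sketch. The `colors` and `heights` arrays are made-up toy data (not from any real dataset), and the sketch only assumes that `OneHotEncoder` returns a sparse matrix by default, which is converted with `toarray()`:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical column and one continuous column.
colors = np.array([["red"], ["green"], ["red"], ["blue"]])
heights = np.array([[150.0], [160.0], [170.0], [180.0]])

# Step 1: one-hot encode the categorical column (dense output via toarray()).
encoded = OneHotEncoder().fit_transform(colors).toarray()

# Step 2: naively scale everything, including the indicator columns.
combined = np.hstack([encoded, heights])
scaled = StandardScaler().fit_transform(combined)

print(encoded[0])     # indicator values are still 0.0 or 1.0 here
print(scaled[0, :3])  # after scaling, the indicators are no longer 0/1
```

After the second step the indicator columns are centered and rescaled like any other numeric column, which is the behavior the question wants to avoid.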

James Wong asked May 05 '17 06:05



2 Answers

Sure thing. Just separately scale and one-hot-encode the separate columns as needed:

    # Import libraries and download example data
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler, OneHotEncoder

    dataset = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
    print(dataset.head(5))

    # Define which columns should be encoded vs scaled
    columns_to_encode = ['rank']
    columns_to_scale  = ['gre', 'gpa']

    # Instantiate encoder/scaler
    # (dense output: sparse_output=False on scikit-learn >= 1.2, sparse=False on older versions)
    scaler = StandardScaler()
    ohe    = OneHotEncoder(sparse_output=False)

    # Scale and Encode Separate Columns
    scaled_columns  = scaler.fit_transform(dataset[columns_to_scale])
    encoded_columns = ohe.fit_transform(dataset[columns_to_encode])

    # Concatenate (Column-Bind) Processed Columns Back Together
    processed_data = np.concatenate([scaled_columns, encoded_columns], axis=1)
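
If the data is later split into training and test sets, one common variation is to fit the scaler and encoder on the training rows only and reuse the fitted objects on the test rows. The following is a sketch under that assumption, not part of the answer's original code; the `train` and `test` frames hold made-up values, and `preprocess` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with the same column names as the example above.
train = pd.DataFrame({"gre": [380, 660, 800, 640],
                      "gpa": [3.61, 3.67, 4.00, 3.19],
                      "rank": [3, 3, 1, 4]})
test = pd.DataFrame({"gre": [520, 760],
                     "gpa": [2.93, 3.00],
                     "rank": [4, 1]})

columns_to_encode = ["rank"]
columns_to_scale = ["gre", "gpa"]

# Fit on the training data only, so test statistics do not leak in.
scaler = StandardScaler().fit(train[columns_to_scale])
ohe = OneHotEncoder(handle_unknown="ignore").fit(train[columns_to_encode])

def preprocess(frame):
    """Hypothetical helper: apply the already-fitted scaler and encoder."""
    scaled = scaler.transform(frame[columns_to_scale])
    encoded = ohe.transform(frame[columns_to_encode]).toarray()
    return np.concatenate([scaled, encoded], axis=1)

train_processed = preprocess(train)
test_processed = preprocess(test)
print(train_processed.shape, test_processed.shape)
```

With `handle_unknown="ignore"`, a category that appears only in the test rows is encoded as all zeros instead of raising an error.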
Max Power answered Oct 27 '22 20:10


Since version 0.20, scikit-learn provides sklearn.compose.ColumnTransformer, which applies different transformers to different columns (see "Column Transformer with Mixed Types" in the documentation). With it you can scale the numeric features and one-hot encode the categorical ones in a single step. Below is the official example:

    # Author: Pedro Morales
    #
    # License: BSD 3 clause

    from __future__ import print_function

    import pandas as pd
    import numpy as np

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, GridSearchCV

    np.random.seed(0)

    # Read data from Titanic dataset.
    titanic_url = ('https://raw.githubusercontent.com/amueller/'
                   'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
    data = pd.read_csv(titanic_url)

    # We will train our classifier with the following features:
    # Numeric Features:
    # - age: float.
    # - fare: float.
    # Categorical Features:
    # - embarked: categories encoded as strings {'C', 'S', 'Q'}.
    # - sex: categories encoded as strings {'female', 'male'}.
    # - pclass: ordinal integers {1, 2, 3}.

    # We create the preprocessing pipelines for both numeric and categorical data.
    numeric_features = ['age', 'fare']
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])

    categorical_features = ['embarked', 'sex', 'pclass']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])

    # Append classifier to preprocessing pipeline.
    # Now we have a full prediction pipeline.
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', LogisticRegression(solver='lbfgs'))])

    X = data.drop('survived', axis=1)
    y = data['survived']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    clf.fit(X_train, y_train)
    print("model score: %.3f" % clf.score(X_test, y_test))

Caution: when ColumnTransformer was introduced in version 0.20 it was marked EXPERIMENTAL, meaning some behaviors could change between releases without deprecation.
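
For the narrower case in the question (only scaling plus encoding, with no imputation or classifier), a smaller sketch may be easier to follow. The `df` frame below is made-up data with Titanic-like column names, and `sparse_threshold=0` is used here only to force a dense NumPy array as output:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; the column names mirror the Titanic example above.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "embarked": ["S", "C", "S", "Q"],
})

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "fare"]),                      # scale only these
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["embarked"]),   # encode only this
    ],
    sparse_threshold=0,  # always return a dense array
)

# Output columns follow the transformer order: scaled numerics, then indicators.
features = preprocessor.fit_transform(df)
print(features.shape)
print(np.round(features, 2))
```

The scaled numeric columns come first and the one-hot columns follow untouched, which is the concatenated result the question asks for.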

NiYanchun answered Oct 27 '22 19:10