One-hot encode categorical variables and scale continuous ones simultaneously


I'm confused: if you first apply OneHotEncoder and then StandardScaler, the scaler will also scale the columns that OneHotEncoder just produced, which is a problem. Is there a way to perform the encoding and the scaling at the same time and then concatenate the results?


asked May 05 '17 06:05

James Wong

2 Answers

Sure thing. Just scale and one-hot encode the relevant columns separately, then join the results:

    # Import libraries and download example data
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler, OneHotEncoder

    dataset = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
    print(dataset.head(5))

    # Define which columns should be encoded vs scaled
    columns_to_encode = ['rank']
    columns_to_scale = ['gre', 'gpa']

    # Instantiate encoder/scaler
    scaler = StandardScaler()
    ohe = OneHotEncoder()  # returns a sparse matrix by default

    # Scale and encode separate columns
    scaled_columns = scaler.fit_transform(dataset[columns_to_scale])
    # .toarray() densifies the encoder output; the original sparse=False
    # argument was renamed to sparse_output in scikit-learn 1.2
    encoded_columns = ohe.fit_transform(dataset[columns_to_encode]).toarray()

    # Concatenate (column-bind) processed columns back together
    processed_data = np.concatenate([scaled_columns, encoded_columns], axis=1)
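One thing the snippet above leaves implicit is how to treat data that arrives later, such as a test split: the scaler and encoder should be fitted once and then reused with `transform`, not refitted. Here is a minimal sketch of that idea. The two small DataFrames contain made-up values, and the `preprocess` helper is a hypothetical name introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frames standing in for a train/test split (illustrative values only).
train = pd.DataFrame({"gre": [380, 660, 800, 640],
                      "gpa": [3.61, 3.67, 4.00, 3.19],
                      "rank": [3, 3, 1, 4]})
new = pd.DataFrame({"gre": [520, 760], "gpa": [2.93, 3.00], "rank": [4, 2]})

columns_to_scale = ["gre", "gpa"]
columns_to_encode = ["rank"]

# Fit both transformers on the training rows only.
scaler = StandardScaler().fit(train[columns_to_scale])
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
ohe = OneHotEncoder(handle_unknown="ignore").fit(train[columns_to_encode])


def preprocess(frame):
    """Scale and encode `frame` with the already-fitted transformers."""
    scaled = scaler.transform(frame[columns_to_scale])
    encoded = ohe.transform(frame[columns_to_encode]).toarray()
    return np.concatenate([scaled, encoded], axis=1)


train_processed = preprocess(train)
new_processed = preprocess(new)
print(train_processed.shape, new_processed.shape)
```

Both outputs have the same column layout (two scaled columns followed by one column per `rank` value seen during fitting), so a model trained on `train_processed` can consume `new_processed` directly.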


answered Oct 27 '22 20:10

Max Power

Since version 0.20, scikit-learn provides sklearn.compose.ColumnTransformer, which applies different transformers to different columns of mixed-type data. With it you can scale the numeric features and one-hot encode the categorical ones in a single step. Below is the official example, "Column Transformer with Mixed Types":

    # Author: Pedro Morales <[email protected]>
    #
    # License: BSD 3 clause

    from __future__ import print_function

    import pandas as pd
    import numpy as np

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, GridSearchCV

    np.random.seed(0)

    # Read data from Titanic dataset.
    titanic_url = ('https://raw.githubusercontent.com/amueller/'
                   'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
    data = pd.read_csv(titanic_url)

    # We will train our classifier with the following features:
    # Numeric Features:
    # - age: float.
    # - fare: float.
    # Categorical Features:
    # - embarked: categories encoded as strings {'C', 'S', 'Q'}.
    # - sex: categories encoded as strings {'female', 'male'}.
    # - pclass: ordinal integers {1, 2, 3}.

    # We create the preprocessing pipelines for both numeric and categorical data.
    numeric_features = ['age', 'fare']
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])

    categorical_features = ['embarked', 'sex', 'pclass']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])

    # Append classifier to preprocessing pipeline.
    # Now we have a full prediction pipeline.
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', LogisticRegression(solver='lbfgs'))])

    X = data.drop('survived', axis=1)
    y = data['survived']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    clf.fit(X_train, y_train)
    print("model score: %.3f" % clf.score(X_test, y_test))

Caution: as of version 0.20 this API is marked EXPERIMENTAL; some behaviors may change between releases without deprecation.
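If you only want to see what the preprocessing step produces, without the imputers, the classifier, or a download, the example above can be reduced to something like the following sketch. The rows are made up for illustration (only the column names mirror the Titanic data), and `sparse_threshold=0` is set so that the result is always a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration; column names mirror the Titanic example.
df = pd.DataFrame({"age": [22.0, 38.0, 26.0, 35.0],
                   "fare": [7.25, 71.28, 7.92, 53.10],
                   "sex": ["male", "female", "female", "male"]})

preprocessor = ColumnTransformer(
    transformers=[
        # Scale only the numeric columns...
        ("num", StandardScaler(), ["age", "fare"]),
        # ...and one-hot encode only the categorical one.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

processed = preprocessor.fit_transform(df)
print(processed.shape)  # (4, 4): two scaled columns, then two one-hot columns
```

The output columns follow the order of the `transformers` list, so the scaled `age` and `fare` come first and the one-hot columns for `sex` come last. The one-hot columns are never passed through the scaler, which is exactly what the question asks for.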

answered Oct 27 '22 19:10

NiYanchun
