I'm trying to convert a categorical value (in my case the Country column) into an encoded value using LabelEncoder and then OneHotEncoder, and I was able to convert the categorical value. But I'm getting a warning that OneHotEncoder's 'categorical_features' keyword is deprecated, "use the ColumnTransformer instead." So how can I use ColumnTransformer to achieve the same result?
Below are my input data set and the code I tried.
Input data set:

```
Country   Age  Salary
France    44   72000
Spain     27   48000
Germany   30   54000
Spain     38   61000
Germany   40   67000
France    35   58000
Spain     26   52000
France    48   79000
Germany   50   83000
France    37   67000
```

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# x is my dataset variable name
label_encoder = LabelEncoder()
# LabelEncoder is used to encode the country value
x.iloc[:, 0] = label_encoder.fit_transform(x.iloc[:, 0])

hot_encoder = OneHotEncoder(categorical_features=[0])
x = hot_encoder.fit_transform(x).toarray()
```
The output I'm getting is shown below. How can I get the same output with ColumnTransformer?
```
0 (France)  1 (Germany)  2 (Spain)  3 (Age)  4 (Salary)
1           0            0          44       72000
0           0            1          27       48000
0           1            0          30       54000
0           0            1          38       61000
0           1            0          40       67000
1           0            0          35       58000
0           0            1          26       52000
1           0            0          48       79000
0           1            0          50       83000
1           0            0          37       67000
```
I tried the following code:
```python
from sklearn.compose import ColumnTransformer, make_column_transformer

preprocess = make_column_transformer(([0], OneHotEncoder()))
x = preprocess.fit_transform(x).toarray()
```
I was able to encode the country column with the above code, but the Age and Salary columns are missing from the x variable after transforming.
The ColumnTransformer is a class in the scikit-learn Python machine learning library that allows you to selectively apply data preparation transforms.
ColumnTransformer is a scikit-learn class used to create and apply separate transformers to numerical and categorical data. To create a transformer, you specify the transformer object and pass it inside a tuple along with the columns on which you want to apply the transformation.
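As a minimal sketch of that tuple-based API, here is one transformer per column group applied to made-up data mirroring the question's table (the first three rows; column names are taken from the question):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame mirroring the question's data (first three rows)
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
})

# One (name, transformer, columns) tuple per group of columns
ct = ColumnTransformer(transformers=[
    ("onehot", OneHotEncoder(), ["Country"]),
    ("scale", StandardScaler(), ["Age", "Salary"]),
])
out = ct.fit_transform(df)
print(out.shape)  # 3 country dummy columns + scaled Age + scaled Salary
```

Each tuple's transformer sees only its listed columns, and the transformed groups are concatenated side by side in the output.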
For example, one-hot encoding a single column of seaborn's penguins dataset, which we can load with the load_dataset() function:

```python
# One-hot encoding a single column
from sklearn.preprocessing import OneHotEncoder
from seaborn import load_dataset

df = load_dataset('penguins')
ohe = OneHotEncoder()
transformed = ohe.fit_transform(df[['island']])
print(transformed.toarray())
```
Python scikit-learn provides a Pipeline utility to help automate machine learning workflows. Pipelines allow a linear sequence of data transforms to be chained together, culminating in a modeling step that can be evaluated.
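A minimal sketch of that chaining, using made-up numeric data (the variable names and toy values are illustrative, not from the question):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),   # transform step
    ("clf", LogisticRegression()),  # final estimator
])
pipe.fit(X, y)           # scales the data, then fits the classifier
print(pipe.predict(X))   # one prediction per input row
```

Calling fit() runs each transform in order before fitting the final estimator; predict() routes new data through the same fitted transforms.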
It is a bit strange to one-hot encode continuous data such as Salary. It makes no sense unless you have binned your salary into certain ranges/categories. If I were you I would do:
```python
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ['Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['Age', 'Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
```
From here you can pipe it into a classifier, e.g.:
```python
from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])
```
Use it like so:

```python
clf.fit(X_train, y_train)
```

This will apply the preprocessor and then pass the transformed data to the predictor.
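To make that flow concrete, here is a self-contained sketch with made-up training data (the Country/Age frame and the target are invented for illustration); note how predict() takes raw, untransformed rows because the fitted preprocessor inside the pipeline handles the encoding:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Hypothetical toy task: predict a binary label from Country and Age
X_train = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "France", "Germany"],
    "Age": [44, 27, 30, 38, 35, 50],
})
y_train = [1, 0, 0, 1, 0, 1]

pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"])],
    remainder="passthrough",
)
clf = Pipeline([("preprocessor", pre),
                ("classifier", LogisticRegression(solver="lbfgs"))])
clf.fit(X_train, y_train)

# predict() pushes the raw row through the same fitted preprocessor first
print(clf.predict(pd.DataFrame({"Country": ["Spain"], "Age": [40]})))
```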
If we want to select columns by data type on the fly, we can modify our preprocessor to use a column selector based on dtypes:
```python
import numpy as np
from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include=np.number)),
        ('cat', categorical_transformer, selector(dtype_include="category"))])
```
Using GridSearch
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
    'classifier__solver': ['lbfgs', 'sag'],
}
grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search.fit(X_train, y_train)
```
Getting names of features
```python
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include=np.number)),
        ('cat', categorical_transformer, selector(dtype_include="category"))],
    verbose_feature_names_out=False,  # added this line
)
# now we can access feature names with the step before the estimator:
clf[:-1].get_feature_names_out()
```
I think the poster is not trying to transform Age and Salary. From the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html), ColumnTransformer (and make_column_transformer) returns only the columns specified in the transformers (i.e., [0] in your example). You should set remainder="passthrough" to keep the rest of the columns. In other words:
```python
preprocessor = make_column_transformer(
    (OneHotEncoder(), [0]), remainder="passthrough")
x = preprocessor.fit_transform(x)
```
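Run end to end on a frame mirroring the question's data (a made-up three-row subset for illustration), this keeps all five output columns, dummies first and the passthrough columns after:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

x = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
})

preprocessor = make_column_transformer(
    (OneHotEncoder(), [0]), remainder="passthrough")
out = preprocessor.fit_transform(x)
print(out)  # 3 country dummy columns, then Age and Salary untouched
```

Note that in current scikit-learn the tuple order is (transformer, columns); older releases that still had 'categorical_features' expected (columns, transformer), which is why the question's earlier attempt is written the other way around.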