
How to use sklearn Column Transformer?


I'm trying to convert a categorical value (in my case, the Country column) into an encoded value using LabelEncoder and then OneHotEncoder, and I was able to convert it. But I'm getting a warning that the OneHotEncoder 'categorical_features' keyword is deprecated: "use the ColumnTransformer instead." So how can I use ColumnTransformer to achieve the same result?

Below are my input data set and the code I tried:

Input data set

    Country  Age  Salary
    France   44   72000
    Spain    27   48000
    Germany  30   54000
    Spain    38   61000
    Germany  40   67000
    France   35   58000
    Spain    26   52000
    France   48   79000
    Germany  50   83000
    France   37   67000

Code

    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    # x is my dataset variable name

    # LabelEncoder is used to encode the country value
    label_encoder = LabelEncoder()
    x.iloc[:, 0] = label_encoder.fit_transform(x.iloc[:, 0])
    hot_encoder = OneHotEncoder(categorical_features=[0])
    x = hot_encoder.fit_transform(x).toarray()

This is the output I'm getting. How can I get the same output with ColumnTransformer?

    0(fran)  1(ger)  2(spain)  3(age)  4(salary)
    1        0       0         44      72000
    0        0       1         27      48000
    0        1       0         30      54000
    0        0       1         38      61000
    0        1       0         40      67000
    1        0       0         35      58000
    0        0       1         36      52000
    1        0       0         48      79000
    0        1       0         50      83000
    1        0       0         37      67000

I tried the following code:

    from sklearn.compose import ColumnTransformer, make_column_transformer

    preprocess = make_column_transformer(
        ([0], OneHotEncoder())
    )
    x = preprocess.fit_transform(x).toarray()

I was able to encode the Country column with the above code, but the Age and Salary columns are missing from the x variable after transforming.

chinna g asked Jan 12 '19 14:01




2 Answers

It is a bit strange to encode continuous data such as Salary. It makes no sense unless you have binned the salary into certain ranges/categories. If I were you, I would do:

    import pandas as pd
    import numpy as np

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder

    numeric_features = ['Salary']
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])

    categorical_features = ['Age', 'Country']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])

From here you can pipe it with a classifier, e.g.:

    from sklearn.linear_model import LogisticRegression

    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', LogisticRegression(solver='lbfgs'))])

Use it like so:

    clf.fit(X_train, y_train)

This applies the preprocessor and then passes the transformed data to the predictor.
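As a rough end-to-end sketch of the pipeline idea above, the following runs a preprocessor plus classifier on the sample data from the question. Two things here are assumptions and not part of the original answer: the target `y` is invented purely so there is something to fit, and Age is treated as numeric alongside Salary, with only Country one-hot encoded.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Sample data from the question
X = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, 26, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, 67000,
               58000, 52000, 79000, 83000, 67000],
})
# Hypothetical binary target, invented only for illustration
y = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Assumption: Age and Salary are numeric, Country is categorical
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["Age", "Salary"]),
    ("cat", categorical_transformer, ["Country"]),
])

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(solver="lbfgs")),
])

# fit() runs the preprocessor first, then trains the classifier on its output
clf.fit(X, y)
predictions = clf.predict(X)

# The fitted preprocessor can also be used on its own to inspect its output
transformed = clf.named_steps["preprocessor"].transform(X)
print(transformed.shape)
print(predictions)
```

With real data you would fit on a training split and predict on a held-out split; the sample is too small for that, so the sketch reuses the same rows.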

Updates:

If we want to select columns by data type on the fly, we can modify our preprocessor to use a column selector based on dtypes:

    from sklearn.compose import make_column_selector as selector

    # "number" is the pandas name for all numeric dtypes.
    # "category" matches only columns with the pandas categorical dtype.
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, selector(dtype_include="number")),
            ('cat', categorical_transformer, selector(dtype_include="category"))])
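To check which columns a dtype-based selector would pick up, the selector object can be called directly on a DataFrame. The sketch below uses a three-row excerpt of the question's data (the variable names are illustrative). It relies on the pandas dtype name "number" for numeric columns, and it casts Country to the pandas categorical dtype because a "category" selector does not match plain string columns.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Three-row excerpt of the data from the question
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
})

# Before casting, Country is a plain string column, so nothing matches "category"
before_cast = selector(dtype_include="category")(df)

# Cast to the pandas categorical dtype so the "category" selector can find it
df["Country"] = df["Country"].astype("category")

# Calling a selector on a DataFrame returns the list of matching column names
numeric_columns = selector(dtype_include="number")(df)
categorical_columns = selector(dtype_include="category")(df)

print(before_cast)
print(numeric_columns)
print(categorical_columns)
```

If casting is not convenient, selecting the remaining columns with `dtype_exclude="number"` is another option worth considering.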

Using GridSearch

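    from sklearn.model_selection import GridSearchCV

    # Parameter names follow the pattern <step>__<sub-step>__<parameter>
    param_grid = {
        'preprocessor__num__imputer__strategy': ['mean', 'median'],
        'classifier__C': [0.1, 1.0, 10, 100],
        'classifier__solver': ['lbfgs', 'sag'],
    }

    grid_search = GridSearchCV(clf, param_grid, cv=10)
    grid_search.fit(X_train, y_train)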

Getting names of features

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, selector(dtype_include="number")),
            ('cat', categorical_transformer, selector(dtype_include="category"))],
        verbose_feature_names_out=False,  # added this line
    )

    # After rebuilding clf with this preprocessor and fitting it,
    # we can access the feature names with:
    clf[:-1].get_feature_names_out()  # all steps before the estimator
Prayson W. Daniel answered Oct 12 '22 13:10



I think the poster is not trying to transform Age and Salary. According to the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html), ColumnTransformer (and make_column_transformer) transforms only the columns specified in the transformer (i.e., [0] in your example), and by default the remaining columns are dropped. You should set remainder="passthrough" to keep the rest of the columns. In other words:

    preprocessor = make_column_transformer(
        (OneHotEncoder(), [0]),
        remainder="passthrough")
    x = preprocessor.fit_transform(x)
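For completeness, here is a runnable sketch of the passthrough approach on the sample data from the question. The `sparse_threshold=0` argument is an addition not in the answer above; it is used here so the result is always a dense NumPy array and no `.toarray()` call is needed.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Sample data from the question
x = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, 26, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, 67000,
               58000, 52000, 79000, 83000, 67000],
})

preprocessor = make_column_transformer(
    (OneHotEncoder(), [0]),    # one-hot encode the first column (Country)
    remainder="passthrough",   # keep Age and Salary unchanged
    sparse_threshold=0,        # assumption: force a dense array as output
)
encoded = preprocessor.fit_transform(x)

# Three one-hot columns (France, Germany, Spain) followed by Age and Salary
print(encoded.shape)
print(encoded[:3])
```

No LabelEncoder step is needed first, since OneHotEncoder accepts string categories directly.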
passerby answered Oct 12 '22 14:10
