I'm trying to convert a categorical value (in my case the Country column) into an encoded value using LabelEncoder and then OneHotEncoder, and I was able to convert the categorical value. But I'm getting a warning that OneHotEncoder's 'categorical_features' keyword is deprecated, "use the ColumnTransformer instead." So how can I use ColumnTransformer to achieve the same result?
Below are my input data set and the code I tried.
Input data set:

```
Country   Age  Salary
France    44   72000
Spain     27   48000
Germany   30   54000
Spain     38   61000
Germany   40   67000
France    35   58000
Spain     26   52000
France    48   79000
Germany   50   83000
France    37   67000
```

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# x is my dataset variable name
label_encoder = LabelEncoder()
# LabelEncoder is used to encode the country value
x.iloc[:, 0] = label_encoder.fit_transform(x.iloc[:, 0])

hot_encoder = OneHotEncoder(categorical_features=[0])
x = hot_encoder.fit_transform(x).toarray()
```
The output I'm getting is shown below. How can I get the same output with ColumnTransformer?
```
0 (France)  1 (Germany)  2 (Spain)  3 (Age)  4 (Salary)
1           0            0          44       72000
0           0            1          27       48000
0           1            0          30       54000
0           0            1          38       61000
0           1            0          40       67000
1           0            0          35       58000
0           0            1          26       52000
1           0            0          48       79000
0           1            0          50       83000
1           0            0          37       67000
```
I tried the following code:
```python
from sklearn.compose import ColumnTransformer, make_column_transformer

preprocess = make_column_transformer(([0], OneHotEncoder()))
x = preprocess.fit_transform(x).toarray()
```
I was able to encode the country column with the above code, but the Age and Salary columns are missing from the x variable after transforming.
The ColumnTransformer is a class in the scikit-learn Python machine learning library that allows you to selectively apply data preparation transforms.
ColumnTransformer is a scikit-learn class used to create and apply separate transformers to numerical and categorical data. To create a transformer, you specify the transformer object and pass it inside a tuple along with the columns on which you want to apply the transformation.
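As a minimal sketch of that tuple-based API, here is one transformer per column group applied to made-up data mirroring the question's table (the first three rows; column names are taken from the question):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame mirroring the question's data (first three rows)
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
})

# One (name, transformer, columns) tuple per group of columns
ct = ColumnTransformer(transformers=[
    ("onehot", OneHotEncoder(), ["Country"]),
    ("scale", StandardScaler(), ["Age", "Salary"]),
])
out = ct.fit_transform(df)
print(out.shape)  # 3 country dummy columns + scaled Age + scaled Salary
```

Each tuple's transformer sees only its listed columns, and the transformed groups are concatenated side by side in the output.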
For example, one-hot encoding a single column of seaborn's penguins dataset, which we can load with the load_dataset() function:

```python
# One-hot encoding a single column
from sklearn.preprocessing import OneHotEncoder
from seaborn import load_dataset

df = load_dataset('penguins')
ohe = OneHotEncoder()
transformed = ohe.fit_transform(df[['island']])
print(transformed.toarray())
```
Python scikit-learn provides a Pipeline utility to help automate machine learning workflows. Pipelines allow a linear sequence of data transforms to be chained together, culminating in a modeling step that can be evaluated.
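A minimal sketch of that chaining, using made-up numeric data (the variable names and toy values are illustrative, not from the question):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),   # transform step
    ("clf", LogisticRegression()),  # final estimator
])
pipe.fit(X, y)           # scales the data, then fits the classifier
print(pipe.predict(X))   # one prediction per input row
```

Calling fit() runs each transform in order before fitting the final estimator; predict() routes new data through the same fitted transforms.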
It is a bit strange to one-hot encode continuous data such as Salary. It makes no sense unless you have binned your salary into certain ranges/categories. If I were you I would do:
```python
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ['Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['Age', 'Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
```
From here you can pipe it into a classifier, e.g.:
```python
from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])
```
Use it like so:

```python
clf.fit(X_train, y_train)
```

This will apply the preprocessor and then pass the transformed data to the predictor.
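To make that flow concrete, here is a self-contained sketch with made-up training data (the Country/Age frame and the target are invented for illustration); note how predict() takes raw, untransformed rows because the fitted preprocessor inside the pipeline handles the encoding:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Hypothetical toy task: predict a binary label from Country and Age
X_train = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "France", "Germany"],
    "Age": [44, 27, 30, 38, 35, 50],
})
y_train = [1, 0, 0, 1, 0, 1]

pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"])],
    remainder="passthrough",
)
clf = Pipeline([("preprocessor", pre),
                ("classifier", LogisticRegression(solver="lbfgs"))])
clf.fit(X_train, y_train)

# predict() pushes the raw row through the same fitted preprocessor first
print(clf.predict(pd.DataFrame({"Country": ["Spain"], "Age": [40]})))
```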
If we want to select columns by data type on the fly, we can modify our preprocessor to use a column selector based on dtypes:
```python
import numpy as np
from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include=np.number)),
        ('cat', categorical_transformer, selector(dtype_include="category"))])
```
Using GridSearch
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
    'classifier__solver': ['lbfgs', 'sag'],
}
grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search.fit(X_train, y_train)
```
Getting names of features
```python
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include=np.number)),
        ('cat', categorical_transformer, selector(dtype_include="category"))],
    verbose_feature_names_out=False,  # added this line
)
# now we can access feature names with the step before the estimator:
clf[:-1].get_feature_names_out()
```
I think the poster is not trying to transform Age and Salary. From the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html), ColumnTransformer (and make_column_transformer) returns only the columns specified in the transformers (i.e., [0] in your example). You should set remainder="passthrough" to keep the rest of the columns. In other words:
```python
preprocessor = make_column_transformer(
    (OneHotEncoder(), [0]), remainder="passthrough")
x = preprocessor.fit_transform(x)
```
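Run end to end on a frame mirroring the question's data (a made-up three-row subset for illustration), this keeps all five output columns, dummies first and the passthrough columns after:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

x = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
})

preprocessor = make_column_transformer(
    (OneHotEncoder(), [0]), remainder="passthrough")
out = preprocessor.fit_transform(x)
print(out)  # 3 country dummy columns, then Age and Salary untouched
```

Note that in current scikit-learn the tuple order is (transformer, columns); older releases that still had 'categorical_features' expected (columns, transformer), which is why the question's earlier attempt is written the other way around.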