Sklearn Transformers: How to apply encoder to multiple columns and reuse it in production?

I am using LabelEncoder during training and want to use the same encoder in production by saving it and loading it later. The solutions I have found online only apply a LabelEncoder to a single column at a time, like below:

for col in col_list:
    df[col] = df[[col]].apply(LabelEncoder().fit_transform)

In this case, how do I save the encoder and use it later? I tried fitting it on the entire dataframe, but I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
C:\Users\DA~1\AppData\Local\Temp/ipykernel_3884/730613134.py in <module>
----> 1 l_enc.fit_transform(df_join[le_col].astype(str))

~\anaconda3\envs\ReturnRate\lib\site-packages\sklearn\preprocessing\_label.py in fit_transform(self, y)
    113             Encoded labels.
    114         """
--> 115         y = column_or_1d(y, warn=True)
    116         self.classes_, y = _unique(y, return_inverse=True)
    117         return y

~\anaconda3\envs\ReturnRate\lib\site-packages\sklearn\utils\validation.py in column_or_1d(y, warn)
   1022         return np.ravel(y)
   1023 
-> 1024     raise ValueError(
   1025         "y should be a 1d array, got an array of shape {} instead.".format(shape)
   1026     )

ValueError: y should be a 1d array, got an array of shape (3949037, 14) instead.

I want to fit a label encoder to a dataframe with 10 columns (all categorical), save it, and load it later in production.

asked Nov 16 '21 by dan

2 Answers

First, I want to point out that LabelEncoder is meant for encoding target variables. If you apply LabelEncoder to your predictor variables, you turn the categories into ordered integers (0, 1, 2, 3, and so on), which imposes an ordering that may not make sense.

For categorical predictors you should generally use one-hot encoding (OneHotEncoder).
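For illustration, a minimal one-hot sketch (the toy columns are made up; sparse=False simply returns a dense array instead of a sparse matrix):

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

X = pd.DataFrame({'f1': ['a', 'b', 'c', 'a'],
                  'f2': ['x', 'y', 'x', 'z']})

# handle_unknown='ignore' encodes categories unseen at fit time as all zeros
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
encoded = ohe.fit_transform(X)  # shape (4, 6): one column per category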

If you are sure you want label encoding, it goes like this:

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np

df = pd.DataFrame({'f1': np.random.choice(['a','b','c'], 100),
                   'f2': np.random.choice(['x','y','z'], 100)})

col_list = ['f1','f2']

df[col_list].apply(LabelEncoder().fit_transform)

If you want to retain the encoders, you can store one per column in a dictionary:

le = {}
for col in col_list:
    le[col] = LabelEncoder().fit(df[col].values)

le['f1'].transform(df['f1'])

array([1, 0, 2, 0, 2, 0, 2, 1, 1, 2, 0, 1, 2, 1, 1, 1, 0, 2, 1, 2, 1, 2,
       2, 2, 0, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1,
       0, 1, 1, 1, 2, 2, 1, 0, 2, 1, 2, 2, 2, 1, 0, 0, 2, 2, 0, 1, 2, 2,
       0, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 0, 2, 0, 1, 1, 1, 0, 2, 0, 0, 2,
       0, 1, 1, 2, 1, 0, 0, 2, 0, 1, 1, 2])

for col in col_list:
    df[col] = le[col].transform(df[col])
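Since the goal is to reuse the encoders in production, the dictionary of fitted encoders can be pickled and loaded later; a minimal sketch (the file name is arbitrary):

import pickle

# save the fitted encoders
with open('label_encoders.p', 'wb') as f:
    pickle.dump(le, f)

# later, in production: load them and apply the same mappings
# (df here stands for new production data with the same columns)
with open('label_encoders.p', 'rb') as f:
    le = pickle.load(f)

for col in col_list:
    df[col] = le[col].transform(df[col])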

Again, I would give more thought to whether label encoding is the right choice here.

answered Oct 17 '22 by StupidWolf


As @StupidWolf said, LabelEncoder should be used solely to encode the target variable.

scikit-learn offers multiple ways to encode categorical variables for a feature matrix:

  • OneHotEncoder, which encodes categories as one-hot numeric columns
  • OrdinalEncoder, which encodes categories as ordinal numeric values

OrdinalEncoder performs the same operation as LabelEncoder but for feature values.
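This also explains the error in the question: LabelEncoder.fit_transform expects a 1-d array, while OrdinalEncoder accepts a 2-d input and encodes all columns at once. A minimal sketch:

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

df = pd.DataFrame({'f1': ['a', 'b', 'c'], 'f2': ['x', 'x', 'y']})

enc = OrdinalEncoder()
df[['f1', 'f2']] = enc.fit_transform(df[['f1', 'f2']])  # no shape error on 2-d input
# enc.categories_ lists the learned categories per column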

You can use ColumnTransformer to wrap different preprocessing steps into one object that can later easily be saved with pickle, as in the example below:

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "col_1": ["A", "B", "B", "D", "E", "F", "A", "B"],
        "col_2": ["X", "X", "X", "Y", "Y", "Z", "Z", "X"],
        "col_3": [42] * 8,
    }
)
cols = ["col_1", "col_2"]

pre_processing = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan), cols)
)


df.loc[:, cols] = pre_processing.fit_transform(df[cols])

Which outputs:

   col_1  col_2  col_3
0    0.0    0.0     42
1    1.0    0.0     42
2    1.0    0.0     42
3    2.0    1.0     42
4    3.0    1.0     42
5    4.0    2.0     42
6    0.0    2.0     42
7    1.0    0.0     42
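Because the encoder was fitted with handle_unknown="use_encoded_value", any category not seen during fit is mapped to unknown_value (NaN here) at transform time. A minimal sketch of scoring new data (the values are made up):

new_df = pd.DataFrame(
    {
        "col_1": ["A", "G"],  # "G" was never seen during fit
        "col_2": ["X", "Y"],
        "col_3": [42, 42],
    }
)
new_df.loc[:, cols] = pre_processing.transform(new_df[cols])
# col_1 becomes [0.0, NaN]: "G" maps to the configured unknown_value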

Once the preprocessing is fitted, it can easily be stored and loaded with pickle as follows:

import pickle

# dump the fitted preprocessing
pickle.dump(pre_processing, open("pre_processing.p", "wb"))

# load it back in production
pre_processing = pickle.load(open("pre_processing.p", "rb"))

To conclude, ColumnTransformer has the following benefits:

  • Additional preprocessing for different columns can easily be added (e.g. StandardScaler for numerical values)
  • The preprocessing can be included within a Pipeline to get an end-to-end model that performs prediction (see the sketch below)
  • It is a single object that can easily be serialized with pickle.
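For the Pipeline point, a minimal sketch chaining the preprocessing with an estimator (LogisticRegression and the placeholder target y are illustrative, not part of the original answer):

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# rebuild an unfitted preprocessing step so the pipeline fits from scratch
pre_processing = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan), cols)
)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # placeholder target for the 8 rows

model = make_pipeline(pre_processing, LogisticRegression())
model.fit(df[cols], y)

# the whole pipeline (preprocessing + model) pickles as a single object
pickle.dump(model, open("model.p", "wb"))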
answered Oct 17 '22 by Antoine Dubuis