I am using a label encoder during training and want to use the same encoder in production by saving it and loading it later. The solutions I have found online only apply a LabelEncoder to a single column at a time, like below:
for col in col_list:
    df[col] = df[[col]].apply(LabelEncoder().fit_transform)
In this case, how do I save it and use it later? I tried fitting on the entire dataframe, but I get the following error.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
C:\Users\DA~1\AppData\Local\Temp/ipykernel_3884/730613134.py in <module>
----> 1 l_enc.fit_transform(df_join[le_col].astype(str))
~\anaconda3\envs\ReturnRate\lib\site-packages\sklearn\preprocessing\_label.py in fit_transform(self, y)
113 Encoded labels.
114 """
--> 115 y = column_or_1d(y, warn=True)
116 self.classes_, y = _unique(y, return_inverse=True)
117 return y
~\anaconda3\envs\ReturnRate\lib\site-packages\sklearn\utils\validation.py in column_or_1d(y, warn)
1022 return np.ravel(y)
1023
-> 1024 raise ValueError(
1025 "y should be a 1d array, got an array of shape {} instead.".format(shape)
1026 )
ValueError: y should be a 1d array, got an array of shape (3949037, 14) instead.
I want to fit a label encoder to a dataframe with 10 columns (all categorical), save it, and load it later in production.
First, I want to point out that LabelEncoder is meant for encoding target variables. If you apply LabelEncoder to your predictor variables, you turn them into ordinal integers (0, 1, 2, 3, and so on), which imposes an ordering that may not make sense.
For categorical predictors you should use one-hot encoding.
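A minimal sketch of what that looks like with OneHotEncoder (the `color` column here is made-up data for illustration):

```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green']})

# OneHotEncoder expects a 2-d input, hence df[['color']] not df['color'];
# .toarray() converts the sparse result to a dense array
enc = OneHotEncoder()
encoded = enc.fit_transform(df[['color']]).toarray()

print(enc.categories_)  # [array(['blue', 'green', 'red'], dtype=object)]
print(encoded.shape)    # (4, 3) - one column per category, no implied ordering
```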
If you are sure about LabelEncoder, it goes like this:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
df = pd.DataFrame({'f1': np.random.choice(['a', 'b', 'c'], 100),
                   'f2': np.random.choice(['x', 'y', 'z'], 100)})
col_list = ['f1','f2']
df[col_list].apply(LabelEncoder().fit_transform)
If you want to retain the encoders, you can store them in a dictionary:
le = {}
for col in col_list:
    le[col] = LabelEncoder().fit(df[col].values)
le['f1'].transform(df['f1'])
array([1, 0, 2, 0, 2, 0, 2, 1, 1, 2, 0, 1, 2, 1, 1, 1, 0, 2, 1, 2, 1, 2,
2, 2, 0, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1,
0, 1, 1, 1, 2, 2, 1, 0, 2, 1, 2, 2, 2, 1, 0, 0, 2, 2, 0, 1, 2, 2,
0, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 0, 2, 0, 1, 1, 1, 0, 2, 0, 0, 2,
0, 1, 1, 2, 1, 0, 0, 2, 0, 1, 1, 2])
for col in col_list:
    df[col] = le[col].transform(df[col])
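To answer the save-and-reuse part of the question: since the encoders live in an ordinary dict, the whole dict can be pickled in one go. A self-contained sketch (the file name `label_encoders.p` and the small fixed dataset are arbitrary choices for illustration):

```python
import pickle
from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = pd.DataFrame({'f1': ['a', 'b', 'c', 'a'],
                   'f2': ['x', 'y', 'z', 'x']})
col_list = ['f1', 'f2']

# fit one encoder per column and keep them in a dict
le = {col: LabelEncoder().fit(df[col].values) for col in col_list}

# persist every encoder in a single file
with open('label_encoders.p', 'wb') as f:
    pickle.dump(le, f)

# in production: load the dict and reuse the exact same mappings
with open('label_encoders.p', 'rb') as f:
    le_loaded = pickle.load(f)

for col in col_list:
    df[col] = le_loaded[col].transform(df[col])

print(df['f1'].tolist())  # [0, 1, 2, 0]
```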
Again, I would give more thought to whether it is correct to use label encoding here.
As @StupidWolf said, LabelEncoder should be used solely to encode the target variable.
scikit-learn offers multiple ways to encode categorical variables for the feature vector:
- OneHotEncoder, which encodes categories into one-hot numeric values.
- OrdinalEncoder, which encodes categories into numerical values. OrdinalEncoder performs the same operation as LabelEncoder, but for feature values.
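A minimal illustration of that difference, with made-up data:

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
import pandas as pd

df = pd.DataFrame({'f1': ['a', 'b', 'c'], 'f2': ['x', 'y', 'x']})

# LabelEncoder only accepts a single 1-d array (the target)
y = LabelEncoder().fit_transform(df['f1'])  # array([0, 1, 2])

# OrdinalEncoder accepts a 2-d feature matrix, one mapping per column
X = OrdinalEncoder().fit_transform(df[['f1', 'f2']])
print(X)
# [[0. 0.]
#  [1. 1.]
#  [2. 0.]]
```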
You can use ColumnTransformer to wrap different preprocessing steps into one object that can later easily be saved using pickle, as in the example below:
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "col_1": ["A", "B", "B", "D", "E", "F", "A", "B"],
        "col_2": ["X", "X", "X", "Y", "Y", "Z", "Z", "X"],
        "col_3": [42] * 8,
    }
)
cols = ["col_1", "col_2"]
pre_processeing = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan), cols)
)
df.loc[:, cols] = pre_processeing.fit_transform(df[cols])
Which outputs:
col_1 col_2 col_3
0 0.0 0.0 42
1 1.0 0.0 42
2 1.0 0.0 42
3 2.0 1.0 42
4 3.0 1.0 42
5 4.0 2.0 42
6 0.0 2.0 42
7 1.0 0.0 42
Once the preprocessing is fitted, it can easily be stored and loaded using pickle as follows:
import pickle
#Dump preprocessing
pickle.dump(pre_processeing, open("pre_processing.p", "wb"))
#Load preprocessing
pre_processeing = pickle.load(open("pre_processing.p", "rb"))
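The loaded object can then transform fresh production data, and because handle_unknown="use_encoded_value" was set, a category never seen during fit maps to NaN instead of raising. A self-contained sketch (it refits a small transformer in place of the pickled one, and the unseen category "Q" is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder

# refit a small preprocessor here so the example runs on its own
train = pd.DataFrame({"col_1": ["A", "B"], "col_2": ["X", "Y"]})
cols = ["col_1", "col_2"]
pre_processing = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan), cols)
)
pre_processing.fit(train[cols])

# new production data containing the unseen category "Q"
new_df = pd.DataFrame({"col_1": ["A", "Q"], "col_2": ["Y", "X"]})
out = pre_processing.transform(new_df[cols])
print(out)
# [[ 0.  1.]
#  [nan  0.]]
```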
To conclude, I would advocate that ColumnTransformer has the following benefits:
- It allows different preprocessing for different columns (e.g. StandardScaler for numerical values).
- It can be combined with a Pipeline to build an end-to-end model that performs prediction.
- It can easily be saved and loaded using pickle.
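As a sketch of that end-to-end point, the transformer can be chained with any estimator (LogisticRegression is picked here purely for illustration, on made-up data) and the whole pipeline pickled as a single object:

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "col_1": ["A", "B", "A", "B", "A", "B"],
    "col_2": ["X", "Y", "X", "Y", "Y", "X"],
})
y = [0, 1, 0, 1, 0, 1]

# preprocessing + model in one object
model = make_pipeline(
    make_column_transformer(
        (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
         ["col_1", "col_2"])
    ),
    LogisticRegression(),
)
model.fit(df, y)

# one pickle now covers both the encoding and the prediction step
blob = pickle.dumps(model)
loaded = pickle.loads(blob)
print(loaded.predict(df))
```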