The basic task that I have at hand is:

a) Read some tab-separated data.

b) Do some basic preprocessing.

c) For each categorical column, use LabelEncoder to create a mapping. This is done roughly like this:

```python
mapper = {}
# Convert each categorical column with its own LabelEncoder
for x in categorical_list:
    mapper[x] = preprocessing.LabelEncoder()
for x in categorical_list:
    df[x] = mapper[x].fit_transform(df[x])
```
where df is a pandas DataFrame and categorical_list is a list of column headers that need to be transformed.
d) Train a classifier and save it to disk using pickle
e) Now in a different program, the model saved is loaded.
f) The test data is loaded and the same preprocessing is performed.
g) The LabelEncoders are used to convert the categorical data.
h) The model is used to predict.
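For context, steps d) and e) can be sketched like this, assuming pickle for persistence; the LogisticRegression classifier and the tiny dummy dataset are placeholders for illustration, not part of the original setup:

```python
import pickle

from sklearn.linear_model import LogisticRegression

# d) Train a classifier and save it to disk (dummy data for illustration)
X_train = [[0, 1], [1, 0], [1, 1], [0, 0]]
y_train = [0, 1, 1, 0]
clf = LogisticRegression().fit(X_train, y_train)
with open('model.pkl', 'wb') as f:
    pickle.dump(clf, f)

# e) In a different program, load the saved model and predict
with open('model.pkl', 'rb') as f:
    clf_loaded = pickle.load(f)
print(clf_loaded.predict([[1, 1]]))
```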
Now the question that I have is: will step g) work correctly?
As the documentation for LabelEncoder says:
It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.
So will each entry map to the exact same value every time? If not, what is a good way to go about this? Is there any way to retrieve the mappings of the encoder, or an altogether different approach from LabelEncoder?
LabelEncoder encodes labels with a value between 0 and n_classes-1, where n_classes is the number of distinct labels. If a label repeats, it is assigned the same value as before. The categorical values are thereby converted into numeric values; that's all label encoding is about.
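A minimal sketch of that behavior (the city names are made-up example labels); note that LabelEncoder sorts the distinct labels before assigning codes:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['paris', 'tokyo', 'paris', 'amsterdam'])
# Classes are sorted, so 'amsterdam' -> 0, 'paris' -> 1, 'tokyo' -> 2,
# and the repeated 'paris' gets the same code both times
print(list(codes))        # [1, 2, 1, 0]
print(list(le.classes_))  # ['amsterdam', 'paris', 'tokyo']
```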
Label encoding refers to converting the labels into numeric form so that they are machine-readable. Machine learning algorithms can then better decide how those labels should be handled. It is an important preprocessing step for structured datasets in supervised learning.
LabelEncoder should be used to encode target values, i.e. y, not the input X. Ordinal encoding should be used for ordinal variables (where order matters, like cold, warm, hot), whereas label encoding should be used for non-ordinal (aka nominal) variables (where order doesn't matter, like blonde, brunette).
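For the ordinal case, scikit-learn's OrdinalEncoder accepts an explicit category order via its categories parameter; a small sketch using the cold/warm/hot example from above:

```python
from sklearn.preprocessing import OrdinalEncoder

# Explicit category order preserves the cold < warm < hot relationship
enc = OrdinalEncoder(categories=[['cold', 'warm', 'hot']])
X = [['warm'], ['cold'], ['hot']]
print(enc.fit_transform(X))  # [[1.], [0.], [2.]]
```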
According to the LabelEncoder implementation, the pipeline you've described will work correctly if and only if you fit the LabelEncoders at test time with data that has exactly the same set of unique values.
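That condition matters because the codes depend on the sorted set of distinct labels, so refitting on test data that is missing even one label shifts the mapping. A small sketch with made-up color labels:

```python
from sklearn.preprocessing import LabelEncoder

train_col = ['red', 'green', 'blue']
test_col = ['red', 'green']  # 'blue' never appears at test time

le_train = LabelEncoder().fit(train_col)
le_test = LabelEncoder().fit(test_col)

# 'red' maps to 2 at train time but 1 at test time: the codes disagree
print(le_train.transform(['red']))  # [2]
print(le_test.transform(['red']))   # [1]
```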
There's a somewhat hacky way to reuse the LabelEncoders you obtained during training. LabelEncoder has only one learned attribute, namely classes_. You can save it to disk and then restore it like this:
Train:

```python
import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(X)
numpy.save('classes.npy', encoder.classes_)
```
Test:

```python
encoder = LabelEncoder()
encoder.classes_ = numpy.load('classes.npy')
# Now you should be able to use encoder
# as you would do after `fit`
```
This seems more efficient than refitting it using the same data.
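To answer the question about retrieving the mappings: they can be reconstructed from classes_, since the index of each label in that sorted array is its code. A small sketch (the labels are made-up examples):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(['brunette', 'blonde', 'blonde'])
# classes_ holds the sorted distinct labels; each label's index is its code
mapping = {label: int(code)
           for label, code in zip(le.classes_, le.transform(le.classes_))}
print(mapping)  # {'blonde': 0, 'brunette': 1}
```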
For me the easiest way was exporting the LabelEncoder as a .pkl file for each column. You have to export the encoder for each column after calling the fit_transform() function.
For example:

```python
import pickle

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df_train = pd.read_csv('training_data.csv')
le = LabelEncoder()
df_train['Departure'] = le.fit_transform(df_train['Departure'])

# Export the Departure encoder
with open('Departure_encoder.pkl', 'wb') as output:
    pickle.dump(le, output)
```
Then in the testing project, you can load the LabelEncoder object and apply its transform() function directly:
```python
import pickle

import pandas as pd

df_test = pd.read_csv('testing_data.csv')

# Load the encoder file
with open('Departure_encoder.pkl', 'rb') as pkl_file:
    le_departure = pickle.load(pkl_file)

df_test['Departure'] = le_departure.transform(df_test['Departure'])
```
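One caveat with this approach: transform() raises a ValueError for any label the encoder never saw during fit. If the test data may contain unseen categories, you need a guard; a minimal sketch below, where safe_transform and the airport-code labels are hypothetical and not part of scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(['LHR', 'JFK', 'CDG'])

def safe_transform(encoder, values, unknown_code=-1):
    """Map labels through a fitted encoder; unseen labels get unknown_code.
    (safe_transform is a hypothetical helper, not a scikit-learn API.)"""
    known = set(encoder.classes_)
    return np.array([
        encoder.transform([v])[0] if v in known else unknown_code
        for v in values
    ])

print(safe_transform(le, ['JFK', 'SFO']))  # 'JFK' -> 1, unseen 'SFO' -> -1
```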