
Using Scikit's LabelEncoder correctly across multiple programs

The basic task I have at hand is:

a) Read some tab separated data.

b) Do some basic preprocessing

c) For each categorical column, use LabelEncoder to create a mapping. This is done somewhat like this:

    from sklearn import preprocessing

    mapper = {}
    # Converting categorical data
    for x in categorical_list:
        mapper[x] = preprocessing.LabelEncoder()
    for x in categorical_list:
        df[x] = mapper[x].fit_transform(df[x])

where df is a pandas dataframe and categorical_list is a list of column headers that need to be transformed.

d) Train a classifier and save it to disk using pickle

e) Now in a different program, the model saved is loaded.

f) The test data is loaded and the same preprocessing is performed.

g) The LabelEncoders are used to convert the categorical data.

h) The model is used to predict.

Now the question I have is: will step g) work correctly?

As the documentation for LabelEncoder says:

"It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels."

So will each entry map to the exact same value every time?

If not, what is a good way to go about this? Is there any way to retrieve the mappings of the encoder? Or is there an altogether different approach from LabelEncoder?

Asked by alphacentauri, Feb 22 '15 10:02



2 Answers

According to the LabelEncoder implementation, the pipeline you've described will work correctly if and only if the LabelEncoders you fit at test time see data with exactly the same set of unique values as the training data.
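To illustrate why the set of unique values matters, here is a minimal sketch (the city names are made-up example data). A fitted LabelEncoder stores the sorted unique labels in classes_, and each label's code is its position in that array, so the mapping can be read back directly and does not depend on hash values:

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the training values: classes_ holds the sorted unique labels,
# and each label's code is its index in that array.
train_encoder = LabelEncoder()
train_encoder.fit(["paris", "tokyo", "amsterdam", "paris"])
train_mapping = {str(label): code for code, label in enumerate(train_encoder.classes_)}

# Refitting on data with a different set of unique values
# silently produces a different mapping for the same labels.
test_encoder = LabelEncoder()
test_encoder.fit(["tokyo", "paris"])
test_mapping = {str(label): code for code, label in enumerate(test_encoder.classes_)}

print(train_mapping)
print(test_mapping)
```

Here "paris" and "tokyo" get different codes in the two mappings, which is why an encoder refitted on test data cannot be trusted to match the one used in training.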

There's a somewhat hacky way to reuse the LabelEncoders you fitted during training. A fitted LabelEncoder has only one learned attribute, namely classes_. You can save it to disk and then restore it like this:

Train:

    import numpy
    from sklearn.preprocessing import LabelEncoder

    encoder = LabelEncoder()
    encoder.fit(X)
    numpy.save('classes.npy', encoder.classes_)

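Test:

    import numpy
    from sklearn.preprocessing import LabelEncoder

    encoder = LabelEncoder()
    # allow_pickle=True is needed on recent NumPy versions if classes_ has object dtype
    encoder.classes_ = numpy.load('classes.npy', allow_pickle=True)
    # Now you should be able to use encoder
    # as you would do after `fit`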

This seems more efficient than refitting it using the same data.
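As a quick sanity check of this approach, the following sketch saves classes_ to a temporary file, restores it into a fresh encoder, and compares the codes. The colour values and the temporary path are assumptions made up for the example:

```python
import os
import tempfile

import numpy as np
from sklearn.preprocessing import LabelEncoder

# "Train" side: fit the encoder and save only its classes_ array.
train_values = ["red", "green", "blue", "green"]
encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_values)

path = os.path.join(tempfile.mkdtemp(), "classes.npy")
np.save(path, encoder.classes_)

# "Test" side: build a fresh encoder and restore classes_ without refitting.
restored = LabelEncoder()
restored.classes_ = np.load(path, allow_pickle=True)

# The restored encoder yields the same codes and can map them back.
restored_codes = restored.transform(train_values)
round_trip = [str(v) for v in restored.inverse_transform(restored_codes)]
```

If restored_codes matches train_codes and round_trip matches train_values, the saved array carries everything the encoder needs.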

Answered by Artem Sobolev, Sep 21 '22 21:09


For me, the easiest way was to export the LabelEncoder as a .pkl file for each column. You have to export each column's encoder after calling fit_transform() on it.

For example:

    from sklearn.preprocessing import LabelEncoder
    import pickle
    import pandas as pd

    df_train = pd.read_csv('training_data.csv')
    le = LabelEncoder()
    df_train['Departure'] = le.fit_transform(df_train['Departure'])

    # Export the Departure encoder
    with open('Departure_encoder.pkl', 'wb') as output:
        pickle.dump(le, output)

Then, in the testing project, you can load the LabelEncoder object and apply the transform() function directly:

    from sklearn.preprocessing import LabelEncoder
    import pickle
    import pandas as pd

    df_test = pd.read_csv('testing_data.csv')

    # Load the encoder file
    with open('Departure_encoder.pkl', 'rb') as pkl_file:
        le_departure = pickle.load(pkl_file)

    df_test['Departure'] = le_departure.transform(df_test['Departure'])
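With several categorical columns, one possible variation is to keep all encoders in a single dictionary keyed by column name, as in the question, and pickle that dictionary once. The sketch below is only an illustration: the small DataFrames, the Arrival column, and the encoders.pkl file name are made up for the example.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training data with two categorical columns.
df_train = pd.DataFrame({"Departure": ["CAI", "LHR", "CAI"],
                         "Arrival": ["JFK", "CAI", "LHR"]})
categorical_list = ["Departure", "Arrival"]

# Training program: fit one encoder per column and keep them in a dict.
encoders = {}
for column in categorical_list:
    encoders[column] = LabelEncoder()
    df_train[column] = encoders[column].fit_transform(df_train[column])

# Save all encoders in one file (a temporary location for this example).
encoder_path = os.path.join(tempfile.mkdtemp(), "encoders.pkl")
with open(encoder_path, "wb") as f:
    pickle.dump(encoders, f)

# Testing program: load the dict and call transform() only, never fit again.
with open(encoder_path, "rb") as f:
    loaded = pickle.load(f)

df_test = pd.DataFrame({"Departure": ["LHR", "CAI"],
                        "Arrival": ["CAI", "JFK"]})
for column in categorical_list:
    df_test[column] = loaded[column].transform(df_test[column])
```

Note that transform() raises a ValueError if the test data contains a label the encoder never saw during fit, so unseen categories need to be handled before this step.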

Answered by Shady Mohamed Sherif, Sep 17 '22 21:09