The basic task that I have at hand is:

a) Read some tab-separated data.

b) Do some basic preprocessing.

c) For each categorical column, use LabelEncoder to create a mapping. This is done roughly like this:

```python
mapper = {}
# Convert each categorical column with its own LabelEncoder
for x in categorical_list:
    mapper[x] = preprocessing.LabelEncoder()
for x in categorical_list:
    df[x] = mapper[x].fit_transform(df[x])
```
where df is a pandas DataFrame and categorical_list is a list of column headers that need to be transformed.
d) Train a classifier and save it to disk using pickle
e) Now in a different program, the model saved is loaded.
f) The test data is loaded and the same preprocessing is performed.
g) The LabelEncoders are used to convert the categorical data.
h) The model is used to predict.
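For context, steps d) and e) can be sketched like this, assuming pickle for persistence; the LogisticRegression classifier and the tiny dummy dataset are placeholders for illustration, not part of the original setup:

```python
import pickle

from sklearn.linear_model import LogisticRegression

# d) Train a classifier and save it to disk (dummy data for illustration)
X_train = [[0, 1], [1, 0], [1, 1], [0, 0]]
y_train = [0, 1, 1, 0]
clf = LogisticRegression().fit(X_train, y_train)
with open('model.pkl', 'wb') as f:
    pickle.dump(clf, f)

# e) In a different program, load the saved model and predict
with open('model.pkl', 'rb') as f:
    clf_loaded = pickle.load(f)
print(clf_loaded.predict([[1, 1]]))
```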
Now the question that I have is: will step g) work correctly?
As the documentation for LabelEncoder says:
It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.
So will each entry map to the exact same value every time? If not, what is a good way to go about this? Is there any way to retrieve the mappings of the encoder, or an altogether different approach from LabelEncoder?
LabelEncoder encodes labels with a value between 0 and n_classes-1, where n_classes is the number of distinct labels. If a label repeats, it is assigned the same value as before. The categorical values are thereby converted into numeric values; that's all label encoding is about.
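A minimal sketch of that behavior (the city names are made-up example labels); note that LabelEncoder sorts the distinct labels before assigning codes:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['paris', 'tokyo', 'paris', 'amsterdam'])
# Classes are sorted, so 'amsterdam' -> 0, 'paris' -> 1, 'tokyo' -> 2,
# and the repeated 'paris' gets the same code both times
print(list(codes))        # [1, 2, 1, 0]
print(list(le.classes_))  # ['amsterdam', 'paris', 'tokyo']
```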
Label encoding refers to converting the labels into numeric form so that they are machine-readable. Machine learning algorithms can then better decide how those labels should be handled. It is an important preprocessing step for structured datasets in supervised learning.
LabelEncoder should be used to encode target values, i.e. y, not the input X. Ordinal encoding should be used for ordinal variables (where order matters, like cold, warm, hot), whereas label encoding should be used for non-ordinal (aka nominal) variables (where order doesn't matter, like blonde, brunette).
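For the ordinal case, scikit-learn's OrdinalEncoder accepts an explicit category order via its categories parameter; a small sketch using the cold/warm/hot example from above:

```python
from sklearn.preprocessing import OrdinalEncoder

# Explicit category order preserves the cold < warm < hot relationship
enc = OrdinalEncoder(categories=[['cold', 'warm', 'hot']])
X = [['warm'], ['cold'], ['hot']]
print(enc.fit_transform(X))  # [[1.], [0.], [2.]]
```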
According to the LabelEncoder implementation, the pipeline you've described will work correctly if and only if you fit the LabelEncoders at test time with data that has exactly the same set of unique values.
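That condition matters because the codes depend on the sorted set of distinct labels, so refitting on test data that is missing even one label shifts the mapping. A small sketch with made-up color labels:

```python
from sklearn.preprocessing import LabelEncoder

train_col = ['red', 'green', 'blue']
test_col = ['red', 'green']  # 'blue' never appears at test time

le_train = LabelEncoder().fit(train_col)
le_test = LabelEncoder().fit(test_col)

# 'red' maps to 2 at train time but 1 at test time: the codes disagree
print(le_train.transform(['red']))  # [2]
print(le_test.transform(['red']))   # [1]
```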
There's a somewhat hacky way to reuse the LabelEncoders you obtained during training. LabelEncoder has only one learned attribute, namely classes_. You can save it to disk and then restore it like this:
Train:

```python
import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(X)
numpy.save('classes.npy', encoder.classes_)
```
Test:

```python
encoder = LabelEncoder()
encoder.classes_ = numpy.load('classes.npy')
# Now you should be able to use encoder
# as you would do after `fit`
```
This seems more efficient than refitting it using the same data.
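To answer the question about retrieving the mappings: they can be reconstructed from classes_, since the index of each label in that sorted array is its code. A small sketch (the labels are made-up examples):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(['brunette', 'blonde', 'blonde'])
# classes_ holds the sorted distinct labels; each label's index is its code
mapping = {label: int(code)
           for label, code in zip(le.classes_, le.transform(le.classes_))}
print(mapping)  # {'blonde': 0, 'brunette': 1}
```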
For me the easiest way was exporting the LabelEncoder as a .pkl file for each column. You have to export the encoder for each column after calling the fit_transform() function.
For example:

```python
import pickle

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df_train = pd.read_csv('training_data.csv')
le = LabelEncoder()
df_train['Departure'] = le.fit_transform(df_train['Departure'])

# Export the Departure encoder
with open('Departure_encoder.pkl', 'wb') as output:
    pickle.dump(le, output)
```
Then in the testing project, you can load the LabelEncoder object and apply its transform() function directly:
```python
import pickle

import pandas as pd

df_test = pd.read_csv('testing_data.csv')

# Load the encoder file
with open('Departure_encoder.pkl', 'rb') as pkl_file:
    le_departure = pickle.load(pkl_file)

df_test['Departure'] = le_departure.transform(df_test['Departure'])
```
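One caveat with this approach: transform() raises a ValueError for any label the encoder never saw during fit. If the test data may contain unseen categories, you need a guard; a minimal sketch below, where safe_transform and the airport-code labels are hypothetical and not part of scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(['LHR', 'JFK', 'CDG'])

def safe_transform(encoder, values, unknown_code=-1):
    """Map labels through a fitted encoder; unseen labels get unknown_code.
    (safe_transform is a hypothetical helper, not a scikit-learn API.)"""
    known = set(encoder.classes_)
    return np.array([
        encoder.transform([v])[0] if v in known else unknown_code
        for v in values
    ])

print(safe_transform(le, ['JFK', 'SFO']))  # 'JFK' -> 1, unseen 'SFO' -> -1
```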