
Using the same LabelEncoder for the test dataset, or a new LabelEncoder?

Tags:

scikit-learn

I'm a complete novice with scikit-learn.

When I encode a categorical feature in the test dataset, should I reuse the LabelEncoder instance that I used for that same feature in the training dataset, or not? Concretely, I mean something like this:

from sklearn import preprocessing

# training data label encoding
le_blood_type = preprocessing.LabelEncoder()
df_training['BLOOD_TYPE'] = le_blood_type.fit_transform(df_training['BLOOD_TYPE'])  # string -> integer label
# ....

# 1. Using the same label encoder
df_test['BLOOD_TYPE'] = le_blood_type.fit_transform(df_test['BLOOD_TYPE'])

# 2. Using a different label encoder
le_for_test_blood_type = preprocessing.LabelEncoder()
df_test['BLOOD_TYPE'] = le_for_test_blood_type.fit_transform(df_test['BLOOD_TYPE'])

Which one is correct? Or does the choice make no difference, because the categorical values in the training dataset and in the test dataset should end up the same anyway?

mac475 asked Jun 30 '15 08:06



2 Answers

The problem is in fact the way you use it.

LabelEncoder maps each value of a nominal feature to an integer code, so you should fit it once and afterwards only call transform on the fitted object. Calling fit_transform again refits the encoder, even on the same instance, so the codes may no longer match the ones used for training. Keep in mind that every value of the nominal feature has to be present when you fit.

The right way to use it is to take your nominal feature, fit the encoder on it once, and from then on use only the transform method.

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6]) 
array([0, 0, 1, 2]...)

(Example from the official scikit-learn documentation.)
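To tie this back to the question, here is a minimal sketch of why refitting on the test data matters. The blood-type lists are made-up illustration data, not the asker's DataFrame columns, and the variable names are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up illustration data (assumption): the test set lacks "A" and "AB".
train = ["A", "B", "AB", "O"]
test = ["O", "B", "O"]

# Fit once on the training values, then only transform the test values.
le = LabelEncoder().fit(train)
shared = le.transform(test)

# Refitting on the test values builds a new, different mapping.
refit = LabelEncoder().fit_transform(test)

print(list(le.classes_))  # sorted training classes
print(list(shared))       # codes consistent with training
print(list(refit))        # codes from the refitted encoder
```

With this data, "O" is encoded as 3 by the encoder fitted on the training values but as 1 by the refitted one, so a model trained on the first mapping would misread the second.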

RPresle answered Oct 14 '22 09:10


I think RPresle has already given the answer. I just wanted to relate it more directly to the situation in the question:

In general, you only need to fit the LabelEncoder once, on the feature in the training set, and then use it to transform the same feature in the testing set. But if your testing set contains feature values that do not appear in the training set, fit the label encoder on the union of the training and testing values instead.
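A rough sketch of the union approach, using plain lists in place of the DataFrame columns. The sample values and variable names are assumptions made for illustration only.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up illustration data (assumption): "AB" never appears in training.
train_blood = ["A", "B", "O", "A"]
test_blood = ["B", "AB", "O"]

# An encoder fitted only on the training values rejects the unseen "AB".
le = LabelEncoder().fit(train_blood)
try:
    le.transform(test_blood)
    unseen_error = None
except ValueError as exc:
    unseen_error = exc

# Fitting on the union of both sets gives one mapping that covers everything.
le_union = LabelEncoder().fit(train_blood + test_blood)
train_codes = le_union.transform(train_blood)
test_codes = le_union.transform(test_blood)

print(unseen_error)
print(list(train_codes), list(test_codes))
```

Note that fitting on the union only fixes the encoding step. A model trained on the encoded training set has still never seen an example with the extra value.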

Undecided answered Oct 14 '22 09:10