Use the same LabelEncoder for the test dataset, or a new LabelEncoder?

Tags:

scikit-learn

I'm a complete novice with scikit-learn.

When I encode a categorical feature in the test dataset, should I reuse the LabelEncoder instance that I used for the same feature in the training dataset, or not? In code, I mean something like this:

from sklearn import preprocessing

# training data label encoding
le_blood_type = preprocessing.LabelEncoder()
df_training[ 'BLOOD_TYPE' ] = le_blood_type.fit_transform( df_training[ 'BLOOD_TYPE' ] )    # labeling from string
....
# 1. Using the same label encoder
df_test[ 'BLOOD_TYPE' ] = le_blood_type.fit_transform( df_test[ 'BLOOD_TYPE' ] )

# 2. Using a different label encoder
le_for_test_blood_type = preprocessing.LabelEncoder()
df_test[ 'BLOOD_TYPE' ] = le_for_test_blood_type.fit_transform( df_test[ 'BLOOD_TYPE' ] )

Which one is the right code? Or does it make no difference which one I choose, because the categorical values in the training dataset and in the test dataset should end up encoded the same way?


asked Jun 30 '15 08:06

mac475

2 Answers

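The problem is actually the way you are using it.

LabelEncoder maps each nominal value to an integer code, so you should fit it once and, once it is fitted, only call transform. Both snippets in the question call fit_transform on the test data, which refits the encoder and can assign different codes to the same values. Keep in mind that every nominal value you need to encode must be present when you fit, that is, in the training phase.

The right way to use it is to take your nominal feature, fit the encoder on it once, and from then on use only the transform method.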

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6]) 
array([0, 0, 1, 2]...)

(Example taken from the official documentation.)
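Applied to the question's BLOOD_TYPE feature, the same idea might look like the sketch below. The two lists are hypothetical stand-ins for df_training['BLOOD_TYPE'] and df_test['BLOOD_TYPE'], and the values in them are made up for illustration. It shows that reusing the fitted encoder keeps the codes consistent, while refitting on the test data alone does not.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for df_training['BLOOD_TYPE'] and df_test['BLOOD_TYPE'].
train_blood = ["A", "B", "AB", "O", "A"]
test_blood = ["O", "B", "O"]

# Fit once, on the training values only.
le_blood_type = LabelEncoder()
train_codes = le_blood_type.fit_transform(train_blood)

# Reuse the fitted encoder on the test values: transform only, no refit.
test_codes = le_blood_type.transform(test_blood)

# For comparison: refitting on the test values alone builds a different mapping.
refit_codes = LabelEncoder().fit_transform(test_blood)

print(le_blood_type.classes_)  # classes are sorted: A, AB, B, O
print(test_codes)              # "O" -> 3, "B" -> 2 (same codes as in training)
print(refit_codes)             # "O" -> 1, "B" -> 0 (inconsistent with training)
```

A model trained on the first mapping would misread the refitted codes, which is why only transform should be called on the test data.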


answered Oct 14 '22 09:10

RPresle

I think RPresle has already given the answer. I just want to relate it more directly to the situation in the question:

In general, you fit the LabelEncoder once, on the feature in the training set, and then use it to transform the same feature in the test set. But if the test set contains feature values that do not appear in the training set, fit the label encoder on the union of the training and test values.
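A minimal sketch of the unseen-value case, again with made-up lists standing in for the training and test columns: an encoder fitted on the training values alone raises a ValueError when asked to transform a value it has never seen, while an encoder fitted on the union handles both sets.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: "AB" appears in the test set but never in training.
train_blood = ["A", "B", "O"]
test_blood = ["A", "AB"]

# An encoder fitted on the training values alone cannot encode "AB".
le_train_only = LabelEncoder().fit(train_blood)
try:
    le_train_only.transform(test_blood)
    unseen_error = None
except ValueError as exc:
    unseen_error = exc
print("transform failed:", unseen_error is not None)

# Fit on the union of training and test values, then only transform.
le_union = LabelEncoder().fit(sorted(set(train_blood) | set(test_blood)))
train_codes = le_union.transform(train_blood)
test_codes = le_union.transform(test_blood)

print(le_union.classes_)  # A, AB, B, O
print(train_codes)        # codes for A, B, O
print(test_codes)         # codes for A, AB
```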

answered Oct 14 '22 09:10

Undecided
