
Is numerical encoding necessary for the target variable in classification?

I am using sklearn for text classification. All my features are numerical, but my target variable labels are text. I understand the rationale for encoding features as numbers, but I don't think it applies to the target variable. Is that right?

Nanda kumar asked Dec 14 '22 16:12



1 Answer

If your target variable is in textual form, you can transform it into numeric codes running from 0 to (number of classes - 1), or you can leave it alone (please see my note below). Either way, a Scikit-learn classifier typically ends up working on numeric codes. In an OVA (One Versus All) scheme, for example, the learning algorithm fits one model per class and tries to tell that class apart from all the remaining ones.
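As a minimal sketch of that 0 to (number of classes - 1) encoding, the snippet below uses `LabelEncoder` on a small list of labels. The labels are made up for illustration and are not from the question; the point is only to show how the codes are assigned and how they map back to the text.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical text labels, invented for this illustration
labels = ["spam", "ham", "spam", "eggs", "ham"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# The learned classes are sorted, so codes follow alphabetical order:
# eggs -> 0, ham -> 1, spam -> 2
class_names = encoder.classes_.tolist()
code_list = codes.tolist()

# inverse_transform maps the numeric codes back to the original text
decoded = encoder.inverse_transform(codes).tolist()

print(class_names)
print(code_list)
print(decoded)
```

Keeping the fitted encoder around is what lets you turn numeric predictions back into readable labels later.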

For instance, in this example from the Scikit-learn documentation, the class of an iris can be worked out because three models are fitted, one for each possible class:

  • class 0 versus classes 1 and 2
  • class 1 versus classes 0 and 2
  • class 2 versus classes 0 and 1

Naturally, classes 0, 1 and 2 are Setosa, Versicolor, and Virginica, but the algorithm needs them expressed as numeric codes, as you can verify by exploring the results of the example code:

list(iris.target_names)
['setosa', 'versicolor', 'virginica']

np.unique(Y)
array([0, 1, 2])
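To make the three one-versus-all models visible, here is a rough sketch that wraps a logistic regression in `OneVsRestClassifier`. This wrapper is my choice for illustration and is not necessarily how the documentation example handles the multiclass case internally; `max_iter=1000` is likewise just an assumption to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X, y = iris.data, iris.target  # y already holds the codes 0, 1, 2

# Explicit one-versus-rest wrapper: one binary model per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

n_models = len(ovr.estimators_)  # number of fitted binary models
classes = ovr.classes_.tolist()  # class codes the wrapper learned

print(n_models)
print(classes)
```

Each entry of `ovr.estimators_` is a binary classifier for one class against the other two.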

NOTE: Scikit-learn does encode the target labels by itself if they are strings. In the logistic regression source on Scikit-learn's GitHub page (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py), at rows 1623 and 1624 at the time of writing, you can see the code call the label encoder and encode the labels automatically:

# Encode for string labels
label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)
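A quick way to check this yourself is to fit a classifier directly on string labels. The sketch below rebuilds the iris target as text (the variable name `y_text` is mine) and fits a plain `LogisticRegression` on it, with `max_iter=1000` assumed only to keep the solver from warning about convergence.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data

# Turn the numeric target back into text labels for the experiment
y_text = iris.target_names[iris.target]

# No manual encoding: the classifier is given strings as the target
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y_text)

learned_classes = clf.classes_.tolist()    # labels stored as text
predictions = clf.predict(X[:3]).tolist()  # predictions come back as text

print(learned_classes)
print(predictions)
```

The fit goes through without any manual encoding, and both `classes_` and the predictions come back as the original strings.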
Luca Massaron answered Dec 29 '22 10:12
