
Encoding strings to numbers for use in scikit-learn

My data consists of 50 columns, most of them strings. I have a single multi-class variable to predict. I tried using scikit-learn's LabelEncoder to convert the features (not the classes) into whole numbers and feed them as input to the RandomForest model I am using for classification.

Now, when new test data comes in (a stream of new data), how will I know, for each column, which label each string should get? Running LabelEncoder again would give me new labels, independent of the labels I generated before. Am I doing this wrong? Is there anything else I should use for consistent encoding?

asked Jun 16 '15 by Huga


People also ask

What is the difference between LabelEncoder and OrdinalEncoder?

OrdinalEncoder is for converting features, while LabelEncoder is for converting the target variable.
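
A minimal sketch of the distinction (note that OrdinalEncoder was only added in scikit-learn 0.20, well after this question was asked):

>>> from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
>>> X = [['red', 'small'], ['blue', 'large'], ['red', 'large']]  # 2-D features
>>> OrdinalEncoder().fit_transform(X)
array([[1., 1.],
       [0., 0.],
       [1., 0.]])
>>> y = ['cat', 'dog', 'cat']  # 1-D target
>>> LabelEncoder().fit_transform(y)
array([0, 1, 0])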

Why is LabelEncoder used?

LabelEncoder can be used to normalize labels. It can also be used to transform non-numerical labels (as long as they are hashable and comparable) into numerical labels.

What is OneHotEncoder in Sklearn?

OneHotEncoder encodes categorical integer features using a one-hot (aka one-of-K) scheme. The input to this transformer should be a matrix of integers denoting the values taken on by categorical (discrete) features. The output is a sparse matrix where each column corresponds to one possible value of one feature.
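
That description matches older scikit-learn releases; since version 0.20 OneHotEncoder also accepts string categories directly. A small sketch of the sparse output:

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> X = [['paris'], ['tokyo'], ['paris']]
>>> enc.fit_transform(X).toarray()  # fit_transform returns a sparse matrix
array([[1., 0.],
       [0., 1.],
       [1., 0.]])
>>> enc.categories_
[array(['paris', 'tokyo'], dtype=object)]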


2 Answers

The LabelEncoder class has two methods that handle this distinction: fit and transform. Typically you call fit first to map some data to a set of integers:

>>> from sklearn.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>> le.fit(['a', 'e', 'b', 'z'])
LabelEncoder()
>>> le.classes_
array(['a', 'b', 'e', 'z'], dtype='<U1')

Once you've fit your encoder, you can transform any data into the label space without changing the existing mapping:

>>> le.transform(['a', 'e', 'a', 'z', 'a', 'b'])
array([0, 2, 0, 3, 0, 1])
>>> le.transform(['e', 'e', 'e'])
array([2, 2, 2])
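
For completeness (not shown in the original answer), inverse_transform goes the other way, mapping encoded integers back to the original strings:

>>> le.inverse_transform([0, 2, 0, 3])
array(['a', 'e', 'a', 'z'], dtype='<U1')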

Using this encoder assumes that you know beforehand all the labels that appear anywhere in your data. If labels might show up later (e.g., in an online learning scenario), you'll need to decide how to handle them outside the encoder.
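
One possible policy (an assumption for illustration, not part of the LabelEncoder API) is to reserve an explicit sentinel class at fit time and map unseen strings to it before transforming:

>>> le = LabelEncoder()
>>> le.fit(['a', 'b', 'e', 'z', '<unknown>'])  # '<unknown>' is a made-up sentinel
LabelEncoder()
>>> known = set(le.classes_)
>>> new_data = ['a', 'q', 'z']  # 'q' never appeared during fit
>>> le.transform([x if x in known else '<unknown>' for x in new_data])
array([1, 0, 4])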

answered Sep 24 '22 by lmjohns3


You could save the string -> label mapping from the training data for each column.

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> col_1 = ["paris", "paris", "tokyo", "amsterdam"]
>>> le.fit(col_1)
LabelEncoder()
>>> # le.classes_ holds the unique values in sorted order; each value's
>>> # encoded label is simply its index in that array
>>> {str(cls): i for i, cls in enumerate(le.classes_)}
{'amsterdam': 0, 'paris': 1, 'tokyo': 2}

When the test data comes in, you can use those mappings to encode the corresponding columns in the test data. You do not have to fit the encoder again on the test data.
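
For example (the -1 fallback for unseen strings is an assumption, chosen here purely for illustration):

>>> mapping = {'amsterdam': 0, 'paris': 1, 'tokyo': 2}  # saved from training
>>> test_col = ['tokyo', 'berlin', 'paris']  # 'berlin' was not in training
>>> [mapping.get(city, -1) for city in test_col]  # -1 marks unseen values
[2, -1, 1]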

answered Sep 27 '22 by Chung-Yen Hung