Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Getting ValueError: y contains new labels when using scikit learn's LabelEncoder

I have a series like:

df['ID'] = ['ABC123', 'IDF345', ...]

I'm using scikit's LabelEncoder to convert it to numerical values to be fed into the RandomForestClassifier.

During the training, I'm doing as follows:

le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID) 

But, now for testing/prediction, when I pass in new data, I want to transform the 'ID' from this data based on le_id i.e., if same values are present then transform it according to the above label encoder, otherwise assign a new numerical value.

In the test file, I was doing as follows:

new_df['ID'] = le_dpid.transform(new_df.ID)

But, I'm getting the following error: ValueError: y contains new labels

How do I fix this?? Thanks!

UPDATE:

So the task I have is to use the below (for example) as training data and predict the 'High', 'Mod', 'Low' values for new BankNum, ID combinations. The model should learn the characteristics where a 'High' is given, where a 'Low' is given from the training dataset. For example, below a 'High' is given when there are multiple entries with same BankNum and different IDs.

df = 

BankNum   | ID    | Labels

0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low

And then predict it on something like:

BankNum   |  ID | 

00982222  | AB999 | 
00982222  | AB999 |
00981111  | AB890 |

I'm doing something like this:

df['BankNum'] = df.BankNum.astype(np.float128)

    le_id = LabelEncoder()
    df['ID'] = le_id.fit_transform(df.ID)

X_train, X_test, y_train, y_test = train_test_split(df[['BankNum', 'ID'], df.Labels, test_size=0.25, random_state=42)
    clf = RandomForestClassifier(random_state=42, n_estimators=140)
    clf.fit(X_train, y_train)
like image 961
Xavier Avatar asked Sep 18 '17 21:09

Xavier


People also ask

What is LabelEncoder () in Python?

LabelEncoder[source] Encode target labels with value between 0 and n_classes-1. This transformer should be used to encode target values, i.e. y , and not the input X . Read more in the User Guide.

How does LabelEncoder work Sklearn?

Label Encoder: Sklearn provides a very efficient tool for encoding the levels of categorical features into numeric values. LabelEncoder encode labels with a value between 0 and n_classes-1 where n is the number of distinct labels. If a label repeats it assigns the same value to as assigned earlier.

What is the difference between LabelEncoder and OrdinalEncoder?

OrdinalEncoder is for converting features, while LabelEncoder is for converting target variable.

What is label encoding?

Label Encoding refers to converting the labels into a numeric form so as to convert them into the machine-readable form. Machine learning algorithms can then decide in a better way how those labels must be operated. It is an important pre-processing step for the structured dataset in supervised learning.


1 Answers

I think the error message is very clear: Your test dataset contains ID labels which have not been included in your training data set. For this items, the LabelEncoder can not find a suitable numeric value to represent. There are a few ways to solve this problem. You can either try to balance your data set, so that you are sure that each label is not only present in your test but also in your training data. Otherwise, you can try to follow one of the ideas presented here.

One of the possibles solutions is, that you search through your data set at the beginning, get a list of all unique ID values, train the LabelEncoder on this list, and keep the rest of your code just as it is at the moment.

An other possible solution is, to check that the test data have only labels which have been seen in the training process. If there is a new label, you have to set it to some fallback value like unknown_id (or something like this). Doin this, you put all new, unknown IDs in one class; for this items the prediction will then fail, but you can use the rest of your code as it is now.

like image 88
zimmerrol Avatar answered Oct 08 '22 12:10

zimmerrol