Getting ValueError: y contains new labels when using scikit-learn's LabelEncoder

Tags: python, encoding, machine-learning, scikit-learn, categorical-data

I have a series like:

df['ID'] = ['ABC123', 'IDF345', ...]

I'm using scikit-learn's LabelEncoder to convert it to numerical values that can be fed into a RandomForestClassifier.

During the training, I'm doing as follows:

le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)

Now, for testing/prediction, I pass in new data and want to transform its 'ID' column with the same le_id. If a value was already seen during training, it should be encoded exactly as the fitted label encoder encodes it; otherwise it should be assigned a new numerical value.

In the test file, I was doing as follows:

new_df['ID'] = le_id.transform(new_df.ID)

But I'm getting the following error: ValueError: y contains new labels

How do I fix this? Thanks!

UPDATE:

The task is to use data like the example below as training data and predict the 'High', 'Mod', or 'Low' value for new BankNum/ID combinations. The model should learn from the training dataset which characteristics lead to a 'High' and which lead to a 'Low'. For example, below a 'High' is given when there are multiple entries with the same BankNum and different IDs.

df =

BankNum   | ID    | Labels
0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low

And then predict it on something like:

BankNum  | ID
00982222 | AB999
00982222 | AB999
00981111 | AB890

I'm doing something like this:

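import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Note: np.float128 is not available on every platform, and this cast
# only works if BankNum holds purely numeric strings (no hyphens).
df['BankNum'] = df.BankNum.astype(np.float128)

# Encode the ID strings as integers
le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)

X_train, X_test, y_train, y_test = train_test_split(df[['BankNum', 'ID']], df.Labels, test_size=0.25, random_state=42)
clf = RandomForestClassifier(random_state=42, n_estimators=140)
clf.fit(X_train, y_train)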

asked Sep 18 '17 21:09

Xavier

1 Answer

I think the error message is very clear: your test dataset contains ID labels that were not included in your training dataset. For these items, the LabelEncoder cannot find a suitable numeric value to represent them. There are a few ways to solve this problem. You can try to balance your dataset so that every label present in your test data is also present in your training data. Otherwise, you can try one of the ideas below.
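To make the cause concrete, here is a minimal sketch that reproduces the error and lists the offending IDs. The sample IDs are made up for illustration, and the exact wording of the error message differs between scikit-learn versions.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample data: "AB999" appears only in the test IDs
train_ids = ["AB123", "ED245", "ED343"]
test_ids = ["AB123", "AB999"]

le_id = LabelEncoder()
le_id.fit(train_ids)

# IDs in the test data that the encoder never saw during fit
unseen = sorted(set(test_ids) - set(le_id.classes_))

# transform() raises ValueError as soon as it meets an unseen label
try:
    le_id.transform(test_ids)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(unseen)
print(error_message)
```

Checking the set difference against le_id.classes_ before calling transform is a cheap way to see which IDs need special handling.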

One possible solution is to search through your whole dataset at the beginning, collect a list of all unique ID values, fit the LabelEncoder on this list, and keep the rest of your code just as it is at the moment.
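A minimal sketch of this first idea, assuming both the training data and the data to predict on are available up front. The frames df and new_df and the ID_enc column are illustrative stand-ins, not the asker's real data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for the training and prediction frames
df = pd.DataFrame({"ID": ["AB123", "ED245", "ED343"]})
new_df = pd.DataFrame({"ID": ["AB999", "AB123"]})

# Fit the encoder on every ID that occurs in either frame
all_ids = pd.concat([df["ID"], new_df["ID"]]).unique()
le_id = LabelEncoder()
le_id.fit(all_ids)

# Both frames can now be transformed without a ValueError
df["ID_enc"] = le_id.transform(df["ID"])
new_df["ID_enc"] = le_id.transform(new_df["ID"])
```

This only removes the error. The classifier still never saw the codes of the new IDs during training, so it has learned nothing specific about them.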

Another possible solution is to check that the test data contains only labels that were seen during training. If there is a new label, you set it to some fallback value like unknown_id (or something similar). Doing this puts all new, unknown IDs into one class; for these items the prediction will likely be unreliable, but you can use the rest of your code as it is now.
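The fallback idea could be sketched as follows. The placeholder string "unknown_id" and the sample IDs are assumptions for illustration; the placeholder is added to the values the encoder is fitted on so that it gets its own integer code.

```python
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "unknown_id"  # hypothetical placeholder for IDs not seen in training

# Hypothetical sample data
train_ids = ["AB123", "ED245", "ED343"]
test_ids = ["AB999", "ED245"]

# Fit on the training IDs plus the placeholder
le_id = LabelEncoder()
le_id.fit(train_ids + [UNKNOWN])

# Replace every unseen ID with the placeholder before transforming
known = set(le_id.classes_)
test_clean = [x if x in known else UNKNOWN for x in test_ids]
test_enc = le_id.transform(test_clean)
```

If a recent scikit-learn is available, OrdinalEncoder with handle_unknown="use_encoded_value" and an unknown_value offers similar behaviour for feature columns without the manual replacement step; whether that fits depends on the installed version.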

answered Oct 08 '22 12:10

zimmerrol
