Pipeline OrdinalEncoder ValueError Found unknown categories

Please take it easy on me. I’m switching careers into data science and don’t have a CS or programming background—so I could be doing something profoundly stupid. I've researched for a few hours without success.

Objective: get Pipeline to run with OrdinalEncoder.

Problem: the code does not run with the OrdinalEncoder step. It does run without it. As best I can tell, OrdinalEncoder accepts two arguments, categories and dtype. Neither helps.

I’m passing the public diabetes data set to the model. Is this the issue? In other words, is passing high-cardinality features to OrdinalEncoder causing a mismatch between the train and test data after the model is built, i.e. does the test split contain a value that the train split does not?

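from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('ordinal_encoder', OrdinalEncoder()),
    ('classifier', RandomForestClassifier(criterion='gini', n_estimators=100))])

# X, y: features and labels of the diabetes data set (loading not shown)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Construct model
model = pipe.fit(X_train, y_train)

# Show results: roc_auc_score expects (y_true, y_score), where y_score is
# the predicted probability of the positive class (binary target assumed)
print("Hold-out AUC score: %.3f" % roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))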

Here’s the error I’m getting:

ValueError: Found unknown categories [17.0] in column 0 during transform

What am I doing wrong?

Setup:

The scikit-learn version is 0.20.2.
3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018, 02:44:43) 
[Clang 6.0 (clang-600.0.57)]
sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)
Pablo Honey asked Feb 22 '19 22:02


2 Answers

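Your problem is that the encoder has encountered a value in the test data that it did not see in the training data. This is fine. You just need to set the handle_unknown argument on your encoder. Note that OrdinalEncoder only accepts handle_unknown from scikit-learn 0.24 onward (as handle_unknown='use_encoded_value' together with unknown_value), so with 0.20.2 you would need to upgrade first.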

You should fit encoders and scalers to the training data (but not the test data) and then use them to transform both training and test data. Thus, you must plan for the possibility of unexpected values in the test data.
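As a minimal sketch of that fit-on-train, transform-both pattern, the snippet below uses a made-up one-column data set (not the diabetes data) in which the test split contains 17.0 and the training split does not. It assumes scikit-learn 0.24 or later, where OrdinalEncoder accepts handle_unknown='use_encoded_value'. The choice of -1 as the placeholder code is arbitrary.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: 17.0 appears only in the test split.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [17.0]])

# Default encoder (handle_unknown='error'): transform raises on the unseen value.
strict = OrdinalEncoder().fit(X_train)
try:
    strict.transform(X_test)
    strict_failed = False
except ValueError as err:
    strict_failed = True
    print(err)

# Requires scikit-learn >= 0.24: unseen categories are mapped to unknown_value.
tolerant = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
tolerant.fit(X_train)                      # fit on the training data only
encoded_test = tolerant.transform(X_test)  # reuse the fitted encoder on test data
print(encoded_test)
```

Inside a Pipeline the same arguments would go on the 'ordinal_encoder' step. Whether -1 is a sensible code for the downstream classifier depends on the model and the data.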

Mark answered Sep 21 '22 23:09


I had the exact same problem. I used OneHotEncoder(handle_unknown='ignore') instead of OneHotEncoder(), and the issue was fixed.
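For illustration, here is a small sketch of what handle_unknown='ignore' does, using made-up string categories rather than the original data. A category seen only at transform time is encoded as a row of all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "z" appears only in the test split.
X_train = np.array([["a"], ["b"], ["c"]])
X_test = np.array([["b"], ["z"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)  # categories are learned from the training data only

# transform returns a sparse matrix by default; densify it for display.
one_hot_test = encoder.transform(X_test).toarray()
print(one_hot_test)
```

Note that this swaps ordinal encoding for one-hot encoding, so each feature expands into one column per category.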

Anirudh R.Huilgol. answered Sep 18 '22 23:09