Pipeline OrdinalEncoder ValueError Found unknown categories

Question

Please take it easy on me. I’m switching careers into data science and don’t have a CS or programming background—so I could be doing something profoundly stupid. I've researched for a few hours without success.

Objective: get Pipeline to run with OrdinalEncoder.

Problem: the code does not run with the OrdinalEncoder step in the pipeline. It does run without it. As best I can tell, OrdinalEncoder accepts two arguments, categories and dtype. Neither helps.

I’m passing the public diabetes data set to the model. Is this the issue? In other words, is passing high-cardinality features to OrdinalEncoder causing a mismatch between the train and test data after the model is built, i.e. the test split contains a value that the train split does not?

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('ordinal_encoder', OrdinalEncoder()),
    ('classifier', RandomForestClassifier(criterion='gini', n_estimators=100))])

# X, y are loaded from the diabetes data set (not shown)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Construct model
model = pipe.fit(X_train, y_train)

# Show results; roc_auc_score takes (y_true, y_score), and [:, 1] assumes a binary target
print("Hold-out AUC score: %.3f" % roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

Here’s the error I’m getting:

ValueError: Found unknown categories [17.0] in column 0 during transform

What am I doing wrong?

Setup:

The scikit-learn version is 0.20.2.
3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018, 02:44:43) 
[Clang 6.0 (clang-600.0.57)]
sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)

Mark · Accepted Answer

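Your problem is that the encoder has encountered a value in the test data that it did not see in the training data. This is expected and nothing to worry about. You just need to set the handle_unknown argument on your encoder. Note that OrdinalEncoder only accepts handle_unknown from scikit-learn 0.24 onward (handle_unknown='use_encoded_value' together with unknown_value); on 0.20.2 you would need to upgrade or switch to an encoder that supports it, such as OneHotEncoder(handle_unknown='ignore').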

You should fit encoders and scalers to the training data (but not the test data) and then use them to transform both training and test data. Thus, you must plan for the possibility of unexpected values in the test data.
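As a minimal sketch of that advice, assuming scikit-learn 0.24 or newer (where OrdinalEncoder gained handle_unknown) and a made-up single-column data set in which 17.0 appears only in the test split, the encoder is fitted on the training data alone and then applied to both splits. The choice of -1 as the placeholder code for unseen values is an assumption for illustration, not something the original answer specifies.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical): the value 17.0 appears only in the test split.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [17.0]])

# Requires scikit-learn >= 0.24. Unknown categories are encoded as -1
# instead of raising "Found unknown categories".
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the training data only, then transform both splits.
X_train_enc = encoder.fit_transform(X_train)
X_test_enc = encoder.transform(X_test)

print(X_train_enc.ravel())  # codes for the categories seen during fit
print(X_test_enc.ravel())   # the unseen 17.0 is mapped to -1
```

The same encoder instance can be dropped into the Pipeline step in place of OrdinalEncoder(). Whether -1 is a sensible code for the downstream classifier depends on the model, so treat it as a starting point.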

Anirudh R.Huilgol. · Answer

I had the exact same problem. I used OneHotEncoder(handle_unknown='ignore') instead of OneHotEncoder(), and the issue was fixed.
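To illustrate what that option does, here is a small sketch on made-up string data (the category names are hypothetical). With handle_unknown='ignore', a category that was not seen during fit is encoded as a row of all zeros rather than raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): "z" appears only in the test split.
X_train = np.array([["a"], ["b"], ["c"]])
X_test = np.array([["b"], ["z"]])

# Unknown categories are ignored at transform time instead of raising.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)

# transform returns a sparse matrix by default; densify it for display.
X_test_enc = encoder.transform(X_test).toarray()
print(X_test_enc)  # one column per training category; the "z" row is all zeros
```

One consequence worth keeping in mind is that every unseen category collapses to the same all-zero row, so the model cannot tell different unknown values apart.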

Tags: python-3.x, ordinal, scikit-learn, valueerror, pipeline

Pablo Honey
