Please take it easy on me. I’m switching careers into data science and don’t have a CS or programming background—so I could be doing something profoundly stupid. I've researched for a few hours without success.
Objective: get Pipeline to run with OrdinalEncoder.
Problem: the code does not run with the OrdinalEncoder call; it does run without it. As best I can tell, OrdinalEncoder accepts two arguments, categories and dtype. Neither helps.
I’m passing the public diabetes data set to the model. Is this the issue? IOW, is the passing of high cardinality features to OrdinalEncoder causing a problem between train/test data after model is built, i.e. the test split has a value that the train set does not?
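from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('ordinal_encoder', OrdinalEncoder()),
    ('classifier', RandomForestClassifier(criterion='gini', n_estimators=100))])

# X and y hold the features and labels loaded earlier (loading code not shown)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Construct model
model = pipe.fit(X_train, y_train)
# Show results; roc_auc_score expects (y_true, y_score).
# [:, 1] takes the positive-class probability, assuming a binary target.
print("Hold-out AUC score: %.3f" % roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))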
Here’s the error I’m getting:
ValueError: Found unknown categories [17.0] in column 0 during transform
What am I doing wrong?
Setup:
The scikit-learn version is 0.20.2.
3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018, 02:44:43)
[Clang 6.0 (clang-600.0.57)]
sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)
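Your problem is that the encoder has encountered a value in the test data that it did not see in the training data. This is fine in itself. You just need to tell the encoder what to do with unknown values via its 'handle_unknown' argument. Note that OrdinalEncoder only gained this argument (handle_unknown='use_encoded_value' together with unknown_value) in scikit-learn 0.24, so on 0.20.2 you would need to upgrade or switch to OneHotEncoder, which already accepts handle_unknown='ignore'.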
You should fit encoders and scalers to the training data (but not the test data) and then use them to transform both training and test data. Thus, you must plan for the possibility of unexpected values in the test data.
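To make the fit-on-train, transform-both idea concrete, here is a minimal sketch. The toy arrays are made up for illustration (17.0 appears only in the test split, mirroring the error message), and it assumes scikit-learn 0.24 or newer, where OrdinalEncoder accepts handle_unknown='use_encoded_value'. The choice of -1 as the code for unseen values is arbitrary.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical): 17.0 appears only in the test split.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [17.0]])

# Assumes scikit-learn >= 0.24. Unseen categories are mapped to -1
# instead of raising "Found unknown categories".
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the training data only, then transform both splits.
encoder.fit(X_train)
train_codes = encoder.transform(X_train)
test_codes = encoder.transform(X_test)

print(train_codes.ravel())
print(test_codes.ravel())
```

Inside a Pipeline this happens automatically: fit() only ever sees the training split, and predict() or predict_proba() reuses the fitted encoder on the test split.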
I had the exact same problem. I used OneHotEncoder(handle_unknown='ignore') instead of OneHotEncoder(), and the issue was fixed.
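For reference, a rough sketch of how that swap might look in a pipeline like the one in the question. The data here is a made-up toy set, not the diabetes data, and the imputer strategy, n_estimators and random_state values are my own choices for the example.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Toy data (hypothetical): one missing value in train, and 17.0 only in test.
X_train = np.array([[1.0], [2.0], [3.0], [1.0], [np.nan], [3.0]])
y_train = np.array([0, 1, 1, 0, 1, 1])
X_test = np.array([[2.0], [17.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # Unknown categories are encoded as an all-zero row instead of raising.
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])

pipe.fit(X_train, y_train)
proba = pipe.predict_proba(X_test)  # no ValueError for the unseen 17.0

# The unseen value becomes a row of zeros in the one-hot encoding.
unseen_row = pipe.named_steps["encoder"].transform(np.array([[17.0]])).toarray()
print(proba.shape, unseen_row)
```

One trade-off to keep in mind: one-hot encoding a high-cardinality column creates one output column per category, so the feature matrix can get wide.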