I have a multiclass classification task with 10 classes. As such, I used sklearn's OneHotEncoder to transform the one-column labels into 10-column labels. I then tried to fit the training data. Although this worked with RandomForestClassifier, I got the error below when fitting with GaussianNB:
ValueError: bad input shape (1203L, 10L)
I understand the allowed shape of y in these two classifiers is different:
GaussianNB:
y : array-like, shape (n_samples,)
RandomForest:
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The question is, why is this? Wouldn't this be contradictory to "All classifiers in scikit-learn do multiclass classification out-of-the-box"? Any way to go around it? Thanks!
The question is, why is this?
It is because of a slight misunderstanding: in scikit-learn you do not one-hot encode labels. You pass them as a one-dimensional vector of labels, so instead of
1 0 0
0 1 0
0 0 1
you literally pass
1 2 3
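A minimal sketch of the point above (the data and labels here are made up for illustration): GaussianNB happily accepts the raw integer labels of shape (n_samples,).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy 1-D features; three classes labelled 1, 2, 3 directly,
# with no one-hot encoding anywhere.
X = np.array([[0.1], [0.2], [1.1], [1.2], [2.1], [2.2]])
y = np.array([1, 1, 2, 2, 3, 3])  # raw labels, shape (n_samples,)

clf = GaussianNB().fit(X, y)  # no "bad input shape" error
print(clf.predict([[1.15]]))
```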
So why does random forest accept a different scheme? Because that scheme is not for the multiclass setting! It is for the multi-label setting, where each instance can have many labels at once, like
1 1 0
1 1 1
0 0 0
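To make the distinction concrete, here is a hypothetical multi-label sketch: RandomForestClassifier is fit on a binary indicator matrix of shape (n_samples, n_outputs), and its predictions come back in the same shape.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy 1-D features; each row of Y marks which of 3 labels apply
# to that sample (this is multi-label, not one-hot multiclass).
X = np.array([[0.0], [0.5], [1.0], [1.5]])
Y = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 0, 0],
              [0, 1, 0]])

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, Y)
pred = clf.predict(X)
print(pred.shape)  # one column per label
```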
Wouldn't this be contradictory to "All classifiers in scikit-learn do multiclass classification out-of-the-box"?
On the contrary: it is the simplest convention. You never need one-hot encoding unless the problem is genuinely multi-label.
Any way to go around it?
Yup, just do not encode: pass the raw labels :-)
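And if the labels have already been one-hot encoded, a sketch of undoing it: take the argmax over columns to recover a one-dimensional label vector (here the column index stands in for the original class), then fit as usual.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Already one-hot encoded labels (toy example).
Y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1]])

# Recover raw labels of shape (n_samples,): the index of the 1 in each row.
y = Y_onehot.argmax(axis=1)

X = np.array([[0.0], [1.0], [2.0]])
GaussianNB().fit(X, y)  # no shape error now
print(list(y))
```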