
How can I use one-hot encoded labels with some sklearn classifiers?

I have a multiclass classification task with 10 classes. As such, I used sklearn's OneHotEncoder to transform the single-column labels into 10-column labels. I then tried to fit the training data. Although I was able to do this with RandomForestClassifier, I got the error message below when fitting with GaussianNB:

ValueError: bad input shape (1203L, 10L)

I understand the allowed shape of y in these two classifiers is different:

GaussianNB:

y : array-like, shape (n_samples,)

RandomForest:

y : array-like, shape = [n_samples] or [n_samples, n_outputs]

The question is, why is this? Wouldn't this be contradictory to "All classifiers in scikit-learn do multiclass classification out-of-the-box"? Any way to go around it? Thanks!

George Liu Avatar asked Oct 16 '25 02:10



1 Answer

The question is, why is this?

It is because of a slight misunderstanding: in scikit-learn you do not one-hot encode the labels for multiclass classification. You pass them as a one-dimensional vector of labels, so instead of

1 0 0
0 1 0
0 0 1

you literally pass

1 2 3

So why does random forest accept a different scheme? Because that scheme is not for the multiclass setting! It is for the multi-label setting, where each instance can have many labels, like

1 1 0
1 1 1
0 0 0
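To make the multi-label case concrete, here is a minimal sketch. The data is random and purely illustrative (the names `X` and `Y` and the sizes are my assumptions, not from the question); it only shows that RandomForestClassifier treats a 2-D `y` as one output per column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative random data: 20 samples, 4 features (assumed sizes).
rng = np.random.RandomState(0)
X = rng.rand(20, 4)

# Multi-label indicator matrix: a row may contain zero, one, or several 1s,
# so it is not a one-hot encoding of a single class.
Y = rng.randint(0, 2, size=(20, 3))

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, Y)  # 2-D y is accepted: one output per column

pred = clf.predict(X[:5])
print(pred.shape)  # one prediction per sample and per label column
```

Each column is predicted independently, so a predicted row can contain several 1s or none at all, which is usually not what you want for a 10-class problem.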

Wouldn't this be contradictory to "All classifiers in scikit-learn do multiclass classification out-of-the-box"?

On the contrary: it is the simplest solution. scikit-learn never asks for one-hot labels unless the problem is multi-label.

Any way to go around it?

Yup, just do not encode; pass the raw labels :-)
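If the labels are already one-hot encoded, one way to get back to a 1-D vector is to take the argmax of each row. This is only a sketch on made-up data (the names `labels`, `one_hot` and the sizes are assumptions for illustration), and it assumes the one-hot columns are ordered by class:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Illustrative data: 30 samples, 4 features, 3 classes (assumed sizes).
rng = np.random.RandomState(0)
X = rng.rand(30, 4)
labels = np.arange(30) % 3              # raw 1-D labels: 0, 1, 2, 0, 1, 2, ...
one_hot = np.eye(3, dtype=int)[labels]  # shape (30, 3), what GaussianNB rejects

# Collapse each one-hot row to the index of its single 1.
y = one_hot.argmax(axis=1)              # shape (30,)

clf = GaussianNB().fit(X, y)            # 1-D y is accepted
print(clf.predict(X[:5]).shape)
```

If the encoding was produced by a fitted OneHotEncoder, its `inverse_transform` method should recover the original label values rather than column indices. The simplest option is still to skip the encoding step and keep the original label column.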

lejlot Avatar answered Oct 18 '25 09:10
