
Predicting multilabel data with sklearn

According to the docs, the OneVsRest classifier supports multilabel classification: http://scikit-learn.org/stable/modules/multiclass.html#multilabel-learning

Here's the code I'm trying to run:

from sklearn import metrics
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC

x = [[1,2,3],[3,3,2],[8,8,7],[3,7,1],[4,5,6]]
y = [['bar','foo'],['bar'],['foo'],['foo','jump'],['bar','fox','jump']]

y_enc = MultiLabelBinarizer().fit_transform(y)

train_x, train_y, test_x, test_y = train_test_split(x, y_enc, test_size=0.33)

clf = OneVsRestClassifier(SVC())
clf.fit(train_x, train_y)
predictions = clf.predict_proba(test_x)

my_metrics = metrics.classification_report( test_y, predictions)
print my_metrics

I get the following error:

Traceback (most recent call last):
  File "multilabel.py", line 178, in <module>
    clf.fit(train_x, train_y)
  File "/sklearn/lib/python2.6/site-packages/sklearn/multiclass.py", line 277, in fit
    Y = self.label_binarizer_.fit_transform(y)
  File "/sklearn/lib/python2.6/site-packages/sklearn/base.py", line 455, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/sklearn/lib/python2.6/site-packages/sklearn/preprocessing/label.py", line 302, in fit
    raise ValueError("Multioutput target data is not supported with "
ValueError: Multioutput target data is not supported with label binarization

Not using the MultiLabelBinarizer gives the same error, so I'm assuming that's not the problem. Does anyone know how to use this classifier for multilabel data?

kormak asked May 06 '16 12:05


2 Answers

Your train_test_split() output is not correct. Change this line:

train_x, train_y, test_x, test_y = train_test_split(x, y_enc, test_size=0.33)

To this:

train_x, test_x, train_y, test_y = train_test_split(x, y_enc, test_size=0.33)
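As a minimal sketch of why the order matters: for each array passed in, train_test_split returns the train part followed by the test part, so two inputs give (train_x, test_x, train_y, test_y). The toy data and random_state below are made up for illustration, and the import assumes scikit-learn 0.18 or later (older releases used sklearn.cross_validation).

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 6 samples with 3 features, one binary label row each.
features = [[1, 2, 3], [3, 3, 2], [8, 8, 7], [3, 7, 1], [4, 5, 6], [2, 2, 2]]
labels = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]]

# Each input array yields two outputs: its train part, then its test part.
train_x, test_x, train_y, test_y = train_test_split(
    features, labels, test_size=0.33, random_state=0
)

# Feature rows and label rows stay aligned within each split.
print(len(train_x), len(train_y))
print(len(test_x), len(test_y))
```

With the original unpacking order, the variable named train_y would actually hold the test features, which is what triggers the ValueError in fit.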

Also, classification_report expects class predictions rather than probabilities, so change clf.predict_proba to clf.predict. If you do want probabilities from clf.predict_proba, you'll need to change SVC() to SVC(probability=True).
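To illustrate the difference between the two calls, here is a small sketch on synthetic data. The dataset from make_multilabel_classification, the 45/15 split, and the random_state values are assumptions chosen only to make the example self-contained; they are not part of the original question.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic multilabel data (illustrative only): 60 samples, 3 possible labels.
X, Y = make_multilabel_classification(
    n_samples=60, n_features=5, n_classes=3, random_state=0
)

# probability=True is required for predict_proba to be available on SVC.
clf = OneVsRestClassifier(SVC(probability=True, random_state=0))
clf.fit(X[:45], Y[:45])

hard = clf.predict(X[45:])        # 0/1 indicator matrix, one column per label
soft = clf.predict_proba(X[45:])  # per-label probabilities between 0 and 1

print(hard.shape, soft.shape)
```

Only the 0/1 matrix from predict is a suitable input for classification_report; the probabilities would have to be thresholded first.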

Putting it all together:

from sklearn import metrics
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split  # replaces sklearn.cross_validation, which was removed in scikit-learn 0.20
from sklearn.svm import SVC


x = [[1,2,3],[3,3,2],[8,8,7],[3,7,1],[4,5,6]]
y = [['bar','foo'],['bar'],['foo'],['foo','jump'],['bar','fox','jump']]

mlb = MultiLabelBinarizer()
y_enc = mlb.fit_transform(y)

train_x, test_x, train_y, test_y = train_test_split(x, y_enc, test_size=0.33)

clf = OneVsRestClassifier(SVC(probability=True))
clf.fit(train_x, train_y)
predictions = clf.predict(test_x)

my_metrics = metrics.classification_report(test_y, predictions)
print(my_metrics)

This gives me no errors when I run it.

mark s. answered Sep 30 '22 11:09


I also ran into "ValueError: Multioutput target data is not supported with label binarization" with OneVsRestClassifier. In my case the cause was that the training data was a plain Python list; after converting it with np.array(), it worked.
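A rough sketch of that conversion, reusing the data from the question. Fitting and predicting on all five samples is done here only to keep the example short; it is not a meaningful evaluation.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

x = [[1, 2, 3], [3, 3, 2], [8, 8, 7], [3, 7, 1], [4, 5, 6]]
y = [['bar', 'foo'], ['bar'], ['foo'], ['foo', 'jump'], ['bar', 'fox', 'jump']]

mlb = MultiLabelBinarizer()
y_enc = mlb.fit_transform(y)

# Convert explicitly so both inputs are 2-D NumPy arrays rather than nested lists.
X = np.array(x)
Y = np.array(y_enc)

clf = OneVsRestClassifier(SVC())
clf.fit(X, Y)

predicted = clf.predict(X)
# Map the 0/1 indicator rows back to label names.
print(mlb.inverse_transform(predicted))
```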

user8431272 answered Sep 30 '22 10:09