Given this simple example of multilabel classification (taken from the question "use scikit-learn to classify into multiple categories"):
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
X_train = np.array(["new york is a hell of a town",
"new york was originally dutch",
"the big apple is great",
"new york is also called the big apple",
"nyc is nice",
"people abbreviate new york city as nyc",
"the capital of great britain is london",
"london is in the uk",
"london is in england",
"london is in great britain",
"it rains a lot in london",
"london hosts the british museum",
"new york is great and so is london",
"i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"], ["new york"],
["new york"],["london"],["london"],["london"],["london"],
["london"],["london"],["new york","london"],["new york","london"]]
X_test = np.array(['nice day in nyc',
'welcome to london',
'london is rainy',
'it is raining in britian',
'it is raining in britian and the big apple',
'it is raining in britian and nyc',
'hello welcome to new york. enjoy it here and london too'])
y_test_text = [["new york"],["london"],["london"],["london"],["new york", "london"],["new york", "london"],["new york", "london"]]
lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)
Y_test = lb.fit_transform(y_test_text)
classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
print "Accuracy Score: ",accuracy_score(Y_test, predicted)
The code runs fine and prints the accuracy score. However, if I change y_test_text to
y_test_text = [["new york"],["london"],["england"],["london"],["new york", "london"],["new york", "london"],["new york", "london"]]
I get
Traceback (most recent call last):
File "/Users/scottstewart/Documents/scikittest/example.py", line 52, in <module>
print "Accuracy Score: ",accuracy_score(Y_test, predicted)
File "/Library/Python/2.7/site-packages/sklearn/metrics/classification.py", line 181, in accuracy_score
differing_labels = count_nonzero(y_true - y_pred, axis=1)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/compressed.py", line 393, in __sub__
raise ValueError("inconsistent shapes")
ValueError: inconsistent shapes
Notice the introduction of the 'england' label, which is not in the training set. How do I use multilabel classification so that, if a new label shows up only in the test set, I can still run some sort of metrics? Or is that even possible?
EDIT: Thanks for the answers, guys. I guess my question is more about how the scikit-learn binarizer works or should work. Given my short sample code, I would also expect that if I changed y_test_text to
y_test_text = [["new york"],["new york"],["new york"],["new york"],["new york"],["new york"],["new york"]]
it would work, since we have fitted for that label. But in this case I get
ValueError: Can't handle mix of binary and multilabel-indicator
You can, if you "introduce" the new label in the training y set too, like this:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
X_train = np.array(["new york is a hell of a town",
"new york was originally dutch",
"the big apple is great",
"new york is also called the big apple",
"nyc is nice",
"people abbreviate new york city as nyc",
"the capital of great britain is london",
"london is in the uk",
"london is in england",
"london is in great britain",
"it rains a lot in london",
"london hosts the british museum",
"new york is great and so is london",
"i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],
["new york"],["new york"],["london"],["london"],
["london"],["london"],["london"],["london"],
["new york","England"],["new york","london"]]
X_test = np.array(['nice day in nyc',
'welcome to london',
'london is rainy',
'it is raining in britian',
'it is raining in britian and the big apple',
'it is raining in britian and nyc',
'hello welcome to new york. enjoy it here and london too'])
y_test_text = [["new york"],["new york"],["new york"],["new york"],["new york"],["new york"],["new york"]]
lb = preprocessing.MultiLabelBinarizer(classes=("new york","london","England"))
Y = lb.fit_transform(y_train_text)
Y_test = lb.transform(y_test_text)  # reuse the fitted binarizer; do not refit on test labels
print Y_test
classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
print predicted
print "Accuracy Score: ",accuracy_score(Y_test, predicted)
Output:
Accuracy Score: 0.571428571429
The key section is:
y_train_text = [["new york"],["new york"],["new york"],
["new york"],["new york"],["new york"],
["london"],["london"],["london"],["london"],
["london"],["london"],["new york","England"],
["new york","london"]]
Here we inserted "England" too. That makes sense: otherwise, how could the classifier predict a label it has never seen? This way we have created a three-label classification problem.
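To see why the original version failed, here is a minimal sketch with shortened, made-up label lists (not the question's full data). It assumes the binarizer is refit on each label set, as the question's code does with fit_transform. Each fit builds its own set of columns, so the resulting matrices have different widths.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Shortened, made-up label lists standing in for the question's data.
y_train_text = [["new york"], ["london"], ["new york", "london"]]
y_test_extra = [["new york"], ["england"], ["new york", "london"]]  # unseen label
y_test_single = [["new york"], ["new york"], ["new york"]]          # only one label

# A fresh binarizer is fit on each list, so each one learns its own columns.
Y_train = MultiLabelBinarizer().fit_transform(y_train_text)
Y_extra = MultiLabelBinarizer().fit_transform(y_test_extra)
Y_single = MultiLabelBinarizer().fit_transform(y_test_single)

print(Y_train.shape)   # two columns: london, new york
print(Y_extra.shape)   # three columns: england, london, new york
print(Y_single.shape)  # one column: new york
```

The classifier is trained on the two-column matrix, so its predictions also have two columns. Neither the three-column nor the one-column test matrix lines up with them, which is presumably where the "inconsistent shapes" and "mix of binary and multilabel-indicator" errors come from.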
EDITED:
lb = preprocessing.MultiLabelBinarizer(classes=("new york","london","England"))
You have to pass the classes as an argument to MultiLabelBinarizer(); the number of columns is then fixed, and it will work with any y_test_text whose labels are among those classes.
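A minimal sketch of that idea, again with shortened, made-up label lists. Note one deliberate difference from the full listing above: the test labels here go through transform rather than fit_transform, so the binarizer is fit only once. That is my variation, not something the listing requires when classes is given.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# The label set is fixed up front; the column order follows this tuple.
classes = ("new york", "london", "England")
lb = MultiLabelBinarizer(classes=classes)

# Shortened, made-up label lists standing in for the question's data.
y_train_text = [["new york"], ["london"], ["new york", "England"]]
y_test_text = [["new york"], ["new york"], ["new york"]]

Y = lb.fit_transform(y_train_text)   # fit once, on the training labels
Y_test = lb.transform(y_test_text)   # reuse the same columns for the test labels

print(lb.classes_)
print(Y.shape, Y_test.shape)  # both have three columns
print(Y_test)
```

Because both matrices have one column per entry in classes, the test matrix keeps three columns even when only "new york" appears in it, so it can be compared with the classifier's predictions.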