I am dealing with multi-label classification using OneVsRestClassifier and SVC:
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in very old versions

L = 3
X, y = make_multilabel_classification(n_classes=L, n_labels=2,
                                      allow_unlabeled=True,
                                      random_state=1, return_indicator=True)

model_to_set = OneVsRestClassifier(SVC())

parameters = {
    "estimator__C": [1, 2, 4, 8],
    "estimator__kernel": ["poly", "rbf"],
    "estimator__degree": [1, 2, 3, 4],
}

# note: recent scikit-learn versions reject scoring='f1' for multilabel targets;
# use e.g. scoring='f1_weighted' there
model_tunning = GridSearchCV(model_to_set, param_grid=parameters,
                             scoring='f1')
model_tunning.fit(X, y)

print(model_tunning.best_score_)
print(model_tunning.best_params_)
# 0.855175822314
# {'estimator__kernel': 'poly', 'estimator__C': 1, 'estimator__degree': 3}
1st question

What does the number 0.85 represent? Is it the best score among the L classifiers, or an averaged one? Similarly, does the set of parameters stand for the best scorer among the L classifiers?
2nd question

If I am right that OneVsRestClassifier builds L classifiers, one per label, one would expect to be able to access or observe the performance on EACH label. But how, in the above example, can I obtain L scores from the GridSearchCV object?
EDIT
To simplify the problem and help myself understand OneVsRestClassifier better, before tuning the model:

model_to_set.fit(X, y)
gp = model_to_set.predict(X)                 # the "global" prediction
fp = model_to_set.estimators_[0].predict(X)  # the first-label prediction
sp = model_to_set.estimators_[1].predict(X)  # the second-label prediction
tp = model_to_set.estimators_[2].predict(X)  # the third-label prediction
It can be shown that gp.T[0]==fp, gp.T[1]==sp and gp.T[2]==tp. So the "global" prediction is simply the column-wise concatenation of the L individual per-label predictions, and the 2nd question is solved.
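For completeness, here is a minimal check of that claim, assuming numpy is available and reusing gp, fp, sp, tp from the snippet above:

import numpy as np

# each column of the multilabel prediction equals the corresponding
# per-label estimator's prediction
assert np.array_equal(gp.T[0], fp)
assert np.array_equal(gp.T[1], sp)
assert np.array_equal(gp.T[2], tp)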
But it is still confusing to me: if one meta-classifier OneVsRestClassifier contains L classifiers, how can GridSearchCV return only ONE best score, corresponding to one of the 4*2*4 sets of parameters, for a meta-classifier that has L classifiers?

Any comment would be greatly appreciated.
Adapted algorithm: this approach uses algorithms adapted to handle multi-label data directly, rather than transforming the problem first. In scikit-multilearn, the multi-label k-nearest-neighbours classifier (MLkNN) is one such adapted algorithm for multi-label classification.
One-vs-rest (OvR for short, also referred to as One-vs-All or OvA) is a heuristic method for using binary classification algorithms for multi-class classification. It involves splitting the multi-class dataset into multiple binary classification problems.
The difference between multi-class classification and multi-label classification is that in multi-class problems the classes are mutually exclusive, whereas in multi-label problems each label represents a different classification task, although the tasks are somehow related.
Multi-label classification involves predicting zero or more class labels. Unlike normal classification tasks where class labels are mutually exclusive, multi-label classification requires specialized machine learning algorithms that support predicting multiple mutually non-exclusive classes or “labels.”
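To make the distinction concrete, here is a minimal sketch; the arrays are illustrative toy data, not taken from the question's dataset:

import numpy as np

# multi-class: exactly one class per sample
y_multiclass = np.array([0, 2, 1, 2])

# multi-label: a binary indicator matrix, each row may contain several 1s (or none),
# which is the format returned by make_multilabel_classification in the question
y_multilabel = np.array([[1, 0, 1],
                         [0, 0, 0],
                         [1, 1, 0],
                         [0, 1, 1]])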
GridSearchCV builds a grid from your parameter values and evaluates your OneVsRestClassifier as an atomic classifier (i.e. GridSearchCV does not know what is inside this meta-classifier).
First: 0.85 is the best score of OneVsRestClassifier among all possible combinations of the parameters ("estimator__C", "estimator__kernel", "estimator__degree"), which is 32 combinations in your case (4*2*4). That means GridSearchCV evaluates 32 (again, only in this particular case) possible OneVsRestClassifier's, each of which contains L SVC's. All L classifiers inside one OneVsRestClassifier share the same parameter values (but each of them learns to recognize its own class out of the L possible),
i.e. from the set of
{OneVsRestClassifier(SVC(C=1, kernel="poly", degree=1)),
OneVsRestClassifier(SVC(C=1, kernel="poly", degree=2)),
...,
OneVsRestClassifier(SVC(C=8, kernel="rbf", degree=3)),
OneVsRestClassifier(SVC(C=8, kernel="rbf", degree=4))}
it chooses the one with the best score.
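As a sanity check, the size of that grid can be computed explicitly; a minimal sketch using sklearn.model_selection.ParameterGrid on the parameters dict from the question:

from sklearn.model_selection import ParameterGrid

# 32 candidate OneVsRestClassifier configurations, i.e. 4 * 2 * 4
print(len(ParameterGrid(parameters)))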
model_tunning.best_params_ here represents the parameters of the OneVsRestClassifier(SVC()) with which it achieves model_tunning.best_score_. You can get that best OneVsRestClassifier from the model_tunning.best_estimator_ attribute.
Second: there is no ready-to-use code to obtain separate scores for the L classifiers from OneVsRestClassifier, but you can look at the implementation of the OneVsRestClassifier.fit method, or use this (it should work):
# Here X, y - your dataset
one_vs_rest = model_tunning.best_estimator_
yT = one_vs_rest.label_binarizer_.transform(y).toarray().T

# Iterate through all L classifiers
for classifier, is_ith_class in zip(one_vs_rest.estimators_, yT):
    print(classifier.score(X, is_ith_class))
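Note that when y is already a dense binary indicator matrix, as in the question, iterating over its columns directly gives the same per-label scores; a minimal equivalent sketch:

one_vs_rest = model_tunning.best_estimator_

# y.T[i] is the binary target for label i, matched with the i-th inner SVC
for classifier, label_column in zip(one_vs_rest.estimators_, y.T):
    print(classifier.score(X, label_column))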
Inspired by @Olologin's answer, I realized that 0.85 is the best weighted average of the f1 scores (in this example) obtained by the L per-label predictions. In the following code, I evaluate the model by an inner test, using the macro average of the f1 score:
from sklearn.metrics import f1_score
import numpy as np

# Case A, inspect the F1 score using the meta-classifier
F_A = f1_score(y, model_tunning.best_estimator_.predict(X), average='macro')

# Case B, inspect the F1 score of each label (binary task) and collect them by macro average
F_B = []
for label, clf in zip(y.T, model_tunning.best_estimator_.estimators_):
    F_B.append(f1_score(label, clf.predict(X)))
F_B = np.mean(F_B)

F_A == F_B  # True
So it implies that GridSearchCV applies one of the 4*2*4 sets of parameters to build the meta-classifier, which in turn makes a prediction for each label with one of the L classifiers. The outcome is L f1 scores for the L labels, each measuring the performance of one binary task. Finally, a single score is obtained by averaging (macro or weighted average, as specified by the average parameter of f1_score) the L f1 scores.

GridSearchCV then chooses the best averaged f1 score among the 4*2*4 sets of parameters, which is 0.85 in this example.
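To see all of the averaged scores rather than only the best one, the per-candidate results can be inspected; a minimal sketch (on older scikit-learn versions the equivalent attribute is grid_scores_ rather than cv_results_):

# one mean cross-validated score per parameter combination (32 entries here)
for params, mean_score in zip(model_tunning.cv_results_['params'],
                              model_tunning.cv_results_['mean_test_score']):
    print(mean_score, params)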
Though it is convenient to use the wrapper for a multi-label problem, it can only maximize the averaged f1 score with the same set of parameters used to build all L classifiers. If one wants to optimize the performance of each label separately, one seems to have to build L classifiers without using the wrapper, as sketched below.
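A minimal sketch of that idea, tuning an independent SVC per label and reusing X, y and L from the question (the names param_grid_per_label and best_per_label are illustrative, not from the original code):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid_per_label = {
    "C": [1, 2, 4, 8],
    "kernel": ["poly", "rbf"],
    "degree": [1, 2, 3, 4],
}

best_per_label = []
for i in range(L):
    # y[:, i] is the binary target for the i-th label
    gs = GridSearchCV(SVC(), param_grid=param_grid_per_label, scoring='f1')
    gs.fit(X, y[:, i])
    best_per_label.append(gs.best_estimator_)
    print("label %d: best score %.3f, params %s" % (i, gs.best_score_, gs.best_params_))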
As for your second question, you might want to use GridSearchCV with scikit-multilearn's BinaryRelevance classifier. Like OneVsRestClassifier, Binary Relevance creates L single-label classifiers, one per label. For each label the training target is 1 if the label is present and 0 if it is not. The best selected classifier set is the BinaryRelevance class instance in the best_estimator_ property of GridSearchCV. To predict probability estimates, use the predict_proba method of the BinaryRelevance object. An example can be found in the scikit-multilearn docs for model selection.
In your case I would run the following code:
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

model_to_set = BinaryRelevance(SVC())

parameters = {
    "classifier__estimator__C": [1, 2, 4, 8],
    "classifier__estimator__kernel": ["poly", "rbf"],
    "classifier__estimator__degree": [1, 2, 3, 4],
}

model_tunning = GridSearchCV(model_to_set, param_grid=parameters,
                             scoring='f1')
model_tunning.fit(X, y)

# for some X_test testing set
predictions = model_tunning.best_estimator_.predict(X_test)

# average=None gives the per-label score
metrics.f1_score(y_test, predictions, average=None)
Please note that there are much better methods for multi-label classification than Binary Relevance :) You can find them in Madjarov's comparison or my recent paper.