I am dealing with multi-label classification with OneVsRestClassifier and SVC,
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
L=3
X, y = make_multilabel_classification(n_classes=L, n_labels=2,
allow_unlabeled=True,
random_state=1, return_indicator=True)
model_to_set = OneVsRestClassifier(SVC())
parameters = {
"estimator__C": [1,2,4,8],
"estimator__kernel": ["poly","rbf"],
"estimator__degree":[1, 2, 3, 4],
}
model_tunning = GridSearchCV(model_to_set, param_grid=parameters,
scoring='f1')
model_tunning.fit(X, y)
print model_tunning.best_score_
print model_tunning.best_params_
#0.855175822314
#{'estimator__kernel': 'poly', 'estimator__C': 1, 'estimator__degree': 3}
1st question
What is the number 0.85 representing for? Is it the best score among the L classifiers or the averaged one? Similarly, does the set of parameters stand for the best-scorer among L classifiers?
2nd question
Based on the fact that, if I am right, the OneVsRestClassifier literally builds L classifiers for each label, one can expect to access or observe the performance of EACH LABEL. But how, in the above example, to obtain L scores from the GridSearchCV object?
EDIT
To simplify the problem and help myself understand more about OneVsRestClassifier, before tuning model,
model_to_set.fit(X,y)
gp = model_to_set.predict(X) # the "global" prediction
fp = model_to_set.estimators_[0].predict(X) # the first-class prediction
sp = model_to_set.estimators_[1].predict(X) # the second-class prediction
tp = model_to_set.estimators_[2].predict(X) # the third-class prediction
It can be shown that gp.T[0]==fp, gp.T[1]==sp and gp.T[2]==tp. So the "global" prediction is simply the 'sequential' L individual predictions and the 2nd question is solved.
But it is still confusing for me that if one meta-classifier OneVsRestClassifier contains L classifiers, how can GridSearchCV returns only ONE best score, corresponding to one of 4*2*4 sets of parameters, for a meta-classifier OneVsRestClassifier having L classifiers?
It would be fairly appreciated to see any comment.
Adapted algorithm This technique uses adaptive algorithms, which are used to perform multi-label classification rather than conducting problem transformation directly. In Scikit-multilearn, we have multi-label-k-nearest-neighbor (MLkNN), which is used to handle multi-label classification.
One-vs-rest (OvR for short, also referred to as One-vs-All or OvA) is a heuristic method for using binary classification algorithms for multi-class classification. It involves splitting the multi-class dataset into multiple binary classification problems.
Difference between multi-class classification & multi-label classification is that in multi-class problems the classes are mutually exclusive, whereas for multi-label problems each label represents a different classification task, but the tasks are somehow related.
Multi-label classification involves predicting zero or more class labels. Unlike normal classification tasks where class labels are mutually exclusive, multi-label classification requires specialized machine learning algorithms that support predicting multiple mutually non-exclusive classes or “labels.”
GridSearchCV creates grid from your parameter values, it evaluates your OneVsRestClassifier as atomic classifier (I.e. GridSearchCV doesn't know what is inside this metaclassifier)
First: 0.85 is the best score of OneVsRestClassifier among all possible combinations (16 combinations in your case, 4*2*4) of parameters ("estimator__C", "estimator__kernel", "estimator__degree"), it means that GridSearchCV evaluates 16 (again, it's only in this particular case) possible OneVsRestClassifier's each of which contains L SVC's. All of that L classifiers inside one OneVsRestClassifier have same values of parameters (but each of them is learning to recognize their own class from L possible)
i.e. from set of
{OneVsRestClassifier(SVC(C=1, kernel="poly", degree=1)),
OneVsRestClassifier(SVC(C=1, kernel="poly", degree=2)),
...,
OneVsRestClassifier(SVC(C=8, kernel="rbf", degree=3)),
OneVsRestClassifier(SVC(C=8, kernel="rbf", degree=4))}
it chooses one with the best score.
model_tunning.best_params_ here represents parameters for OneVsRestClassifier(SVC()) with which it will achieve model_tunning.best_score_.
You can get that best OneVsRestClassifier from model_tunning.best_estimator_ attribute.
Second: There is no ready to use code to obtain separate scores for L classifiers from OneVsRestClassifier, but you can look at implementation of OneVsRestClassifier.fit method, or take this (should work :) ):
# Here X, y - your dataset
one_vs_rest = model_tunning.best_estimator_
yT = one_vs_rest.label_binarizer_.transform(y).toarray().T
# Iterate through all L classifiers
for classifier, is_ith_class in zip(one_vs_rest.estimators_, yT):
print(classifier.score(X, is_ith_class))
Inspired by @Olologin 's answer, I realized that 0.85 is the best weighted average of f1 scores (in this example) obtained by L predictions. In the following code, I evaluate the model by inner test, using macro average of f1 score:
# Case A, inspect F1 score using the meta-classifier
F_A = f1_score(y, model_tunning.best_estimator_.predict(X), average='macro')
# Case B, inspect F1 scores of each label (binary task) and collect them by macro average
F_B = []
for label, clc in zip(y.T, model_tunning.best_estimator_.estimators_):
F_B.append(f1_score(label, clf.predict(X)))
F_B = mean(F_B)
F_A==F_B # True
So it implies that the GridSearchCV applies one of 4*2*4 sets of parameters to build the meta-classifier which in turn makes prediction on each label with one of the L classifiers. The outcome will be L f1 scores for L labels, each of which is a performance of a binary task. Finally, a single score is obtained by taking average (macro or weighted average, specified by parameter in f1_score) of L f1 scores.
The GridSearchCV then choose the best averaged f1 scores among 4*2*4 sets of parameters, which is 0.85 in this example.
Though it is convenient to use the wrapper for multi-label problem, it can only maximize the averaged f1 score with a same set of parameters used to build L classifiers. If one wants to optimize the performance of each label separately, one seems to have to build L classifiers without using the wrapper.
As for your second question, you might want to used GridSearchCV with scikit-multilearn's BinaryRelevance classifier. Like OneVsRestClassifier, Binary Relevance creates L single-label classifiers, one per label. For each label the training data is 1 if label is present and 0 if not present. The best selected classifier set is the BinaryRelevance class instance in best_estimator_ property of GridSearchCV. Use for predicting floats of probabilities use the predict_proba method of the BinaryRelevance object. An example can be found in the scikit-multilearn docs for model selection.
In your case I would run the following code:
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.model_selection import GridSearchCV
import sklearn.metrics
model_to_set = BinaryRelevance(SVC())
parameters = {
"classifier__estimator__C": [1,2,4,8],
"classifier__estimator__kernel": ["poly","rbf"],
"classifier__estimator__degree":[1, 2, 3, 4],
}
model_tunning = GridSearchCV(model_to_set, param_grid=parameters,
scoring='f1')
model_tunning.fit(X, y)
# for some X_test testing set
predictions = model_tunning.best_estimator_.predict(X_test)
# average=None gives per label score
metrics.f1_score(y_test, predictions, average = None)
Please note that there much better methods for multi-label classification than Binary Relevance :) You can find them in madjarov's comparison or my recent paper.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With