GridSearchCV: performance metrics on a selected class [unbalanced data-set]

Is there a way to run a grid search over parameter values optimised for a score (e.g. 'f1') on a selected class, rather than the default score for all the classes?

[Edit] The assumption is that such a grid search should return a set of parameters maximising a score (e.g. 'f1', 'accuracy', 'recall') only for a selected class, rather than the overall score across all classes. Such an approach seems useful, e.g., for highly unbalanced data-sets, when attempting to construct a classifier that does a reasonable job on a class with a small number of instances.

An example of a GridSearchCV with a default scoring approach (here: 'f1' over all the classes):

from __future__ import print_function

from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer scikit-learn
from sklearn.grid_search import GridSearchCV             # sklearn.model_selection in newer scikit-learn
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# X and y (feature matrix and labels) are assumed to be defined already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4, 1e-5],
                 'C': [1, 50, 100, 500, 1000, 5000]},
                {'kernel': ['linear'], 'C': [1, 100, 500, 1000, 5000]}]

clf = GridSearchCV(SVC(), tuned_parameters, cv=4, scoring='f1', n_jobs=-1)
clf.fit(X_train, y_train)

print("Best parameters set found on development set:")
print()
print(clf.best_estimator_)

y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))

How can I optimise parameters for the best performance on a selected class, or incorporate testing a range of class_weight values into GridSearchCV?

asked Jul 30 '15 by user3661230



2 Answers

Yes, you'll want to use the scoring parameter in GridSearchCV(). There are a handful of pre-built scoring functions you can reference via string (such as 'f1'); the full list can be found here: http://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values. Alternatively, you can make your own custom scoring function with sklearn.metrics.make_scorer.

If that isn't enough detail for you, post a reproducible example and we can plug this into some actual code.
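
For instance, a custom scorer that targets a single class might look like the following. This is a minimal sketch, assuming a binary problem whose positive class is labelled 1 and reusing the tuned_parameters grid from the question:

from sklearn.metrics import make_scorer, f1_score
from sklearn.grid_search import GridSearchCV   # sklearn.model_selection in newer scikit-learn
from sklearn.svm import SVC

# F1 computed only for the class labelled 1 (binary case); for a multiclass problem,
# f1_score(labels=[some_class], average='macro') restricts scoring to one class in a similar way
class_f1 = make_scorer(f1_score, pos_label=1)

clf = GridSearchCV(SVC(), tuned_parameters, cv=4, scoring=class_f1, n_jobs=-1)
clf.fit(X_train, y_train)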

answered Oct 18 '22 by David


Scoring metrics that require additional parameters are not available among the pre-built scoring functions used by grid search.

In this case, the additional parameter required is the class for which scoring should be done.

You need to import make_scorer and fbeta_score from sklearn.metrics.

make_scorer converts metrics into callables that can be used for model evaluation.

The F-beta score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0.
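
For reference, the formula is F_beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall).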

Parameters for F-beta

beta: beta < 1 lends more weight to precision, while beta > 1 favors recall; in the limit, beta -> 0 considers only precision and beta -> inf only recall

pos_label: specifies the class for which scoring needs to be done (str or int, 1 by default)

A code example is below:

from sklearn.metrics import make_scorer, fbeta_score

# F2 score (beta=2 weights recall higher than precision) for the class labelled 1
f2_score = make_scorer(fbeta_score, beta=2, pos_label=1)

# Pass the custom scorer object instead of a scoring string
clf = GridSearchCV(SVC(), tuned_parameters, cv=4, scoring=f2_score, n_jobs=-1)
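
To also search over class_weight, as asked in the question, the weights can be added to the parameter grid and scored with the same custom scorer. This is a sketch only; the specific weight values below are illustrative assumptions:

# class_weight is an ordinary SVC parameter, so it can be tuned like any other;
# the dict weights up-weight the minority class labelled 1 (values are illustrative)
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4, 1e-5],
                     'C': [1, 50, 100, 500, 1000, 5000],
                     'class_weight': [None, {1: 5}, {1: 10}]}]

clf = GridSearchCV(SVC(), tuned_parameters, cv=4, scoring=f2_score, n_jobs=-1)
clf.fit(X_train, y_train)
print(clf.best_params_)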
answered Oct 18 '22 by norman