
Does it make sense to use sklearn GridSearchCV together with CalibratedClassifierCV?

What I want to do is derive a classifier that is optimal in its parameters with respect to a given metric (for example the recall score) but also calibrated (in the sense that the output of the predict_proba method can be directly interpreted as a confidence level, see https://scikit-learn.org/stable/modules/calibration.html). Does it make sense to use sklearn GridSearchCV together with CalibratedClassifierCV, that is, to fit a classifier via GridSearchCV and then pass the GridSearchCV output to the CalibratedClassifierCV object? If I understand correctly, the CalibratedClassifierCV object fits the given estimator cv times, and the probabilities from each of the folds are then averaged for prediction. However, the result of the GridSearchCV could be different for each of the folds.

MS91 asked Feb 17 '20 14:02

1 Answer

Yes, you can do this and it would work. I don't know if it makes sense to do it, but the least I can do is explain what I believe would happen.

We can compare doing this to the alternative, which is taking the best estimator from the grid search and feeding that to the calibration.

  1. Getting the best estimator and feeding it to CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV

iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()

# Run the grid search first, then calibrate only the single best estimator.
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)
calibration_clf = CalibratedClassifierCV(clf.best_estimator_)
calibration_clf.fit(iris.data, iris.target)
calibration_clf.predict_proba(iris.data[0:10])

array([[0.91887427, 0.07441489, 0.00671085],
       [0.91907451, 0.07417992, 0.00674558],
       [0.91914982, 0.07412815, 0.00672202],
       [0.91939591, 0.0738401 , 0.00676399],
       [0.91894279, 0.07434967, 0.00670754],
       [0.91910347, 0.07414268, 0.00675385],
       [0.91944594, 0.07381277, 0.0067413 ],
       [0.91903299, 0.0742324 , 0.00673461],
       [0.91951618, 0.07371877, 0.00676505],
       [0.91899007, 0.07426733, 0.00674259]])

  2. Feeding the grid search into CalibratedClassifierCV

from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV

iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()

# Pass the unfitted grid search itself; CalibratedClassifierCV clones
# and refits it on each calibration fold.
clf = GridSearchCV(svc, parameters)
cal_clf = CalibratedClassifierCV(clf)
cal_clf.fit(iris.data, iris.target)
cal_clf.predict_proba(iris.data[0:10])

array([[0.900434  , 0.0906832 , 0.0088828 ],
       [0.90021418, 0.09086583, 0.00891999],
       [0.90206035, 0.08900572, 0.00893393],
       [0.9009212 , 0.09012478, 0.00895402],
       [0.90101953, 0.0900889 , 0.00889158],
       [0.89868497, 0.09242412, 0.00889091],
       [0.90214948, 0.08889812, 0.0089524 ],
       [0.8999936 , 0.09110965, 0.00889675],
       [0.90204193, 0.08896843, 0.00898964],
       [0.89985101, 0.09124147, 0.00890752]])

Notice that the output probabilities are slightly different between the two.

The difference between each method is:

  1. Using the best estimator only does the calibration across 5 splits (the default cv). The same estimator, with the same hyperparameters, is used in all 5 splits.

  2. Feeding in the grid search fits a fresh grid search on each of the 5 CV splits from calibration. You are essentially doing cross validation on 4/5 of the data each time, choosing the best estimator for that 4/5 of the data, and then doing the calibration with that best estimator on the remaining fifth. You could have slightly different models running on each set of test data depending on what the grid search chooses.
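You can actually see whether the folds picked different hyperparameters by inspecting the fitted object. This is my own sketch, relying on the `calibrated_classifiers_` attribute of a fitted CalibratedClassifierCV; note the inner attribute holding the fitted sub-estimator is named `estimator` in recent scikit-learn releases and `base_estimator` in older ones, so the code checks for both:

```python
from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
clf = GridSearchCV(svm.SVC(), parameters)

cal_clf = CalibratedClassifierCV(clf)
cal_clf.fit(iris.data, iris.target)

# One entry per calibration split; each wraps its own fitted grid search,
# so best_params_ may differ from split to split.
for i, cc in enumerate(cal_clf.calibrated_classifiers_):
    inner = getattr(cc, 'estimator', None)
    if inner is None:  # attribute name in older scikit-learn versions
        inner = cc.base_estimator
    print(f'split {i}: {inner.best_params_}')
```

If every split prints the same parameters, the two approaches should behave very similarly; if they differ, you are averaging over genuinely different models.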

I think grid search and calibration serve different goals, so in my opinion I would work on each separately and go with the first way specified above: get the model that works best, then feed that into the calibration.

However, I don't know your specific goals, so I can't say that the second way described here is the WRONG way. You can always try both ways, see what gives you better performance, and go with the one that works best.
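When comparing the two (or checking whichever you pick), one way to judge calibration quality is `calibration_curve`, which bins the predicted probabilities and compares the mean prediction in each bin to the actual fraction of positives. A minimal sketch; the binary subset of iris and the train/test split are my own illustrative choices, not part of the question:

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# Restrict iris to classes 0 and 1 so calibration_curve applies directly.
iris = datasets.load_iris()
mask = iris.target < 2
X, y = iris.data[mask], iris.target[mask]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
clf = GridSearchCV(svm.SVC(), parameters)
clf.fit(X_train, y_train)

# Calibrate the single best estimator (the first way above).
cal = CalibratedClassifierCV(clf.best_estimator_)
cal.fit(X_train, y_train)

# Per bin: mean predicted probability vs. actual fraction of positives.
# The closer the pairs track each other, the better calibrated the model.
prob_pos = cal.predict_proba(X_test)[:, 1]
frac_pos, mean_pred = calibration_curve(y_test, prob_pos, n_bins=5)
print(list(zip(mean_pred, frac_pos)))
```

The same check run on the second approach's `cal_clf` would let you compare the two on calibration rather than only on the raw probability values.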

jawsem answered Sep 22 '22 12:09