What I want to do is derive a classifier that is optimal in its parameters with respect to a given metric (for example the recall score) but also calibrated, in the sense that the output of the predict_proba method can be directly interpreted as a confidence level (see https://scikit-learn.org/stable/modules/calibration.html).

Does it make sense to use sklearn's GridSearchCV together with CalibratedClassifierCV, that is, to fit a classifier via GridSearchCV and then pass the GridSearchCV output to the CalibratedClassifierCV object?

If I understand correctly, the CalibratedClassifierCV object fits the given estimator cv times, and the probabilities from each of the folds are then averaged for prediction. However, the result of the GridSearchCV could be different for each of the folds.
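For concreteness, here is a minimal sketch of the composition I have in mind (the dataset and parameter grid are just placeholders, and scoring='recall' assumes a binary problem):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder binary dataset and a small placeholder parameter grid.
X, y = make_classification(n_samples=500, random_state=0)
param_grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# Grid search tuned for recall, then wrapped in the calibrator.
grid = GridSearchCV(SVC(), param_grid, scoring='recall')
calibrated = CalibratedClassifierCV(grid)
calibrated.fit(X, y)
calibrated.predict_proba(X[:5])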
Yes, you can do this and it will work. I don't know whether it makes sense to do it, but the least I can do is explain what I believe would happen.
We can compare doing this to the alternative, which is taking the best estimator from the grid search and feeding that to the calibration. That alternative looks like this:
from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = svm.SVC()

# Run the grid search once on the full data and keep only the best estimator.
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)

# Calibrate that single best estimator (5-fold CV by default).
calibration_clf = CalibratedClassifierCV(clf.best_estimator_)
calibration_clf.fit(iris.data, iris.target)
calibration_clf.predict_proba(iris.data[0:10])
array([[0.91887427, 0.07441489, 0.00671085],
[0.91907451, 0.07417992, 0.00674558],
[0.91914982, 0.07412815, 0.00672202],
[0.91939591, 0.0738401 , 0.00676399],
[0.91894279, 0.07434967, 0.00670754],
[0.91910347, 0.07414268, 0.00675385],
[0.91944594, 0.07381277, 0.0067413 ],
[0.91903299, 0.0742324 , 0.00673461],
[0.91951618, 0.07371877, 0.00676505],
[0.91899007, 0.07426733, 0.00674259]])
Now the approach from the question: pass the whole GridSearchCV object to CalibratedClassifierCV.

from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = svm.SVC()

# The unfitted grid search is handed to the calibrator, which clones and
# refits it on each of its own CV splits.
clf = GridSearchCV(svc, parameters)
cal_clf = CalibratedClassifierCV(clf)
cal_clf.fit(iris.data, iris.target)
cal_clf.predict_proba(iris.data[0:10])
array([[0.900434 , 0.0906832 , 0.0088828 ],
[0.90021418, 0.09086583, 0.00891999],
[0.90206035, 0.08900572, 0.00893393],
[0.9009212 , 0.09012478, 0.00895402],
[0.90101953, 0.0900889 , 0.00889158],
[0.89868497, 0.09242412, 0.00889091],
[0.90214948, 0.08889812, 0.0089524 ],
[0.8999936 , 0.09110965, 0.00889675],
[0.90204193, 0.08896843, 0.00898964],
[0.89985101, 0.09124147, 0.00890752]])
Notice that the predicted probabilities are slightly different between the two.
The difference between the two methods is:
Using the best estimator only does the calibration across 5 splits (the default cv), and it uses the same, already chosen estimator in all 5 splits.
Using the grid search inside the calibration fits a grid search on each of the 5 calibration splits. On each split you are essentially doing cross-validation on 4/5 of the data, choosing the best estimator for that 4/5, and then calibrating that best estimator on the remaining 1/5. So you can end up with slightly different models on each calibration split, depending on what the grid search chooses, as the sketch below shows.
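If you want to see that effect directly, you can inspect the per-fold classifiers after fitting the second variant. A rough sketch, assuming the cal_clf fitted above; note that the name of the inner estimator attribute has changed across scikit-learn versions, hence the getattr fallback:

# Sketch: print the hyperparameters the inner grid search picked on each
# calibration split. The inner attribute is 'estimator' in recent scikit-learn
# releases and 'base_estimator' in older ones (hence the fallback).
for i, fold_clf in enumerate(cal_clf.calibrated_classifiers_):
    inner_grid = getattr(fold_clf, 'estimator', None)
    if inner_grid is None:
        inner_grid = getattr(fold_clf, 'base_estimator', None)
    print(f"split {i}: {inner_grid.best_params_}")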
I think the grid search and the calibration have different goals, so in my opinion I would work on each separately and go with the first way shown above: get the model that works best, and then feed that into the calibration.
However, I don't know your specific goals, so I can't say the second way described here is the WRONG way. You can always try both, see which gives you better performance, and go with the one that works best.
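If you do try both, a simple way to compare them is with a proper scoring rule such as log loss, ideally on a held-out set rather than the training data. A rough sketch reusing the two fitted models from above; for a binary problem you could also draw a reliability diagram with sklearn.calibration.calibration_curve:

# Rough comparison sketch: lower log loss generally indicates probabilities
# that are closer to the true class frequencies. This reuses calibration_clf
# and cal_clf from the two snippets above; ideally you would score on data
# that was not used for fitting.
from sklearn.metrics import log_loss

proba_best = calibration_clf.predict_proba(iris.data)
proba_grid = cal_clf.predict_proba(iris.data)
print("best_estimator_ then calibration:", log_loss(iris.target, proba_best))
print("grid search inside calibration:  ", log_loss(iris.target, proba_grid))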