Nested cross-validation: How does cross_validate handle GridSearchCV as its input estimator?

The following code combines cross_validate with GridSearchCV to perform a nested cross-validation for an SVC on the iris dataset.

(Modified example of the following documentation page: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py.)


from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_validate, KFold
import numpy as np
np.set_printoptions(precision=2)

# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10],
          "gamma": [.01, .1]}

# We will use a Support Vector Classifier with "rbf" kernel
svm = SVC(kernel="rbf")

# Choose techniques for the inner and outer loop of nested cross-validation
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

# Perform nested cross-validation
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv, iid=False)
clf.fit(X_iris, y_iris)
best_estimator = clf.best_estimator_

cv_dic = cross_validate(clf, X_iris, y_iris, cv=outer_cv, scoring=['accuracy'], return_estimator=False, return_train_score=True)
mean_val_score = cv_dic['test_accuracy'].mean()

print('nested_train_scores: ', cv_dic['train_accuracy'])
print('nested_val_scores:   ', cv_dic['test_accuracy'])
print('mean score:            {0:.2f}'.format(mean_val_score))

cross_validate splits the data set into a training set and a test set in each fold. The input estimator is then trained on the training set associated with that fold. The estimator passed in here is clf, a parameterized GridSearchCV estimator, i.e. an estimator that itself performs cross-validation again.
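For illustration, here is a minimal sketch of the outer splitting as I understand it (this snippet is only an illustration of the fold sizes, not part of the actual pipeline above):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

# Sketch: the 4 outer folds that cross_validate iterates over; the estimator is
# fitted on the training part of each fold and scored on the test part.
X_iris, y_iris = load_iris(return_X_y=True)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)
for fold, (train_idx, test_idx) in enumerate(outer_cv.split(X_iris)):
    print("outer fold", fold, ":", len(train_idx), "train /", len(test_idx), "test samples")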

I have three questions about the whole thing:

  1. If clf is used as the estimator for cross_validate, does it (in the course of the GridSearchCV cross-validation) split the above-mentioned training set into a sub-training set and a validation set in order to determine the best hyperparameter combination?
  2. Out of all models tested via GridSearchCV, does cross_validate validate only the model stored in the best_estimator_ attribute?
  3. Does cross_validate train a model at all (if so, why?) or is the model stored in best_estimator_ validated directly via the test set?

To make it clearer what I mean by these questions, here is an illustration of how I currently picture the double cross-validation.

[Illustration: diagram of the nested (double) cross-validation as imagined by the author]

asked Mar 06 '19 by zwithouta

People also ask

How does cross-validation work in GridSearchCV?

Cross-validation is used while training the model. Before training the model with data, we divide the data into two parts: train data and test data. In cross-validation, the train data is divided further into two parts: the train data and the validation data.
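For example (a minimal sketch using the usual scikit-learn iris dataset and an SVC, not tied to the question's exact code):

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_val_score

# Split off a held-out test set first, then let cross-validation carve the
# remaining train data into train/validation parts for each of the 5 folds.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scores = cross_val_score(SVC(kernel="rbf"), X_train, y_train, cv=5)
print(scores)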

What is a drawback with using nested cross-validation?

The problem is that if this score alone is used to then select a model, or the same dataset is used to evaluate the tuned models, then the selection process will be biased by this inadvertent overfitting. The result is an overly optimistic estimate of model performance that does not generalize to new data.

What is nested cross-validation?

Nested cross-validation (CV) is often used to train a model in which hyperparameters also need to be optimized. Nested CV estimates the generalization error of the underlying model and its (hyper)parameter search.

Does grid search do cross-validation?

The Scikit-Learn library comes with a grid search cross-validation implementation. GridSearchCV tries every combination in the parameter grid for a model and returns the set of parameters with the best performance score.
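For example (a minimal sketch; ParameterGrid is the public scikit-learn helper that enumerates the same combinations GridSearchCV evaluates, shown here only to list the candidates):

from sklearn.model_selection import ParameterGrid

# The question's grid yields 2 x 2 = 4 candidate parameter settings; GridSearchCV
# scores each of them via cross-validation and keeps the best one.
p_grid = {"C": [1, 10], "gamma": [.01, .1]}
for params in ParameterGrid(p_grid):
    print(params)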




1 Answer

If clf is used as the estimator for cross_validate, does it split the above mentioned training set into a subtraining set and a validation set in order to determine the best hyper parameter combination?

Yes, as you can see here at line 230, the training set is again split into a sub-training set and a validation set (specifically at line 240).

Update: Yes, when you pass the GridSearchCV classifier into cross_validate, it will again split the training set into a train and a test set. Here is a link describing this in more detail. Your diagram and assumption are correct.
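To see the nested splitting concretely, here is a rough sketch of my own (it only mimics the index handling, it is not scikit-learn's internal code): take the first outer training fold and let the inner KFold split that subset again into sub-training and validation parts.

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X_iris, _ = load_iris(return_X_y=True)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

# First outer fold: this training part is all that GridSearchCV gets to see.
outer_train_idx, outer_test_idx = next(iter(outer_cv.split(X_iris)))
X_outer_train = X_iris[outer_train_idx]
# The inner CV then splits that subset again for the hyperparameter search.
for sub_train_idx, val_idx in inner_cv.split(X_outer_train):
    print(len(sub_train_idx), "sub-training /", len(val_idx), "validation samples")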

Out of all models tested via GridSearchCV, does cross_validate train & validate only the model stored in the variable best_estimator?

Yes, as you can see from the answers here and here, GridSearchCV returns the best_estimator_ in your case (since the refit parameter is True by default). However, this best estimator still has to be trained again.
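As a quick sketch of the refit behaviour (my own minimal example reusing the question's grid, not the original code):

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, KFold

X_iris, y_iris = load_iris(return_X_y=True)
clf = GridSearchCV(SVC(kernel="rbf"),
                   {"C": [1, 10], "gamma": [.01, .1]},
                   cv=KFold(n_splits=5, shuffle=True, random_state=1))
clf.fit(X_iris, y_iris)
# With refit=True (the default) the winning combination is refit on all the
# data passed to fit() and exposed as best_estimator_.
print(clf.best_params_)
print(clf.best_estimator_)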

Does cross_validate train a model at all (if so, why?) or is the model stored in best_estimator_ validated directly via the test set?

As for your third and final question: yes, it trains an estimator and returns it if return_estimator is set to True (see this line). This makes sense, since how else would it return the scores without training an estimator in the first place?
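For example (a minimal sketch, not the original code): with return_estimator=True you get back the GridSearchCV that was fitted in each outer fold, and you can check that each fold may even pick different hyperparameters.

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

X_iris, y_iris = load_iris(return_X_y=True)
clf = GridSearchCV(SVC(kernel="rbf"), {"C": [1, 10], "gamma": [.01, .1]},
                   cv=KFold(n_splits=5, shuffle=True, random_state=1))
cv_dic = cross_validate(clf, X_iris, y_iris,
                        cv=KFold(n_splits=4, shuffle=True, random_state=1),
                        return_estimator=True)
# One fitted GridSearchCV per outer fold; the chosen parameters can differ per fold.
for fold_clf in cv_dic["estimator"]:
    print(fold_clf.best_params_)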

Update: The reason the model is trained again is that the default use case for cross_validate does not assume that you pass in the best classifier with the optimal parameters. In this case specifically, you are passing in a classifier from GridSearchCV, but if you pass in any untrained classifier it is supposed to be trained. What I mean is that, yes, in your case it shouldn't need to train it again, since you are already doing cross-validation with GridSearchCV and using the best estimator. However, there is no way for cross_validate to know this, so it assumes that you are passing in an un-optimized, or rather untrained, estimator; it therefore has to train it again and return the scores for it.
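Roughly speaking (this is my own paraphrase, not the library source), what cross_validate does per outer fold is equivalent to:

from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, KFold

X_iris, y_iris = load_iris(return_X_y=True)
clf = GridSearchCV(SVC(kernel="rbf"), {"C": [1, 10], "gamma": [.01, .1]},
                   cv=KFold(n_splits=5, shuffle=True, random_state=1))
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

for train_idx, test_idx in outer_cv.split(X_iris):
    fold_clf = clone(clf)                               # unfitted copy; any earlier clf.fit() is ignored
    fold_clf.fit(X_iris[train_idx], y_iris[train_idx])  # the inner grid search runs here
    print(fold_clf.score(X_iris[test_idx], y_iris[test_idx]))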

answered Oct 02 '22 by Gambit1614