I have a question regarding GridSearchCV:
by using this:
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=6, scoring="f1")
I specify that k-fold cross-validation should be used with 6 folds, right?
So that means my corpus is split into a training set and a test set 6 times.
Doesn't that mean that for the GridSearchCV I need to use my entire corpus, like so:
gs_clf = gs_clf.fit(corpus.data, corpus.target)
And if so, how would I then get my training set from there to use with the predict method?
predictions = gs_clf.predict(??)
I have seen code where the corpus is split into a test set and a training set using train_test_split, and then X_train and Y_train are passed to gs_clf.fit.
But that doesn't make sense to me: if I split the corpus beforehand, why use cross-validation again in the GridSearchCV?
Thanks for some clarification!!
GridSearchCV is not designed for measuring the performance of your model but for optimizing the hyper-parameters of your classifier while training. When you call gs_clf.fit, you are actually trying different models on your entire data (but on different folds) in pursuit of the best hyper-parameters. For example, if you have n different values of C and m different values of gamma for an SVM model, then you have n x m candidate models, and you are searching (grid-search) through them to see which one works best on your data.
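To make that grid concrete, here is a minimal sketch assuming an SVC classifier; the C and gamma values are made up purely for illustration:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# 3 values of C x 3 values of gamma = 9 candidate models
parameters = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
gs_clf = GridSearchCV(SVC(), parameters, n_jobs=-1, cv=6, scoring="f1")
# fitting this trains 9 x 6 = 54 models and keeps the best parameter combination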
Once the search is finished, gs_clf.best_params_ holds the best combination it found, and then you can use your test data to get the actual performance (e.g., accuracy, precision, ...) of your model. So split your corpus into corpus.train and corpus.test, and reserve corpus.test only for the last round, when you are done with training and you only want to evaluate the final model. That also answers your last question: the cross-validation inside GridSearchCV runs only on the training portion to pick the hyper-parameters, while the held-out test set gives an unbiased estimate of the final model. As we all know, any use of the test data in the process of training the model (where the training data should be used) or of tuning the hyper-parameters (where the validation data should be used) is considered cheating and results in unrealistic performance estimates.
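Putting the pieces together, a minimal sketch of the full workflow could look like this (corpus.data, corpus.target, pipeline and parameters are taken from your question; the test_size and random_state values are just examples):

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report

# hold out a test set that GridSearchCV never sees
X_train, X_test, y_train, y_test = train_test_split(
    corpus.data, corpus.target, test_size=0.2, random_state=0)

# 6-fold cross-validation happens only inside the training portion
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=6, scoring="f1")
gs_clf.fit(X_train, y_train)
print(gs_clf.best_params_)

# final, one-time evaluation on the untouched test set
predictions = gs_clf.predict(X_test)
print(classification_report(y_test, predictions))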