
GridSearchCV: How to specify test set?

I have a question regarding GridSearchCV:

by using this:

gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=6, scoring="f1")

I specify that k-fold cross-validation should be used with 6 folds, right?

So that means that my corpus is split into a training set and a test set 6 times.

Doesn't that mean that for the GridSearchCV I need to use my entire corpus, like so:

gs_clf = gs_clf.fit(corpus.data, corpus.target)

And if so, how would I then get my training set from there to use with the predict method?

predictions = gs_clf.predict(??)

I have seen code where the corpus is split into a test set and a training set using train_test_split, and then X_train and Y_train are passed to gs_clf.fit.

But that doesn't make sense to me: if I split the corpus beforehand, why use cross-validation again in GridSearchCV?

Thanks for some clarification!!

asked Nov 11 '16 by user3629892

1 Answer

  1. GridSearchCV is not designed for measuring the performance of your model, but for optimizing the hyper-parameters of the classifier during training. When you call gs_clf.fit, you are actually trying different models on your entire data (across different folds) in pursuit of the best hyper-parameters. For example, if you have n different values of C and m different values of gamma for an SVM model, then you have n x m candidate models, and the grid search tries each of them to see which one works best on your data (see the sketch after this list).
  2. Once you have found the best model via gs_clf.best_params_, you can use your test data to measure the actual performance (e.g., accuracy, precision, ...) of your model.
  3. Of course, only then is it time for testing the model. Your test data must not overlap with the data you trained your model on. For instance, you should have something like corpus.train and corpus.test, and you should reserve corpus.test for the very last round, when you are done with training and only want to test the final model.
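
To make point 1 concrete, here is a minimal sketch (the values of C and gamma below are made up for illustration, not taken from the question): with 3 values of C and 2 values of gamma, the grid search evaluates 3 x 2 = 6 candidate models.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# 3 values of C times 2 values of gamma = 6 candidate models,
# each scored by 6-fold cross-validation on the training data.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}
gs = GridSearchCV(SVC(), param_grid, cv=6, scoring="f1", n_jobs=-1)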

As we all know, any use of test data while training the model (where training data should be used) or while tuning the hyper-parameters (where validation data should be used) is considered cheating and leads to unrealistically optimistic performance estimates.
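
Putting it all together, here is a minimal sketch of the whole workflow, assuming a text-classification setup like the one in the question (the dataset, pipeline steps, and parameter grid are placeholders, not the asker's actual code):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Hold out a test set first; GridSearchCV must never see it.
corpus = fetch_20newsgroups(subset="all", categories=["sci.med", "sci.space"])
X_train, X_test, y_train, y_test = train_test_split(
    corpus.data, corpus.target, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
parameters = {"clf__C": [0.1, 1, 10]}

# The 6-fold cross-validation runs inside the training set only:
# each candidate is trained on 5 folds and validated on the 6th.
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=6, scoring="f1")
gs_clf.fit(X_train, y_train)
print(gs_clf.best_params_)

# With refit=True (the default), gs_clf is refit on all of X_train
# using the best parameters, so predict uses that final model.
predictions = gs_clf.predict(X_test)
print(classification_report(y_test, predictions))

So train_test_split and GridSearchCV are not redundant: the outer split reserves a test set for the final evaluation, while the cross-validation inside GridSearchCV carves validation folds out of the training set to pick the hyper-parameters.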

answered Nov 20 '22 by Azim