As part of the Enron project, I built the attached model. Below is a summary of the steps.

The first approach gives near-perfect scores:

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)  # fit on the full dataset
for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]
    gcv.best_estimator_.predict(x_test)

The second approach, which refits the best estimator on the training split before predicting, gives more reasonable (lower) scores:

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)  # fit on the full dataset
for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]
    gcv.best_estimator_.fit(x_train, y_train)
    gcv.best_estimator_.predict(x_test)
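For context, pipe and clf_params are not shown above; here is a minimal sketch of what they could look like, with toy data standing in for the Enron features/labels (the pipeline steps, k values and SVC grid are assumptions, not the original setup):

from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

# Toy stand-in for features/labels; note that the indexing with
# train_ind/test_ind above requires numpy arrays, not plain Python lists.
features, labels = make_classification(n_samples=150, n_features=12,
                                        weights=[0.85], random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(f_classif)),
    ('svm', SVC()),
])

clf_params = {
    'select__k': [5, 10, 'all'],
    'svm__C': [1, 10, 100],
    'svm__gamma': [0.01, 0.1],
}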
To summarize the steps:
Used SelectKBest to get the feature scores, sorted the features, and tried combinations of higher- and lower-scoring features.
Used an SVM inside a GridSearchCV with a StratifiedShuffleSplit.
Used best_estimator_ to predict and calculate precision and recall.
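For the feature-selection step, a small sketch of how the SelectKBest scores can be inspected and sorted (assuming features/labels as in the sketch above):

from sklearn.feature_selection import SelectKBest, f_classif

# Fit SelectKBest on everything just to look at the univariate scores,
# then rank the feature indices from highest to lowest score.
selector = SelectKBest(f_classif, k='all').fit(features, labels)
ranking = sorted(enumerate(selector.scores_), key=lambda p: p[1], reverse=True)
for idx, score in ranking:
    print(f'feature {idx}: score {score:.2f}')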
The problem is that the first estimator spits out near-perfect scores, in some cases exactly 1. But when I refit the best classifier on the training data and then run the test, it gives reasonable scores.

My question is: what exactly does GridSearchCV do with the test data after splitting with the ShuffleSplit object we pass in? I assumed it would not fit anything on the test data; if that were true, then when I predict on that same test data it should not give such high scores, right? Since I used a fixed random_state, the StratifiedShuffleSplit should have produced the same splits for the grid search fit and for my predictions.

So, is using the same StratifiedShuffleSplit for both wrong?
GridSearchCV, as @Gauthier Feuillen said, is used to search for the best parameters of an estimator on the given data. A description of what

gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)

actually does:

1. clf_params is expanded into all possible parameter combinations using ParameterGrid.
2. features is split into features_train and features_test using cv. The same happens for labels.
3. For each parameter combination, the estimator (pipe) is trained on features_train/labels_train and scored on features_test/labels_test.
4. Steps 2-3 are repeated for each of the cv iterations. The average score across the cv iterations is assigned to that parameter combination; this can be accessed through the cv_results_ attribute of the grid search.
5. For the parameters that give the best score, the internal estimator is re-initialized with those parameters and refit on the whole data passed in (features and labels), since refit=True by default.

Because of this last step, you get different scores in the first and second approach: in the first approach all the data was used for the final training and you are predicting on that same data, while the second approach predicts on previously unseen data. A small sketch of the refit behaviour follows below.
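A minimal sketch of that refit behaviour with toy data (the dataset and parameter grid are assumptions, not the Enron setup): after gcv.fit, best_estimator_ has already been trained on all of features/labels, so predicting on a split drawn from that same data is optimistic.

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score

features, labels = make_classification(n_samples=200, n_features=10,
                                        weights=[0.85], random_state=0)

cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
gcv = GridSearchCV(SVC(), {'C': [1, 10], 'gamma': [0.01, 0.1]}, cv=cv)
gcv.fit(features, labels)   # best_estimator_ is refit on the FULL dataset here

train_ind, test_ind = next(cv.split(features, labels))
x_train, x_test = features[train_ind], features[test_ind]
y_train, y_test = labels[train_ind], labels[test_ind]

# First approach: these test points were part of best_estimator_'s final fit
pred_seen = gcv.best_estimator_.predict(x_test)
print('seen data   -> precision %.2f, recall %.2f'
      % (precision_score(y_test, pred_seen), recall_score(y_test, pred_seen)))

# Second approach: refit on the training split only, then predict unseen points
pred_unseen = gcv.best_estimator_.fit(x_train, y_train).predict(x_test)
print('unseen data -> precision %.2f, recall %.2f'
      % (precision_score(y_test, pred_unseen), recall_score(y_test, pred_unseen)))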
Basically the grid search will:

- try every combination of your parameter grid,
- cross-validate each of them with the cv you pass in,
- and select the best one (then refit it on all the data given to fit).

So your second case is the good one. Otherwise you are actually predicting on data that you trained with (which is not the case in the second option, where you only keep the best parameters from your grid search). A sketch of a cleaner evaluation pattern follows below.
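A sketch of that cleaner pattern, with toy data and an assumed parameter grid: hold out a test set first, run the grid search (and its internal refit) on the training part only, then score the chosen model on data that never entered the search.

from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import (GridSearchCV, StratifiedShuffleSplit,
                                     train_test_split)
from sklearn.svm import SVC

features, labels = make_classification(n_samples=200, n_features=10,
                                        weights=[0.85], random_state=0)

# Hold out a test set before any tuning happens.
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42)

cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
gcv = GridSearchCV(SVC(), {'C': [1, 10], 'gamma': [0.01, 0.1]}, cv=cv)
gcv.fit(x_train, y_train)   # the search (and its final refit) sees only x_train

# Evaluate on data the search never touched.
print(classification_report(y_test, gcv.best_estimator_.predict(x_test)))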