Using SMOTE with GridSearchCV in scikit-learn

I'm dealing with an imbalanced dataset and want to run a grid search to tune my model's parameters using scikit-learn's GridSearchCV. To oversample the data I want to use SMOTE, and I know I can include it as a step of a pipeline and pass that to GridSearchCV. My concern is that SMOTE will be applied to both the training and validation folds, which is not what you are supposed to do: the validation set should not be oversampled. Am I right that the whole pipeline will be applied to both dataset splits? If so, how can I work around this? Thanks a lot in advance.

Ehsan M asked May 09 '18 04:05




1 Answer

Yes, it can be done, but with imblearn Pipeline.

You see, imblearn has its own Pipeline to handle the samplers correctly. I described this in a similar question here.

When predict() is called on an imblearn.pipeline.Pipeline object, it skips the sampling step and passes the data unchanged to the next transformer. You can confirm that by looking at the source code here (newer imblearn releases name this method fit_resample instead of fit_sample):

        if hasattr(transform, "fit_sample"):
            pass
        else:
            Xt = transform.transform(Xt)
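To make the skip logic quoted above concrete, here is a toy sketch in plain Python. It is not imblearn's implementation: ToyPipeline, DuplicatingSampler, AddOne and MajorityClassifier are hypothetical names made up for illustration, and the sketch assumes the newer method name fit_resample. It only shows the idea that sampler steps run during fit and are bypassed during predict.

```python
# Toy illustration only: every class here is a hypothetical stand-in,
# none of them is part of imblearn or scikit-learn.

class DuplicatingSampler:
    """Stands in for a sampler such as SMOTE: it changes the number of rows."""

    def fit_resample(self, X, y):
        # Naively duplicate every row (real samplers synthesize new points).
        return X + X, y + y


class AddOne:
    """Stands in for an ordinary transformer: changes values, not row count."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x + 1 for x in X]


class MajorityClassifier:
    """Final estimator: always predicts the most common training label."""

    def fit(self, X, y):
        self.n_seen_ = len(X)  # remember how many rows reached the classifier
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for step in self.steps[:-1]:
            if hasattr(step, "fit_resample"):
                X, y = step.fit_resample(X, y)  # samplers run during fit
            else:
                X = step.fit(X, y).transform(X)
        self.steps[-1].fit(X, y)
        return self

    def predict(self, X):
        for step in self.steps[:-1]:
            if hasattr(step, "fit_resample"):
                continue  # samplers are skipped at predict time
            X = step.transform(X)
        return self.steps[-1].predict(X)


pipe = ToyPipeline([DuplicatingSampler(), AddOne(), MajorityClassifier()])
pipe.fit([1, 2, 3], [0, 0, 1])
predictions = pipe.predict([10, 20])

print(pipe.steps[-1].n_seen_)  # rows seen by the classifier during fit
print(len(predictions))        # rows returned at predict time
```

The classifier is fitted on the resampled rows, while predict returns exactly one prediction per input row. That is the behaviour you want for validation folds in a grid search.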

So for this to work correctly, you need the following:

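    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # imblearn's Pipeline applies SMOTE while fitting, not while predicting
    model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', LogisticRegression())
    ])

    # params is your parameter grid; the remaining arguments are elided here
    grid = GridSearchCV(model, params, ...)
    grid.fit(X, y)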

Fill in the details as necessary, and the pipeline will take care of the rest.

Vivek Kumar answered Sep 22 '22 21:09
