
Does oversampling happen before or after cross-validation using imblearn pipelines?

I have split my data into train/test before doing cross-validation on the training data to validate my hyperparameters. I have an unbalanced dataset and want to perform SMOTE oversampling on each iteration, so I have established a pipeline using imblearn.

My understanding is that oversampling should be done after dividing the data into k-folds to prevent information leaking. Is this order of operations (data split into k-folds, k-1 folds oversampled, predict on remaining fold) preserved when using Pipeline in the setup below?

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from scipy import stats
import numpy as np
import xgboost as xgb

model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', xgb.XGBClassifier())
])


param_dist = {'classification__n_estimators': stats.randint(50, 500),
              'classification__learning_rate': stats.uniform(0.01, 0.3),
              'classification__subsample': stats.uniform(0.3, 0.6),
              'classification__max_depth': [3, 4, 5, 6, 7, 8, 9],
              'classification__colsample_bytree': stats.uniform(0.5, 0.5),
              'classification__min_child_weight': [1, 2, 3, 4],
              'sampling__ratio': np.linspace(0.25, 0.5, 10)  # 'ratio' was renamed 'sampling_strategy' in newer imblearn versions
             }

random_search = RandomizedSearchCV(model,
                                   param_dist,
                                   cv=StratifiedKFold(n_splits=5),
                                   n_iter=10,
                                   scoring=scorer_cv_cost_savings)
random_search.fit(X_train.values, y_train)
Asked May 06 '19 by TomNash

People also ask

Would you do the train test split before or after the oversampling for data with an imbalanced class?

Always split into train and test sets BEFORE applying oversampling techniques. Oversampling before splitting can place exact copies of the same observations in both the train and test sets, leaking information into the evaluation.
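The leakage described above can be demonstrated without any oversampling library: naively duplicating minority rows before the split puts byte-identical observations on both sides. A minimal sketch on hypothetical random data, using plain NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: 90 majority rows, 10 minority rows.
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# Naive oversampling BEFORE the split: duplicate minority rows.
minority_idx = np.where(y == 1)[0]
dup_idx = rng.choice(minority_idx, size=80, replace=True)
X_over = np.vstack([X, X[dup_idx]])
y_over = np.concatenate([y, y[dup_idx]])

# Shuffle and split 80/20.
order = rng.permutation(len(y_over))
cut = int(0.8 * len(order))
train_idx, test_idx = order[:cut], order[cut:]

# Count test rows that are byte-identical to some training row: leakage.
train_rows = {X_over[i].tobytes() for i in train_idx}
leaked = sum(X_over[i].tobytes() in train_rows for i in test_idx)
print(leaked)  # almost certainly > 0: the test set contains training rows
```

Splitting first and oversampling only the training portion avoids this: the test rows stay originals that were never duplicated into the training set.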

When should SMOTE be applied?

If you are going to use SMOTE, it should only be applied to the training data. This is because you are using SMOTE to gain an improvement in operational performance, and both the validation and test sets are there to provide an estimate of operational performance.

Can SMOTE be used in a pipeline?

SMOTE implements 'fit_resample' rather than 'fit_transform', so it cannot be used in a scikit-learn Pipeline. However, imblearn provides its own Pipeline that accepts samplers as intermediate steps.

Should SMOTE be used before the train/test split?

SMOTE does not take into account neighboring examples from other classes when generating synthetic examples, which can increase class overlap and noise. This is especially bad if you have a high-dimensional dataset. So no: split first, then apply SMOTE only to the training data.


1 Answer

Your understanding is correct. When you pass the pipeline as the model, each cross-validation iteration calls .fit() on the k-1 training folds and evaluates on the held-out fold. SMOTE is applied only inside .fit(), so only those k-1 training folds are oversampled; the held-out fold is predicted on unchanged.

The documentation for imblearn.pipeline.Pipeline.fit() says:

Fit the model

Fit all the transforms/samplers one after the other and transform/sample the data, then fit the transformed/sampled data using the final estimator.

Answered Sep 27 '22 by Venkatachalam