I have split my data into train/test sets before doing cross-validation on the training data to validate my hyperparameters. I have an unbalanced dataset and want to perform SMOTE oversampling on each iteration, so I have set up a pipeline using imblearn.
My understanding is that oversampling should be done after dividing the data into k folds to prevent information leakage. Is this order of operations (data split into k folds, k-1 folds oversampled, prediction on the remaining fold) preserved when using Pipeline in the setup below?
import numpy as np
import xgboost as xgb
from scipy import stats
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', xgb.XGBClassifier())
])

param_dist = {'classification__n_estimators': stats.randint(50, 500),
              'classification__learning_rate': stats.uniform(0.01, 0.3),
              'classification__subsample': stats.uniform(0.3, 0.6),
              'classification__max_depth': [3, 4, 5, 6, 7, 8, 9],
              'classification__colsample_bytree': stats.uniform(0.5, 0.5),
              'classification__min_child_weight': [1, 2, 3, 4],
              # 'ratio' was renamed to 'sampling_strategy' in newer imbalanced-learn releases
              'sampling__ratio': np.linspace(0.25, 0.5, 10)}

random_search = RandomizedSearchCV(model,
                                   param_dist,
                                   cv=StratifiedKFold(n_splits=5),
                                   n_iter=10,
                                   scoring=scorer_cv_cost_savings)  # custom scorer defined elsewhere
random_search.fit(X_train.values, y_train)
Always split into train and test sets BEFORE applying oversampling techniques. Oversampling before splitting can put the exact same observations into both the training and the test set, so the test score stops being an estimate of generalisation.
If you are going to use SMOTE, it should only be applied to the training data. SMOTE is there to improve operational performance, while the validation and test sets exist to provide an estimate of that operational performance, so they must be left untouched.
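As a minimal sketch of that order of operations (assuming a generic feature matrix X and label vector y, not the variables from the question), the split comes first and SMOTE only ever sees the training portion:

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# hold out a test set first; stratify so both sets keep the original class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# oversample the training data only; X_test / y_test stay untouched
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)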
It is true that SMOTE has no transform/fit_transform method, only fit_resample, so it cannot be placed inside a plain scikit-learn Pipeline. That is exactly why imblearn provides its own Pipeline, which knows how to call samplers during fit.
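A quick way to see the difference on toy data (make_classification is only used here to get an imbalanced example; the exact error text comes from scikit-learn's step validation and may vary between versions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline as SkPipeline
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
steps = [('sampling', SMOTE()), ('clf', LogisticRegression(max_iter=1000))]

# scikit-learn's Pipeline rejects SMOTE because it has no transform method
try:
    SkPipeline(steps).fit(X, y)
except TypeError as err:
    print('sklearn Pipeline:', err)

# imblearn's Pipeline accepts samplers and calls fit_resample during fit only
ImbPipeline(steps).fit(X, y)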
One caveat: SMOTE does not take neighbouring examples from other classes into account when generating synthetic samples, which can increase class overlap and noise, and this gets worse on high-dimensional data. It is one reason some practitioners avoid SMOTE in those settings.
Your understanding is right. When you pass the pipeline as the model, .fit() is called on the k-1 training folds and the remaining k-th fold is used only for prediction, so the sampling step is applied to the training folds alone.
The documentation for imblearn.pipeline's fit() says:
Fit the model.
Fit all the transforms/samplers one after the other and transform/sample the data, then fit the transformed/sampled data using the final estimator.
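For intuition, here is roughly what a single hyper-parameter candidate goes through inside the cross-validation loop. This is a manual sketch of the same order of operations, not the library's internal code; it reuses X_train and y_train from the snippet above and converts them to arrays for fold indexing.

X_arr, y_arr = np.asarray(X_train), np.asarray(y_train)

skf = StratifiedKFold(n_splits=5)
for train_idx, valid_idx in skf.split(X_arr, y_arr):
    X_tr, y_tr = X_arr[train_idx], y_arr[train_idx]   # the k-1 training folds
    X_va, y_va = X_arr[valid_idx], y_arr[valid_idx]   # the held-out fold

    # SMOTE is fitted and applied to the training folds only
    X_res, y_res = SMOTE().fit_resample(X_tr, y_tr)

    # the classifier is trained on the resampled data ...
    clf = xgb.XGBClassifier().fit(X_res, y_res)

    # ... and evaluated on the untouched validation fold
    print(clf.score(X_va, y_va))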