
Does oversampling happen before or after cross-validation using imblearn pipelines?

I have split my data into train/test before doing cross-validation on the training data to validate my hyperparameters. I have an unbalanced dataset and want to perform SMOTE oversampling on each iteration, so I have established a pipeline using imblearn.

My understanding is that oversampling should be done after dividing the data into k-folds to prevent information leaking. Is this order of operations (data split into k-folds, k-1 folds oversampled, predict on remaining fold) preserved when using Pipeline in the setup below?

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from scipy import stats
import numpy as np
import xgboost as xgb

model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', xgb.XGBClassifier())
])


param_dist = {'classification__n_estimators': stats.randint(50, 500),
              'classification__learning_rate': stats.uniform(0.01, 0.3),
              'classification__subsample': stats.uniform(0.3, 0.6),
              'classification__max_depth': [3, 4, 5, 6, 7, 8, 9],
              'classification__colsample_bytree': stats.uniform(0.5, 0.5),
              'classification__min_child_weight': [1, 2, 3, 4],
              'sampling__ratio': np.linspace(0.25, 0.5, 10)  # 'ratio' was renamed 'sampling_strategy' in newer imblearn versions
             }

random_search = RandomizedSearchCV(model,
                                   param_dist,
                                   cv=StratifiedKFold(n_splits=5),
                                   n_iter=10,
                                   scoring=scorer_cv_cost_savings)
random_search.fit(X_train.values, y_train)
Asked May 06 '19 by TomNash

People also ask

Would you do the train test split before or after the oversampling for data with an imbalanced class?

Always split into train and test sets BEFORE applying oversampling techniques. Oversampling before splitting can place exact copies of the same observations in both the train and test sets, leaking information into the evaluation.
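The leakage described above can be demonstrated without any oversampling library: naively duplicating minority rows before the split puts byte-identical observations on both sides. A minimal sketch on hypothetical random data, using plain NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: 90 majority rows, 10 minority rows.
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# Naive oversampling BEFORE the split: duplicate minority rows.
minority_idx = np.where(y == 1)[0]
dup_idx = rng.choice(minority_idx, size=80, replace=True)
X_over = np.vstack([X, X[dup_idx]])
y_over = np.concatenate([y, y[dup_idx]])

# Shuffle and split 80/20.
order = rng.permutation(len(y_over))
cut = int(0.8 * len(order))
train_idx, test_idx = order[:cut], order[cut:]

# Count test rows that are byte-identical to some training row: leakage.
train_rows = {X_over[i].tobytes() for i in train_idx}
leaked = sum(X_over[i].tobytes() in train_rows for i in test_idx)
print(leaked)  # almost certainly > 0: the test set contains training rows
```

Splitting first and oversampling only the training portion avoids this: the test rows stay originals that were never duplicated into the training set.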

When should SMOTE be applied?

If you are going to use SMOTE, it should only be applied to the training data. This is because you are using SMOTE to gain an improvement in operational performance, and both the validation and test sets are there to provide an estimate of operational performance.

Can SMOTE be used in a pipeline?

SMOTE implements 'fit_resample' rather than 'fit_transform', so it cannot be used in a scikit-learn Pipeline. However, imblearn provides its own Pipeline that accepts samplers as intermediate steps.

Should SMOTE be used before the train/test split?

SMOTE does not take into account neighboring examples from other classes when generating synthetic examples, which can increase class overlap and noise. This is especially bad if you have a high-dimensional dataset. So no: split first, then apply SMOTE only to the training data.


1 Answer

Your understanding is correct. When you pass the pipeline as the model, each cross-validation iteration calls .fit() on the k-1 training folds and evaluates on the held-out fold. SMOTE is applied only inside .fit(), so only those k-1 training folds are oversampled; the held-out fold is predicted on unchanged.

The documentation for imblearn.pipeline.Pipeline.fit() says:

Fit the model

Fit all the transforms/samplers one after the other and transform/sample the data, then fit the transformed/sampled data using the final estimator.

Answered Sep 27 '22 by Venkatachalam