
How to add oversampling/undersampling procedure in scikit's Pipeline?

I would like to add an oversampling procedure, such as SMOTE, to scikit-learn's Pipeline. However, transformers only support the fit and transform methods, and these provide no way to increase the number of samples and targets.

One possible way to do this is to break the pipeline into two separate pipelines connected by SMOTE sampling.

Are there any better solutions?

Shuai Zhang asked Mar 29 '15 14:03



People also ask

Can SMOTE be used in a pipeline?

SMOTE can be used as a step in a pipeline, but the pipeline has to be an 'imblearn' pipeline rather than a 'Scikit-Learn' one. SMOTE does not have a 'fit_transform' method (it resamples through 'fit_resample' instead), so it cannot be used in a 'Scikit-Learn' pipeline.

How do you oversample and undersample?

Random oversampling involves randomly selecting examples from the minority class, with replacement, and adding them to the training dataset. Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset.
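The two procedures can be sketched with the standard library alone. The function names `random_oversample` and `random_undersample` are hypothetical helpers written for this illustration, not functions from any library, and the sketch assumes the goal is to equalise class counts.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows (with replacement) until every
    class is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            i = rng.choice(idx)
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset (without replacement) of each class so that
    every class is as small as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 8 majority samples (label 0) and 2 minority samples (label 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over))   # Counter({0: 8, 1: 8})
print(Counter(y_under))  # Counter({0: 2, 1: 2})
```

Both helpers return a new X and a new y together, which is exactly what a transform method that returns only X cannot do.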

Why is SMOTE better than random oversampling?

Imbalanced data is a problem when building a predictive machine learning model. One way to alleviate this problem is to oversample the minority class. Instead of oversampling by replicating existing minority samples, SMOTE creates synthetic samples by interpolating between a minority sample and its nearest minority-class neighbours.
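The core idea of generating synthetic minority samples can be sketched as follows. This is a simplified illustration, not the imbalanced-learn implementation: the helper `smote_like`, its parameters, and the toy points are assumptions, and it needs Python 3.8+ for `math.dist`.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the line segment
    between a minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            [b + gap * (n - b) for b, n in zip(base, neighbour)]
        )
    return synthetic


# Toy minority class with four 2-D points.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]]
new_points = smote_like(minority, n_new=5)
```

Unlike random oversampling, the new points are not exact copies of existing rows, although they always stay inside the region spanned by the original minority samples.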


1 Answer

Our current Pipeline does not support changing the number of samples between steps, because the Transformer.transform method does not return the y argument, which would also need to be resampled. This is a known limitation of the current design. It might be fixed in a future version, but we have not started working on that yet.
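A toy sketch of why this matters, using plain Python rather than scikit-learn itself. `DuplicateFirstRow` and `run_transform_only` are hypothetical names invented for this illustration; they only mimic the shape of a transform-only pipeline.

```python
class DuplicateFirstRow:
    """Hypothetical transformer that tries to add a sample in transform()."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Only X is available and only X is returned, so there is no way
        # to extend y to match the extra row.
        return X + [X[0]]


def run_transform_only(steps, X, y):
    """Toy stand-in for a transform-only pipeline: y passes through as is."""
    for step in steps:
        X = step.fit(X, y).transform(X)
    return X, y


X = [[0.0], [1.0], [2.0]]
y = [0, 0, 1]

X_out, y_out = run_transform_only([DuplicateFirstRow()], X, y)
print(len(X_out), len(y_out))  # 4 3
```

After the step, X has four rows while y still has three labels, so a final estimator could not be fitted on the pair.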

ogrisel answered Nov 15 '22 08:11
