Class weights vs under/oversampling

In imbalanced classification (with scikit-learn) what would be the difference of balancing classes (i.e. set class_weight to balanced) to oversampling with SMOTE for example? What would be the expected effects of one vs the other?

Mario L asked Apr 12 '19 18:04

People also ask

Is it better to oversample or undersample?

Oversampling methods duplicate or create new synthetic examples in the minority class, whereas undersampling methods delete or merge examples in the majority class. Both types of resampling can be effective when used in isolation, although can be more effective when both types of methods are used together.

When should you oversample?

In extreme cases where the number of observations in the rare class(es) is really small, oversampling is better, as you will not lose important information on the distribution of the other classes in the dataset.

What is the problem with oversampling?

It doesn't lead to any loss of information, and in some cases, may perform better than undersampling. But oversampling isn't perfect either. Because oversampling often involves replicating minority events, it can lead to overfitting.

Is oversampling good in machine learning?

For machine learning algorithms affected by a skewed class distribution, such as artificial neural networks and SVMs, random oversampling can be a highly effective technique.


1 Answer

Class weights directly modify the loss function by giving more (or less) penalty to the classes with more (or less) weight. In effect, one is basically sacrificing some ability to predict the lower weight class (the majority class for unbalanced datasets) by purposely biasing the model to favor more accurate predictions of the higher weighted class (the minority class).
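As a minimal sketch of the class-weight approach, the snippet below fits two logistic regressions on a synthetic imbalanced dataset, one with scikit-learn's `class_weight="balanced"` and one without (the dataset parameters here are illustrative, not from the question):

```python
# Sketch: default vs class-weighted logistic regression on an
# imbalanced synthetic dataset (95/5 class split).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced",
                              max_iter=1000).fit(X_tr, y_tr)

# The weighted model typically trades some majority-class accuracy
# for better recall on the minority class (label 1 here).
print("minority recall, plain:   ",
      recall_score(y_te, plain.predict(X_te)))
print("minority recall, balanced:",
      recall_score(y_te, weighted.predict(X_te)))
```

With `class_weight="balanced"`, scikit-learn scales each class's loss contribution by `n_samples / (n_classes * n_class_samples)`, which is exactly the loss-function modification described above.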

Oversampling and undersampling methods essentially give more weight to particular classes as well (duplicating observations duplicates the penalty for those particular observations, giving them more influence in the model fit). But because resampling changes the data itself rather than the loss function, it interacts with the train/validation splitting that typically takes place during training (for example, if you resample before splitting, duplicates of the same observation can land on both sides of the split), so the two approaches will yield slightly different results in practice.
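For contrast, here is a sketch of plain random oversampling using `sklearn.utils.resample` (SMOTE itself lives in the separate imbalanced-learn package; this shows the simple duplication variant the answer describes, on a toy array rather than real data):

```python
# Sketch: random oversampling of the minority class by duplicating
# its rows until the class counts match.
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = np.array([0] * 95 + [1] * 5)  # 95/5 imbalance

X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]

# Sample minority rows with replacement up to the majority count.
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
print(np.bincount(y_bal))  # both classes now have 95 samples
```

Note that to avoid leakage, resampling like this should be applied only to the training portion after splitting, never to the full dataset.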

Please refer to https://datascience.stackexchange.com/questions/52627/why-class-weight-is-outperforming-oversampling

Constanza Garcia answered Oct 02 '22 02:10