
How to balance unbalanced classification 1:1 with SMOTE in R

I am doing binary classification, and my target class currently consists of: Bad: 3126, Good: 25038.

I want the number of Bad (minority) examples to equal the number of Good examples (1:1). That means Bad needs to increase about 8x (an extra 21912 SMOTEd instances) while the majority class (Good) stays unchanged. The code I have tried does not keep the number of Good examples constant.

Code I have tried:

Example 1:

library(DMwR)
smoted_data <- SMOTE(targetclass~., data, perc.over=700, perc.under=0, k=5, learner=NULL)

Example 1 output: Bad:25008 Good:0

Example 2:

smoted_data <- SMOTE(targetclass~., data, perc.over=700, k=5, learner=NULL)

Example 2 output: Bad: 25008 Good:43764

Example 3:

smoted_data <- SMOTE(targetclass~., data, perc.over=700, perc.under=100, k=5, learner=NULL)

Example 3 output: Bad: 25008 Good: 21882

CJava asked Apr 15 '16 15:04


3 Answers

To achieve a 1:1 balance using SMOTE, you want to do this:

library(DMwR)
smoted_data <- SMOTE(targetclass~., data, perc.over=100)

I have to admit it doesn't seem obvious from the built-in documentation, but if you read the original documentation, it states:

The parameters perc.over and perc.under control the amount of over-sampling of the minority class and under-sampling of the majority classes, respectively.

perc.over will typically be a number above 100. For each case in the original data set belonging to the minority class, perc.over/100 new examples of that class will be created. If perc.over is a value below 100 then a single case will be generated for a randomly selected proportion (given by perc.over/100) of the cases belonging to the minority class on the original data set.

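So when perc.over is 100, you are essentially creating 1 new example for each minority case (100/100 = 1), which doubles the size of the minority class.

The default of perc.under is 200, and that is what you want to keep: it selects twice as many majority cases as there are newly generated minority cases, which matches the doubled minority class.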

The parameter perc.under controls the proportion of cases of the majority class that will be randomly selected for the final "balanced" data set. This proportion is calculated with respect to the number of newly generated minority class cases.

prop.table(table(smoted_data$targetclass))
# returns 0.5  0.5
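
The class sizes follow directly from the two parameter descriptions quoted above, so they can be estimated without running SMOTE at all. Here is a rough sketch in base R. The helper name smote_counts is made up for illustration and is not part of DMwR, and it assumes perc.over is a multiple of 100:

```r
# Hypothetical helper, not part of DMwR: predicts the class counts after
# SMOTE() from the parameter descriptions quoted above.
# Assumes perc.over is at least 100 and a multiple of 100.
smote_counts <- function(n_minority, perc.over, perc.under = 200) {
  new_minority <- n_minority * perc.over / 100           # synthetic cases created
  c(minority = n_minority + new_minority,                # originals plus synthetic
    majority = floor(new_minority * perc.under / 100))   # majority cases sampled
}

smote_counts(3126, perc.over = 100)                    # the call in this answer
smote_counts(3126, perc.over = 700, perc.under = 0)    # Example 1 in the question
smote_counts(3126, perc.over = 700)                    # Example 2 in the question
smote_counts(3126, perc.over = 700, perc.under = 100)  # Example 3 in the question
```

Under these assumptions the last three calls reproduce the counts reported in the question (25008/0, 25008/43764 and 25008/21882). The first call gives 6252 cases in each class, so perc.over=100 balances the data 1:1 but also shrinks Good from 25038 to 6252.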
onlyphantom answered Sep 29 '22 07:09


You can try using the ROSE package in R.

A research article with an example is available here

Arvind answered Sep 29 '22 07:09


You should use a perc.under of 114.423 together with perc.over=700, since (700/100) x 3126 x (114.423/100) = 25038.04.

But note that SMOTE undersamples the majority class randomly, so the new data will contain duplicates in the majority class. That is to say, your new data will have 25038 Good samples, but they are not the same 25038 Good samples as in the original data. Some Good samples will be left out and some will be duplicated in the newly generated data.
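
The perc.under value above comes from inverting the same relationship. Below is a minimal sketch in base R. The variable names are illustrative, and the sample() call only imitates drawing rows with replacement, which is what the duplicates described above imply; it does not call SMOTE itself:

```r
n_bad     <- 3126    # minority count from the question
n_good    <- 25038   # majority count we want to end up with
perc_over <- 700

new_bad <- n_bad * perc_over / 100               # 21882 synthetic Bad cases

# Solve new_bad * perc_under / 100 = n_good for perc_under, rounding up to
# three decimals so that truncating to an integer still yields n_good.
perc_under <- ceiling(100 * n_good / new_bad * 1000) / 1000
perc_under                                       # 114.423
floor(new_bad * perc_under / 100)                # 25038

# Drawing 25038 rows with replacement from 25038 rows leaves some rows out
# and repeats others.
set.seed(1)                                      # arbitrary seed, for reproducibility
picked <- sample(n_good, n_good, replace = TRUE)
length(unique(picked))                           # distinct Good rows actually kept
```

With these settings Bad ends up at 3126 + 21882 = 25008 and Good at 25038, which is close to 1:1 but not exact.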

Yan answered Sep 29 '22 09:09