I am doing binary classification and my current target class is composed of: Bad: 3126 Good:25038
I want the number of Bad (minority) examples to equal the number of Good examples (1:1). That means Bad has to grow roughly 8x (an extra 21912 SMOTEd instances) while the majority class (Good) stays unchanged. None of the calls I have tried keeps the number of Good examples constant.
Code I have tried:
Example 1:
library(DMwR)
smoted_data <- SMOTE(targetclass~., data, perc.over=700, perc.under=0, k=5, learner=NULL)
Example 1 output: Bad:25008 Good:0
Example 2:
smoted_data <- SMOTE(targetclass~., data, perc.over=700, k=5, learner=NULL)
Example 2 output: Bad: 25008 Good:43764
Example 3:
smoted_data <- SMOTE(targetclass~., data, perc.over=700, perc.under=100, k=5, learner=NULL)
Example 3 output: Bad: 25008 Good: 21882
To achieve a 1:1 balance using SMOTE, you want to do this:
library(DMwR)
smoted_data <- SMOTE(targetclass~., data, perc.over=100)
I have to admit it doesn't seem obvious from the built-in documentation, but if you read the original documentation, it states:
The parameters perc.over and perc.under control the amount of over-sampling of the minority class and under-sampling of the majority classes, respectively. perc.over will typically be a number above 100. For each case in the original data set belonging to the minority class, perc.over/100 new examples of that class will be created. If perc.over is a value below 100 then a single case will be generated for a randomly selected proportion (given by perc.over/100) of the cases belonging to the minority class on the original data set.
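So when perc.over is 100, you are essentially creating 1 new example for each minority case (100/100 = 1), which doubles the minority class.
The default of perc.under is 200, and that is what you want to keep, because it selects twice as many majority cases as there are new minority cases, which matches the doubled minority class.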
The parameter perc.under controls the proportion of cases of the majority class that will be randomly selected for the final "balanced" data set. This proportion is calculated with respect to the number of newly generated minority class cases.
prop.table(table(smoted_data$targetclass))
# returns 0.5 0.5
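To make the arithmetic in the quoted documentation concrete, here is a small base-R sketch that predicts the class counts for a given perc.over and perc.under. The helper name smote_counts is made up for illustration and is not part of DMwR, and the sketch only covers perc.over values that are multiples of 100. The majority count it reports is the size of the random sample SMOTE draws, not the original 25038.

```r
# Hypothetical helper (not part of DMwR): expected class counts after SMOTE,
# following the documented meaning of perc.over and perc.under.
# Assumption: perc.over is a multiple of 100.
smote_counts <- function(n_min, perc.over, perc.under) {
  stopifnot(perc.over >= 100, perc.over %% 100 == 0)
  n_new <- n_min * perc.over / 100                      # synthetic minority cases
  c(minority = n_min + n_new,                           # originals plus synthetic
    majority = as.integer(n_new * perc.under / 100))    # sampled relative to n_new
}

smote_counts(3126, 700, 0)    # Example 1 in the question: 25008 and 0
smote_counts(3126, 700, 200)  # Example 2 in the question: 25008 and 43764
smote_counts(3126, 700, 100)  # Example 3 in the question: 25008 and 21882
smote_counts(3126, 100, 200)  # the call suggested above: 6252 and 6252
```

The last call shows that perc.over=100 with the default perc.under gives a 1:1 ratio, but at 6252 rows per class rather than at the original 25038 Good rows.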
You can try using the ROSE package in R.
A research article with an example is available here
With perc.over=700, you should use a perc.under of 114.423, since (700/100) x 3126 x (114.423/100) = 25038.04.
But note that SMOTE randomly undersamples the majority class, so this way you would get a new data set with duplicates in the majority class. That is to say, your new data will have 25038 Good samples, but they are not the same 25038 Good samples as in the original data. Some Good samples will not be included and some will be duplicated in the newly generated data.
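As a rough check of both points, the base-R sketch below first solves for perc.under and then simulates the majority draw. It assumes the majority rows are sampled with replacement, as described above. It does not call SMOTE itself, and the variable names and the seed are arbitrary.

```r
n_bad     <- 3126    # minority count from the question
n_good    <- 25038   # majority count from the question
perc_over <- 700

# Solve (perc_over/100) * n_bad * (perc_under/100) = n_good for perc_under
n_new      <- n_bad * perc_over / 100   # 21882 synthetic Bad cases
perc_under <- 100 * n_good / n_new
perc_under                              # about 114.42

# Simulate drawing n_good majority rows WITH replacement (assumed behaviour)
set.seed(42)                            # arbitrary seed, for reproducibility only
picked     <- sample(n_good, n_good, replace = TRUE)
n_distinct <- length(unique(picked))
n_distinct                              # distinct original Good rows that survive
n_good - n_distinct                     # original Good rows that are dropped
n_distinct / n_good                     # close to 1 - exp(-1), roughly 0.63
```

Under that assumption only about 63% of the original Good rows end up in the result, and the rest of the 25038 slots are filled by repeats.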