Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SMOTE oversampling and cross-validation

I am working on a binary classification problem in Weka with a highly imbalanced data set (90% in one category and 10% in the other). I first applied SMOTE (http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/node6.html) to the entire data set to even out the categories and then performed 10-fold cross-validation over the newly obtained data. I found (overly?) optimistic results with F1 around 90%.

Is this due to oversampling? Is it bad practice to perform cross-validation on data on which SMOTE is applied? Are there any ways to solve this problem?

like image 579
kverr Avatar asked Aug 06 '15 12:08

kverr


1 Answers

I think you should split the data on test and training first, then perform SMOTE just on the training part, and then test the algorithm on the part of the dataset that doesn't have synthetic examples, that'll give you a better picture of the performance of the algorithm.

like image 141
jairojimenez Avatar answered Sep 23 '22 12:09

jairojimenez