SMOTE oversampling and cross-validation

Question

I am working on a binary classification problem in Weka with a highly imbalanced data set (90% in one category and 10% in the other). I first applied SMOTE (http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/node6.html) to the entire data set to even out the categories and then performed 10-fold cross-validation over the newly obtained data. I found (overly?) optimistic results with F1 around 90%.

Is this due to oversampling? Is it bad practice to perform cross-validation on data on which SMOTE is applied? Are there any ways to solve this problem?

jairojimenez · Accepted Answer

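I think you should split the data into training and test sets first, apply SMOTE only to the training set, and then evaluate on the test set, which contains no synthetic examples. That will give you a more realistic picture of the algorithm's performance. If SMOTE is applied before the split, synthetic points interpolated from real minority examples can land in a different fold than the examples they were built from, so the test folds are no longer independent of the training data and the scores come out inflated. The same applies to cross-validation: oversample inside each fold, on the training portion only.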
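To make the order of operations concrete, here is a rough sketch in plain Python rather than Weka. It is not the asker's setup: `smote` is a simplified stand-in for the real algorithm (it only interpolates between a minority point and one of its nearest minority neighbours), and `folds_with_smote`, `minority_label`, and the other names are made up for illustration. The point is only that the synthetic points are generated after the fold split and never reach the test part.

```python
import random

def smote(minority, n_new, k=5, rng=None):
    """Simplified SMOTE: build n_new points by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Sort the remaining minority points by squared distance to base.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def folds_with_smote(X, y, n_folds=10, minority_label=1, seed=0):
    """Yield (X_train, y_train, X_test, y_test) for each fold.
    Only the training part is oversampled; the test part stays untouched."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    # Stratify: deal the indices of each class round-robin into the folds.
    fold_of = {}
    for label in set(y):
        members = [i for i in idx if y[i] == label]
        for pos, i in enumerate(members):
            fold_of[i] = pos % n_folds
    for f in range(n_folds):
        train = [i for i in idx if fold_of[i] != f]
        test = [i for i in idx if fold_of[i] == f]
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        minority = [X[i] for i in train if y[i] == minority_label]
        # Number of synthetic points needed to balance the training part.
        n_new = (len(train) - len(minority)) - len(minority)
        new_points = smote(minority, n_new, rng=rng) if n_new > 0 else []
        X_train += new_points
        y_train += [minority_label] * len(new_points)
        yield X_train, y_train, [X[i] for i in test], [y[i] for i in test]
```

You would train your classifier on `X_train, y_train` and compute F1 on `X_test, y_test` in each fold, then average. In Weka itself, a commonly suggested way to get the same effect is to wrap the SMOTE filter and the classifier in a `FilteredClassifier` and cross-validate that, since the filter is then fitted on the training folds only; I have not verified this against your exact setup, so treat it as a pointer rather than a recipe.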

Tags: machine-learning, text-classification, weka

kverr
