If I am training an SVM on a large training set, and the class variable is either True or False, would having very few True values compared to the number of False values in the training set affect the trained model/results? Should the two classes be equally represented? If my training set doesn't have an equal distribution of True and False, how do I handle this so that training works as well as possible?
It's fine to have imbalanced data, because an SVM can assign a greater penalty to misclassification errors on the less frequent class (e.g. "True" in your case), rather than weighting all errors equally, which tends to produce the undesirable classifier that assigns everything to the majority class. That said, you may still get better results with balanced data. It really depends on your data.
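To make the penalty idea concrete, here is a minimal stdlib-only sketch of the common inverse-frequency heuristic for per-class weights, w_c = n_samples / (n_classes * n_c) (this is the same formula scikit-learn uses for `class_weight='balanced'`; the function name here is my own):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    w_c = n_samples / (n_classes * n_c).
    The rarer the class, the larger its misclassification penalty."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}

# 90 False vs 10 True: errors on True cost 9x more than errors on False
labels = [False] * 90 + [True] * 10
weights = balanced_class_weights(labels)
print(weights)  # True gets weight 5.0, False gets ~0.556
```

In practice you would pass weights like these to your SVM library's per-class penalty option (e.g. `class_weight` in scikit-learn's `SVC`, or per-class `C` in LIBSVM) rather than compute them by hand.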
You could also resample the data artificially to get a more balanced training set. See this paper for a discussion: http://pages.stern.nyu.edu/~fprovost/Papers/skew.PDF.
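One simple way to rebalance, sketched below with the standard library only (the function name and seed are illustrative): randomly duplicate minority-class examples until both classes have the same count.

```python
import random

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class examples until the classes
    are balanced, then shuffle so duplicates are spread through the set."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label]
    neg = [i for i, label in enumerate(y) if not label]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Draw (with replacement) enough minority indices to close the gap
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]

X = [[i] for i in range(100)]
y = [True] * 5 + [False] * 95
Xb, yb = oversample_minority(X, y)
print(sum(yb), len(yb))  # 95 True out of 190 total: now balanced
```

Note that duplicating points does not add information, so it can encourage overfitting on the minority class; undersampling the majority class, or the cost-sensitive weighting described above, are common alternatives.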