data imbalance in SVM using libSVM

Question

How should I set my gamma and Cost parameters in libSVM when I am using an imbalanced dataset that consists of 75% 'true' labels and 25% 'false' labels? I'm getting a constant error of having all the predicted labels set on 'True' due to the data imbalance.

If the issue isn't with libSVM, but with my dataset, how should I handle this imbalance from a Theoretical Machine Learning standpoint? *The number of features I'm using is between 4-10 and I have a small set of 250 data points.

lejlot · Accepted Answer

Classes imbalance has nothing to do with selection of C and gamma, to deal with this issue you should use the class weighting scheme which is avaliable in for example scikit-learn package (built on libsvm)

Selection of best C and gamma is performed using grid search with cross validation. You should try vast range of values here, for C it is reasonable to choose values between 1 and 10^15 while a simple and good heuristic of gamma range values is to compute pairwise distances between all your data points and select gamma according to the percentiles of this distribution - think about putting in each point a gaussian distribution with variance equal to 1/gamma - if you select such gamma that this distribution overlaps will many points you will get very "smooth" model, while using small variance leads to the overfitting.

data imbalance in SVM using libSVM

Tags:

machine-learning

svm

libsvm

Arturo

1 Answers

lejlot

Recent Activity

Donate For Us

data imbalance in SVM using libSVM

Tags:

machine-learning

svm

libsvm

Arturo

1 Answers

lejlot

Related questions

Recent Activity

Donate For Us