I'm looking for recommendations on the best way forward for my current machine learning problem.
The outline of the problem and what I've done so far is as follows:
You can find a shortened version of the code here: http://pastebin.com/Xu13ciL4
My issues:
Now, this seems like a classic case of overfitting. However, the overfitting is unlikely to be caused by a disproportionate number of features relative to samples (32 features, 900 samples). I've tried a number of things to alleviate the problem:
I'm happy to keep thinking about the problem but at this point I'm looking for a nudge in the right direction. Where might my problem be and what could I do to solve it?
It's entirely possible that my set of features just doesn't distinguish between the 2 categories, but I'd like to try some other options before jumping to that conclusion. Furthermore, if my features don't distinguish between the categories, that would explain the low test set scores, but how would I get a perfect training set score in that case? Is that possible?
I would first try a grid search over the parameter space while also using k-fold cross-validation on the training set (and keeping the test set to the side, of course). Then pick the set of parameters that generalizes best across the folds. I suggest using GridSearchCV with StratifiedKFold (it's already the default strategy for GridSearchCV when you pass a classifier as the estimator).
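A minimal sketch of what I mean, assuming a recent scikit-learn and that `X`, `y` are your feature matrix and labels; the parameter ranges are illustrative, not tuned values:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Hold out a test set up front; X and y are assumed to be your data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Illustrative grid for an RBF SVM.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1e-4, 1e-3, 1e-2, 1e-1],
}

# StratifiedKFold is already the default for classifiers; shown explicitly here.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("cross-validated score:", grid.best_score_)
print("held-out test score:", grid.score(X_test, y_test))
```

The key point is that the test set never touches the parameter selection; only the cross-validated scores on the training folds drive the choice.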
Hypothetically, an SVM with an RBF kernel can perfectly fit any training set, since its VC dimension is infinite. So if tuning the parameters doesn't help reduce overfitting, you may want to try the same parameter tuning strategy with a simpler hypothesis, such as a linear SVM or another classifier you think may be appropriate for your domain.
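For example, a quick sketch of the same tuning applied to a linear SVM, reusing the split from above; LinearSVC and the C range are just illustrative choices:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Simpler, linear hypothesis: only C to tune.
linear_grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
linear_grid.fit(X_train, y_train)

print("linear SVM CV score:", linear_grid.best_score_)
print("linear SVM test score:", linear_grid.score(X_test, y_test))
```

If the linear model's training and cross-validation scores are much closer together, that's a hint the RBF model was indeed overfitting.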
Regularization, as you mentioned, is definitely a good idea if it's available.
The fact that it predicts the same label every time makes me think that label imbalance may be an issue, and in that case you could use different class weights. For an SVM, each class then gets its own C penalty weight. Some estimators in sklearn also accept fit parameters that let you set sample weights, controlling the penalty for individual training samples.
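A sketch of both options for an SVC, assuming labels 0 and 1; the explicit weights below are made-up values, not recommendations:

```python
from sklearn.svm import SVC

# Option 1: weights set automatically, inversely proportional to class
# frequencies in the training data.
clf_balanced = SVC(kernel="rbf", class_weight="balanced")

# Option 2: explicit per-class C multipliers (values are illustrative only).
clf_weighted = SVC(kernel="rbf", class_weight={0: 1.0, 1: 3.0})

clf_balanced.fit(X_train, y_train)

# Some estimators also accept per-sample weights at fit time, e.g.:
# clf_balanced.fit(X_train, y_train, sample_weight=my_sample_weights)
```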
Now, if you think the features may be an issue, I would use feature selection, for example by looking at the F-values provided by f_classif, which can be used with something like SelectKBest. Another option would be recursive feature elimination with cross-validation (RFECV). Feature selection can also be wrapped into the grid search if you use sklearn's Pipeline API, as sketched below.
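A rough sketch of wrapping SelectKBest into the grid search with a Pipeline; the candidate values for `k` and the SVM grid are arbitrary:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Feature selection and classifier chained so that selection is refit
# inside every cross-validation fold, avoiding selection leakage.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("svc", SVC(kernel="rbf")),
])

param_grid = {
    "select__k": [5, 10, 20, 32],     # how many of the 32 features to keep
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1e-3, 1e-2, 1e-1],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("cross-validated score:", search.best_score_)
```

Doing the selection inside the pipeline (rather than once on the full data) matters: otherwise the chosen features leak information from the validation folds and the cross-validation scores look better than they are.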