I am trying to build a classifier to predict breast cancer using the UCI dataset. I am using support vector machines. Despite my most sincere efforts to improve upon the accuracy of the classifier, I cannot get beyond 97.062%. I've tried the following:
1. Finding the optimal C and gamma using grid search.
2. Finding the most discriminative features using F-score.
Can someone suggest techniques to improve the accuracy? I am aiming for at least 99%.
1. The data are already normalized to the range [0, 10]. Will normalizing them to [0, 1] help?
2. Is there some other method to find the best C and gamma?
Figure 2. The non-linear optimal hyperplane, which support-vector machine (SVM) can provide as a classification tool. Several studies have investigated SVM as a diagnostic tool for AD, and a number have shown good levels of accuracy (5–8).
Different model parameters affect the prediction accuracy of an SVM model differently. Training sample size can also influence the prediction accuracy. The method of determining the optimal SVM regression model is summarized. The prediction accuracy of the SVM model improves greatly when the proposed method is applied.
To improve performance, you could iterate through these steps:
1. Collect data: Increase the number of training examples.
2. Feature processing: Add more variables and better feature processing.
3. Model parameter tuning: Consider alternate values for the training parameters used by your learning algorithm.
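For SVM, it's important that all features have the same scale. Normally this is done by standardizing each feature (column) so that its mean is 0 and its variance is 1. Another way is to rescale each feature so that its min and max are, for example, 0 and 1. However, there isn't any real difference between [0, 1] and [0, 10] as long as C and gamma are tuned for the range you use: with an RBF kernel, multiplying every feature by 10 has the same effect as multiplying gamma by 100, so the grid search can reach the same performance on either range.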
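The equivalence between the two ranges can be checked with a minimal sketch like the one below. It assumes scikit-learn and uses `load_breast_cancer` (the Wisconsin Diagnostic data bundled with scikit-learn, which may not be the same UCI file as in the question) purely as a stand-in. The helper `fit_predict` and the values of C and gamma are illustrative choices, not tuned settings.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in data set (assumption: not necessarily the file used in the question).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

def fit_predict(upper, gamma):
    """Scale features to [0, upper], fit an RBF SVM, return test predictions."""
    scaler = MinMaxScaler(feature_range=(0, upper)).fit(X_train)
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma)
    clf.fit(scaler.transform(X_train), y_train)
    return clf.predict(scaler.transform(X_test))

# Features 10 times larger make squared distances 100 times larger,
# so gamma is divided by 100 to keep the kernel values the same.
pred_unit = fit_predict(upper=1, gamma=1.0)
pred_ten = fit_predict(upper=10, gamma=0.01)

agreement = np.mean(pred_unit == pred_ten)
print("fraction of identical predictions:", agreement)
print("accuracy on [0, 1]: ", np.mean(pred_unit == y_test))
print("accuracy on [0, 10]:", np.mean(pred_ten == y_test))
```

If gamma is left unchanged when the range changes, the two models are no longer equivalent, which is why the range only matters relative to the grid you search over.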
If you insist on using SVM for classification, another approach that may give an improvement is ensembling multiple SVMs. If you are using Python, you can try BaggingClassifier from sklearn.ensemble.
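A bagged SVM could look roughly like the following sketch. It again assumes scikit-learn and the bundled `load_breast_cancer` data as a stand-in for the data in the question. The number of estimators, the `max_samples` fraction, and the SVM settings are illustrative assumptions and would need tuning.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data set (assumption: not necessarily the file used in the question).
X, y = load_breast_cancer(return_X_y=True)

# Each ensemble member standardizes its own bootstrap sample, then fits an RBF SVM.
base = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
bagged = BaggingClassifier(base, n_estimators=10, max_samples=0.8, random_state=0)

# 5-fold cross-validated accuracy of the ensemble.
scores = cross_val_score(bagged, X, y, cv=5)
print("mean CV accuracy:", scores.mean())
```

Bagging mainly reduces variance, so the gain over a single well-tuned SVM may be small; it is worth comparing both with the same cross-validation splits.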
Also note that you can't expect arbitrarily high accuracy on a real data set. I think 97% is a very good result. If you push the accuracy higher than this, it is quite possible that you are overfitting the data.
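One way to check whether a tuned accuracy figure is optimistic is nested cross-validation, where the grid search over C and gamma runs inside each outer training fold. The sketch below assumes scikit-learn and the bundled `load_breast_cancer` data as a stand-in; the grid values and fold counts are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data set (assumption: not necessarily the file used in the question).
X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline so it is refit on each training fold only.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Small illustrative grid; a real search would usually be finer.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1e-3, 1e-2, 1e-1, 1],
}

# Inner loop: choose C and gamma. Outer loop: estimate accuracy on folds
# that played no part in choosing them.
inner = GridSearchCV(pipe, param_grid, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy per fold:", outer_scores)
print("mean:", outer_scores.mean())
```

If the nested estimate is clearly lower than the accuracy reported by the grid search itself, the higher number was probably a product of tuning on the same data it was measured on.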