What is the relation between the number of Support Vectors and training data and classifiers performance? [closed]

I am using LibSVM to classify some documents. As the final results show, the documents seem to be somewhat difficult to classify. However, I noticed something while training my models: if my training set has, for example, 1000 examples, around 800 of them are selected as support vectors. I have looked everywhere to find out whether this is a good thing or a bad thing. Is there a relation between the number of support vectors and the classifier's performance? I have read this previous post, but I am performing parameter selection, and I am sure that the attributes in the feature vectors are all ordered. I just need to know the relation. Thanks. P.S. I use a linear kernel.

asked Feb 28 '12 10:02

Hossein

2 Answers

Training a Support Vector Machine is an optimization problem: it attempts to find a hyperplane that divides the two classes with the largest margin. The support vectors are the points that lie on the edge of this margin or, once slack is allowed, inside it. It's easiest to understand if you build it up from simple to more complex.

Hard Margin Linear SVM

In a training set where the data is linearly separable, and you are using a hard margin (no slack allowed), the support vectors are the points that lie on the supporting hyperplanes (the hyperplanes parallel to the dividing hyperplane at the edges of the margin).

[Figure: hard-margin SVM]

All of the support vectors lie exactly on the margin. Regardless of the number of dimensions or the size of the data set, the number of support vectors could be as few as 2.
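As a minimal sketch of this, assume a toy 2-D dataset and a maximum-margin hyperplane w·x + b = 0 that is already known (the data, `w`, and `b` are hand-picked for illustration, not produced by LibSVM). The support vectors are the points whose functional margin y(w·x + b) is exactly 1:

```python
# Toy, hand-picked data and hyperplane (assumptions for illustration only).
points = [(2.0, 0.0), (3.0, 1.0), (4.0, -2.0), (0.0, 0.0), (-1.0, 1.0), (-3.0, 2.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = (1.0, 0.0), -1.0  # decision boundary: x1 = 1


def functional_margin(x, y, w, b):
    """Return y * (w . x + b); it is >= 1 for every point in a hard-margin solution."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)


def support_vector_indices(points, labels, w, b, tol=1e-9):
    """Indices of the points lying exactly on the supporting hyperplanes."""
    return [
        i
        for i, (x, y) in enumerate(zip(points, labels))
        if abs(functional_margin(x, y, w, b) - 1.0) <= tol
    ]


sv = support_vector_indices(points, labels, w, b)
print(sv)  # indices of the support vectors
```

Here only two of the six points touch the margin. Adding more points farther from the boundary would not change that count.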

Soft-Margin Linear SVM

But what if our dataset isn't linearly separable? We introduce the soft-margin SVM. We no longer require that all data points lie outside the margin; we allow some of them to stray over the line into the margin. The parameter C (nu in nu-SVM) controls how much slack is tolerated: a smaller C penalizes slack less. Allowing more slack gives us a wider margin and greater error on the training dataset, but it improves generalization and/or lets us fit a linear boundary to data that is not linearly separable.

[Figure: soft-margin linear SVM]

Now, the number of support vectors depends on how much slack we allow and on the distribution of the data. If we allow a large amount of slack, we will have a large number of support vectors. If we allow very little slack, we will have very few support vectors. The accuracy depends on finding the right level of slack for the data being analyzed. For some data it will not be possible to reach a high level of accuracy; we must simply find the best fit we can.
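To make "slack" concrete, here is a small sketch assuming the standard C-SVM formulation, where each point gets a slack xi = max(0, 1 - y(w·x + b)) and the objective is 0.5·||w||² + C·Σxi. The data and the hyperplane are made up for illustration. Points with nonzero slack violate the margin, and in a trained soft-margin SVM such points are support vectors, which is why more slack tends to mean more support vectors:

```python
def slack(x, y, w, b):
    """Hinge slack xi = max(0, 1 - y * (w . x + b)); 0 means no margin violation."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


def soft_margin_objective(points, labels, w, b, C):
    """Primal C-SVM objective: 0.5 * ||w||^2 + C * (sum of slacks)."""
    total_slack = sum(slack(x, y, w, b) for x, y in zip(points, labels))
    return 0.5 * sum(wi * wi for wi in w) + C * total_slack


# Made-up data and hyperplane, for illustration only.
points = [(3.0, 0.0), (1.5, 0.0), (0.5, 1.0), (-1.0, 0.0), (0.5, 0.0)]
labels = [1, 1, 1, -1, -1]
w, b = (1.0, 0.0), -1.0

slacks = [slack(x, y, w, b) for x, y in zip(points, labels)]
violators = [i for i, s in enumerate(slacks) if s > 0.0]
print(slacks, violators)
```

A slack between 0 and 1 means the point is inside the margin but still on the correct side; a slack above 1 means it is misclassified. Raising C makes each unit of slack more expensive in the objective.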

Non-Linear SVM

This brings us to non-linear SVM. We are still trying to linearly divide the data, but we are now trying to do it in a higher dimensional space. This is done via a kernel function, which of course has its own set of parameters. When we translate this back to the original feature space, the result is non-linear:

[Figure: non-linear SVM decision boundary in the original feature space]

Now, the number of support vectors still depends on how much slack we allow, but it also depends on the complexity of our model. Each twist and turn of the final decision boundary in our input space requires one or more support vectors to define. Ultimately, the output of an SVM is the set of support vectors, each with a coefficient alpha that in essence defines how much influence that specific support vector has on the final decision.

Here, accuracy depends on the trade-off between a high-complexity model, which may over-fit the data, and a large margin, which will incorrectly classify some of the training data in the interest of better generalization. The number of support vectors can range from very few to every single data point if you completely over-fit your data. This trade-off is controlled via C and through the choice of kernel and kernel parameters.

I assume that when you said performance you were referring to accuracy, but I thought I would also speak to performance in terms of computational complexity. In order to test a data point using an SVM model, you need to evaluate the kernel (for a linear kernel, a plain dot product) between each support vector and the test point. Therefore the computational complexity of the model is linear in the number of support vectors. Fewer support vectors means faster classification of test points.
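A sketch of why prediction cost grows with the number of support vectors, assuming the usual dual decision function f(x) = Σ alpha_i·y_i·K(s_i, x) + b. The support vectors, coefficients, and bias below are made-up values, not output from LibSVM, and the helper names are hypothetical:

```python
import math


def linear_kernel(u, v):
    """Plain dot product."""
    return sum(ui * vi for ui, vi in zip(u, v))


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def decision_value(x, support_vectors, dual_coefs, b, kernel):
    """One kernel evaluation per support vector, so cost is linear in their number."""
    return sum(c * kernel(s, x) for s, c in zip(support_vectors, dual_coefs)) + b


def collapse_to_weights(support_vectors, dual_coefs):
    """Linear kernel only: fold all support vectors into one weight vector w."""
    n_features = len(support_vectors[0])
    return [
        sum(c * s[j] for s, c in zip(support_vectors, dual_coefs))
        for j in range(n_features)
    ]


# Hypothetical model; dual_coefs[i] stands for alpha_i * y_i.
support_vectors = [(2.0, 0.0), (0.0, 0.0)]
dual_coefs = [0.5, -0.5]
b = -1.0

value = decision_value((3.0, 1.0), support_vectors, dual_coefs, b, linear_kernel)
weights = collapse_to_weights(support_vectors, dual_coefs)
print(value, weights)
```

With a linear kernel, as in the question, the sum can be collapsed into a single weight vector once after training, so the per-prediction cost no longer depends on the number of support vectors. With a non-linear kernel such as RBF that shortcut is not available.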

A good resource: A Tutorial on Support Vector Machines for Pattern Recognition

answered Sep 23 '22 04:09

karenu

800 out of 1000 tells you that the SVM needs to use almost every single training sample to encode the training set. In other words, there isn't much regularity in your data.

It sounds like you may not have enough training data. It may also help to look for more specific features that separate this data better.
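One way to make this concrete is a standard leave-one-out argument: removing a training point that is not a support vector leaves the solution unchanged, so only support vectors can be misclassified when held out. As a rough sketch, with N_SV support vectors out of n training points:

```latex
\text{LOO error} \;\le\; \frac{N_{SV}}{n},
\qquad \text{here } \frac{800}{1000} = 0.8
```

A bound of 0.8 guarantees essentially nothing, which is consistent with the poor results described in the question. It is only an upper bound, though: a high fraction of support vectors does not by itself prove that the classifier is bad.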

answered Sep 23 '22 04:09

Chris A.

Tags: machine-learning, classification, svm, libsvm
