My colleague and I are trying to wrap our heads around the difference between logistic regression and an SVM. Clearly they are optimizing different objective functions. Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss? Or is it more complex than that? How do the support vectors come into play? What about the slack variables? Why can't you have deep SVMs the way you can have a deep neural network with sigmoid activation functions?
In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).
From our SVM model, we know that the hinge loss is max(0, 1 − yf(x)). Looking at its graph, we can see that for yf(x) ≥ 1 the hinge loss is 0. However, when yf(x) < 1, the hinge loss grows linearly as the margin shrinks.
The hinge loss is a special type of cost function that penalizes not only misclassified samples but also correctly classified ones that fall within a defined margin of the decision boundary. It is most commonly employed in soft-margin support vector machines.
In SVM models, the loss function measures the empirical risk on a given training set. The characteristics and performance of an SVM model depend on how it measures this empirical error, so the choice of loss function is crucial.
I will answer one thing at a time.
Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss?
SVM is simply a linear classifier, optimizing hinge loss with L2 regularization.
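Written out, that is the usual primal objective: an L2 penalty on the weights plus the summed hinge losses over the training set. Here is a minimal numpy sketch of that objective (the function name and the C trade-off parameter are my own illustration, not a standard API):

```python
import numpy as np

def svm_primal_objective(w, X, y, C=1.0):
    """Soft-margin linear SVM objective: L2 penalty plus hinge loss.

    X: (n_samples, n_features), y: labels in {-1, +1}, w: weight vector.
    """
    margins = y * (X @ w)                      # y_i * <w, x_i>
    hinge = np.maximum(0.0, 1.0 - margins)     # exactly 0 once the margin is >= 1
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```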
Or is it more complex than that?
No, it is "just" that; however, there are different ways of looking at this model that lead to complex, interesting conclusions. In particular, this specific choice of loss function leads to extremely efficient kernelization, which is not true for the log loss (logistic regression) or the MSE (linear regression). Furthermore, you can show very important theoretical properties, such as those related to the Vapnik–Chervonenkis dimension, which lead to a smaller chance of overfitting.
Intuitively, look at these three common losses (writing p for the model's prediction f(x) and y ∈ {−1, +1} for the label):
hinge loss: max(0, 1 − py)
log loss: log(1 + exp(−py))
squared loss: (p − y)^2
Only the first one has the property that once a sample is classified correctly (with enough margin), it incurs zero penalty. The remaining two still penalize your linear model even when it classifies samples correctly. Why? Because they are more closely related to regression than to classification: they want a perfect prediction, not just a correct one, as the short sketch below illustrates.
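A quick numerical illustration of that difference (the log loss is written in its margin form since labels are ±1; the numbers are purely illustrative):

```python
import numpy as np

# Margins m = y * f(x): a value greater than 1 means "correct with room to spare".
m = np.array([2.0, 1.0, 0.5, -1.0])

hinge   = np.maximum(0.0, 1.0 - m)      # exactly 0 for m >= 1
logloss = np.log(1.0 + np.exp(-m))      # always > 0, even for m = 2
squared = (1.0 - m) ** 2                # penalizes overshooting the target too

print(hinge)    # [0.  0.  0.5 2. ]
print(logloss)  # approx. [0.127 0.313 0.474 1.313]
print(squared)  # [1.   0.   0.25 4.  ]
```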
How do the support vectors come into play?
Support vectors are simply samples placed near the decision boundary (loosely speaking). In the linear case this does not change much, but since most of the power of an SVM lies in its kernelization, that is where support vectors become extremely important. Once you introduce a kernel, the hinge loss lets the SVM solution be obtained efficiently, and the support vectors are the only samples remembered from the training set; the non-linear decision boundary is built from this subset of the training data.
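Here is a small scikit-learn sketch of that idea, assuming scikit-learn is available; the moons dataset and the RBF kernel are just one illustrative choice. The fitted model keeps only the support vectors, and the kernelized decision function is a sum over them alone:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

# Only the support vectors are kept; the rest of the training set is discarded.
print(clf.support_vectors_.shape)   # (n_SV, 2), typically well below 200

# The decision function is a kernel expansion over support vectors only:
# f(x) = sum_i alpha_i * K(sv_i, x) + b
def decision(x, clf, gamma=1.0):
    k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return clf.dual_coef_[0] @ k + clf.intercept_[0]

x0 = X[0]
print(np.isclose(decision(x0, clf), clf.decision_function([x0])[0]))  # True
```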
What about the slack variables?
This is just another way of writing the hinge loss, more useful when you want to kernelize the solution and show convexity.
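Concretely, the constrained form with slack variables ξ_i and the unconstrained hinge-loss form are the same optimization problem (a standard soft-margin formulation, written here for reference):

```latex
% Constrained (slack-variable) form:
\min_{w, b, \xi} \; \tfrac{1}{2}\lVert w\rVert^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0.
% At the optimum, \xi_i = \max(0,\, 1 - y_i(w^\top x_i + b)), so this equals the
% unconstrained hinge-loss form:
\min_{w, b} \; \tfrac{1}{2}\lVert w\rVert^2 + C \sum_i \max\bigl(0,\, 1 - y_i (w^\top x_i + b)\bigr).
```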
Why can't you have deep SVMs the way you can have a deep neural network with sigmoid activation functions?
You can; however, since an SVM is not a probabilistic model, its training might be a bit tricky. Furthermore, the whole strength of an SVM comes from its efficiency and global optimum, both of which would be lost once you create a deep network. That said, such models do exist; in particular, an SVM (with the squared hinge loss) is nowadays often the choice for the topmost layer of deep networks, so the whole optimization is effectively a deep SVM. Adding more layers in between has nothing to do with the SVM cost or any other cost; those layers are defined completely by their activations. You could, for example, use an RBF activation function, but it has been shown numerous times that this leads to weak models (the features detected are too local).
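As a rough illustration of that last point, here is a minimal PyTorch sketch of an ordinary MLP whose top layer is trained with a squared hinge loss; the architecture, data, and hyperparameters are purely illustrative:

```python
import torch
import torch.nn as nn

# Toy data: labels in {-1, +1}, as the hinge loss expects.
X = torch.randn(256, 20)
y = (X[:, 0] > 0).float() * 2 - 1

# An ordinary MLP; only the loss on the top layer is "SVM-like".
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-3)  # L2 on the weights

for _ in range(100):
    opt.zero_grad()
    scores = model(X).squeeze(1)
    # Squared hinge loss: penalize margins y * f(x) below 1, quadratically.
    loss = torch.clamp(1.0 - y * scores, min=0.0).pow(2).mean()
    loss.backward()
    opt.step()
```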
To sum up: an SVM is a linear classifier trained with the hinge loss and L2 regularization; its real power comes from kernelization, where the support vectors are the only training samples that define the decision boundary; slack variables are just another way of writing the hinge loss; and while you can put an SVM loss on top of a deep network, doing so gives up the efficiency and global optimum that make a standalone SVM attractive.