When should one use LinearSVC or SVC?

Tags: machine-learning, svm, scikit-learn

From my research, I found three conflicting results:

1. SVC(kernel="linear") is better
2. LinearSVC is better
3. Doesn't matter

Can someone explain when to use LinearSVC vs. SVC(kernel="linear")?

It seems like LinearSVC is marginally better than SVC and is usually more finicky. But if scikit-learn decided to spend time implementing a specific case for linear classification, why wouldn't LinearSVC outperform SVC?

915

asked Jan 29 '16 03:01

THIS USER NEEDS HELP

2 Answers

Mathematically, training an SVM is a convex optimization problem, usually with a unique minimizer, so in principle there is only one solution.

The differences in results come from several aspects. SVC and LinearSVC are supposed to optimize the same problem, but in fact all liblinear estimators penalize the intercept, whereas the libsvm ones don't (IIRC). This leads to a different mathematical optimization problem and thus to different results. There may also be other subtle differences, such as scaling and the default loss function: LinearSVC defaults to the squared hinge loss, so make sure you set loss='hinge' in LinearSVC if you want it to match SVC. Finally, in multiclass classification, liblinear does one-vs-rest by default whereas libsvm does one-vs-one.
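
The multiclass difference is easy to see in the shape of the fitted coefficients. The following is a minimal sketch on synthetic data (the dataset and its parameters are made up for illustration and are not part of the original answer):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Synthetic 4-class problem (illustrative data only).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

svc = SVC(kernel="linear").fit(X, y)
lin = LinearSVC(max_iter=10000).fit(X, y)

# One-vs-one (libsvm): one classifier per pair of classes, 4 * 3 / 2 = 6 rows.
print(svc.coef_.shape)  # (6, 8)
# One-vs-rest (liblinear): one classifier per class, 4 rows.
print(lin.coef_.shape)  # (4, 8)
```

With only two classes the strategies coincide, so this source of disagreement only shows up on multiclass problems.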

SGDClassifier(loss='hinge') differs from the other two in that it uses stochastic gradient descent rather than a deterministic batch solver, so it may not converge to the same solution. However, the solution it finds may generalize better.
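
As a rough illustration, here is a sketch that fits SGDClassifier(loss='hinge') and LinearSVC(loss='hinge') on the same toy data. The data, the alpha value and the random seeds are assumptions chosen for the example. On an easy problem like this both should classify almost everything correctly, but the learned weights generally differ:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Two well-separated blobs (illustrative data only).
X, y = make_blobs(
    n_samples=500, centers=[[-3, -3], [3, 3]], cluster_std=1.0, random_state=0
)
X = StandardScaler().fit_transform(X)

# Hinge loss optimized by stochastic gradient descent.
sgd = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, tol=1e-3,
                    random_state=0).fit(X, y)
# Hinge loss optimized by liblinear's dual solver.
lin = LinearSVC(loss="hinge", dual=True, max_iter=10000, random_state=0).fit(X, y)

sgd_acc = sgd.score(X, y)
lin_acc = lin.score(X, y)
print("SGD accuracy:", sgd_acc)
print("LinearSVC accuracy:", lin_acc)
print("SGD weights:", np.round(sgd.coef_.ravel(), 3))
print("LinearSVC weights:", np.round(lin.coef_.ravel(), 3))
```

Note that SGDClassifier parameterizes regularization with alpha while LinearSVC uses C, so the two objectives are only comparable after matching those settings.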

Between SVC and LinearSVC, one important decision criterion is that LinearSVC tends to converge faster as the number of samples grows. This is because the linear kernel is a special case that liblinear is optimized for, but libsvm is not.
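
If you want to check this on your own machine, a small timing harness along these lines may help. It is only a sketch: the sample sizes and the synthetic data are arbitrary choices, the helper fit_seconds is a hypothetical name, and the absolute numbers depend heavily on hardware and data:

```python
import time

from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC


def fit_seconds(model, X, y):
    """Return the wall-clock time in seconds taken by model.fit(X, y)."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


timings = {}
for n_samples in (500, 2000):
    # Illustrative synthetic data; replace with your own dataset.
    X, y = make_classification(n_samples=n_samples, n_features=20, random_state=0)
    X = StandardScaler().fit_transform(X)
    timings[n_samples] = {
        "SVC(kernel='linear')": fit_seconds(SVC(kernel="linear"), X, y),
        "LinearSVC": fit_seconds(LinearSVC(dual=True, max_iter=10000), X, y),
    }

for n_samples, row in timings.items():
    print(n_samples, row)
```

Increase the sample sizes gradually; the gap, if any, should become more visible as the dataset grows.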

127

answered Oct 08 '22 00:10

eickenberg

The real issue is scikit-learn's naming, which calls something an SVM that is not one. By default, LinearSVC minimizes the squared hinge loss rather than the plain hinge loss, and it also penalizes the size of the bias, which a standard SVM does not. For more details, see the other question: "Under what parameters are SVC and LinearSVC in scikit-learn equivalent?"
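
A sketch of how one might bring the two closer together, assuming the differences named above are the main ones: switch LinearSVC to loss='hinge' and raise intercept_scaling so that the penalty on the bias matters less. The dataset and the value 10.0 are illustrative assumptions, and the two models will still not be exactly identical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Illustrative binary problem with one cluster per class.
X, y = make_classification(
    n_samples=300, n_features=10, n_clusters_per_class=1,
    class_sep=2.0, random_state=0,
)
X = StandardScaler().fit_transform(X)

svc = SVC(kernel="linear", C=1.0).fit(X, y)
lin = LinearSVC(
    loss="hinge",            # plain hinge loss, as in SVC
    C=1.0,
    dual=True,               # hinge loss is only available in the dual solver
    intercept_scaling=10.0,  # larger value weakens the penalty on the bias
    max_iter=100000,
    random_state=0,
).fit(X, y)

# Fraction of training points on which the two models predict the same label.
agreement = np.mean(svc.predict(X) == lin.predict(X))
print("prediction agreement:", agreement)
print("SVC coef:", np.round(svc.coef_.ravel(), 3))
print("LinearSVC coef:", np.round(lin.coef_.ravel(), 3))
```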

So which one should you use? It is purely problem-specific. By the no free lunch theorem, it is impossible to say "this loss function is best, period". Sometimes the squared hinge loss will work better, sometimes the plain hinge loss.
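
In practice, one way to decide is to treat the loss as a hyperparameter and let cross-validation pick it. A minimal sketch, assuming a synthetic dataset and an arbitrary grid of C values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Illustrative data; substitute your own X and y.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

pipe = make_pipeline(StandardScaler(), LinearSVC(dual=True, max_iter=20000))
param_grid = {
    # make_pipeline names the step after the lowercased class name.
    "linearsvc__loss": ["hinge", "squared_hinge"],
    "linearsvc__C": [0.01, 0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```

Which loss wins will vary from dataset to dataset, which is exactly the point made above.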

answered Oct 08 '22 01:10

lejlot
