How does sklearn.svm.SVC's predict_proba() function work internally?

I am using sklearn.svm.SVC from scikit-learn to do binary classification, and I am using its predict_proba() function to get probability estimates. Can anyone tell me how predict_proba() calculates these probabilities internally?

asked Feb 27 '13 by user2115183

People also ask

How does Sklearn predict_proba work?

The predict_proba() method accepts a single argument corresponding to the data over which the probabilities will be computed, and returns an array containing the class probabilities for the input data points.
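
For concreteness, here is a minimal sketch (not from the answer above; the toy data and parameters are made up) of fitting an SVC with probability=True and calling predict_proba:

    # Minimal sketch: fit an SVC with probability estimates enabled,
    # then ask for class probabilities on a few samples.
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, random_state=0)  # toy data
    clf = SVC(probability=True, random_state=0).fit(X, y)

    proba = clf.predict_proba(X[:5])  # one row per sample, one column per class
    print(proba)                      # each row sums to 1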

How does SVM predict probability?

One standard way to obtain a “probability” out of an SVM is to use Platt scaling, which is available in many decent SVM implementations. In the binary case, the probabilities are calibrated using Platt scaling: logistic regression on the SVM's scores, fit by an additional cross-validation on the training data.
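
As an aside (not part of the quoted answer): scikit-learn also exposes Platt scaling directly through CalibratedClassifierCV with method="sigmoid", which fits the same kind of logistic model on an SVM's scores via cross-validation. A sketch on made-up data:

    # Platt scaling via CalibratedClassifierCV(method="sigmoid"):
    # a logistic model is fit on the SVM's decision scores using 5-fold CV.
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, random_state=0)  # toy data
    calibrated = CalibratedClassifierCV(SVC(), method="sigmoid", cv=5).fit(X, y)
    print(calibrated.predict_proba(X[:3]))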

What is the output of predict_proba?

For predict_proba(X_input), each row of the output has two columns, one per class, giving that class's probability.
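
A short sketch of that layout on made-up data; the column order follows clf.classes_:

    # The column order of predict_proba follows clf.classes_,
    # and each row sums to 1.
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, random_state=0)  # toy data
    clf = SVC(probability=True, random_state=0).fit(X, y)

    proba = clf.predict_proba(X[:3])
    print(clf.classes_)       # [0 1] -- column order of proba
    print(proba.shape)        # (3, 2)
    print(proba.sum(axis=1))  # [1. 1. 1.]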

What is decision function in SVM?

The output of training is a decision function that tells us how close a sample is to the decision boundary (close to the boundary means a low-confidence decision). Positive decision values mean True, negative decision values mean False.
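
To illustrate the sign convention, a small sketch on made-up data (in scikit-learn's binary case, a positive score maps to clf.classes_[1]):

    # decision_function returns signed distances to the separating hyperplane;
    # the sign of the score determines the predicted class.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, random_state=0)  # toy data
    clf = SVC().fit(X, y)

    scores = clf.decision_function(X[:5])
    preds = clf.predict(X[:5])
    print(np.all((scores > 0) == (preds == clf.classes_[1])))  # expected: True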


1 Answer

Scikit-learn uses LibSVM internally, and this in turn uses Platt scaling, as detailed in this note by the LibSVM authors, to calibrate the SVM to produce probabilities in addition to class predictions.

Platt scaling requires first training the SVM as usual, then optimizing scalar parameters A and B such that

P(y|X) = 1 / (1 + exp(A * f(X) + B)) 

where f(X) is the signed distance of a sample from the hyperplane (scikit-learn's decision_function method). You may recognize the logistic sigmoid in this definition, the same function that logistic regression and neural nets use for turning decision functions into probability estimates.
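
A worked sketch of this formula with made-up values for A and B (scikit-learn surfaces the fitted values as the probA_ and probB_ attributes of a fitted SVC). Note that the fitted A typically comes out negative, so that larger decision values yield larger probabilities:

    import numpy as np

    def platt_probability(f, A, B):
        # P(y|X) = 1 / (1 + exp(A * f + B)) for a decision value f
        return 1.0 / (1.0 + np.exp(A * f + B))

    # With A < 0, larger decision values give larger probabilities.
    for f in (-2.0, 0.0, 2.0):
        print(f, platt_probability(f, A=-1.5, B=0.1))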

Mind you: the B parameter, the "intercept" or "bias" or whatever you like to call it, can cause predictions based on these probability estimates to be inconsistent with the ones you get from the SVM decision function f. E.g. suppose that f(X) = 10; then the prediction for X is positive. But if B = -9.9 and A = 1, then P(y|X) = .475, which would make the probabilistic prediction negative. I'm pulling these numbers out of thin air, but as you can see, this can occur in practice.
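
A hypothetical check for this disagreement on made-up data (SVC.predict follows the sign of the decision function, not the calibrated probabilities):

    # Compare predict() (sign of the decision function) with the
    # argmax of predict_proba() (Platt-scaled probabilities).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, random_state=0)  # toy data
    clf = SVC(probability=True, random_state=0).fit(X, y)

    from_sign = clf.predict(X)
    from_proba = clf.classes_[np.argmax(clf.predict_proba(X), axis=1)]
    print("disagreements:", np.sum(from_sign != from_proba))  # may be nonzero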

Effectively, Platt scaling trains a probability model on top of the SVM's outputs under a cross-entropy loss function. To prevent this model from overfitting, it uses an internal five-fold cross-validation, meaning that training SVMs with probability=True can be quite a lot more expensive than training a vanilla, non-probabilistic SVM.
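
A rough way to see that extra cost on made-up data (exact timings depend on the machine and data):

    # Fitting with probability=True triggers the internal 5-fold CV
    # plus the sigmoid fit, so it is noticeably slower.
    import time
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=2000, random_state=0)  # toy data
    for prob in (False, True):
        t0 = time.perf_counter()
        SVC(probability=prob, random_state=0).fit(X, y)
        print(f"probability={prob}: {time.perf_counter() - t0:.2f}s")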

answered Sep 19 '22 by Fred Foo