
Scikit-learn predict_proba gives wrong answers

This is a follow-up to the question "How to know what classes are represented in return array from predict_proba in Scikit-learn".

In that question, I quoted the following code:

>>> import sklearn
>>> sklearn.__version__
'0.13.1'
>>> from sklearn import svm
>>> model = svm.SVC(probability=True)
>>> X = [[1,2,3], [2,3,4]] # feature vectors
>>> Y = ['apple', 'orange'] # classes
>>> model.fit(X, Y)
>>> model.predict_proba([1,2,3])
array([[ 0.39097541,  0.60902459]])

I discovered in that question that this result represents the probability of the point belonging to each class, in the order given by model.classes_:

>>> zip(model.classes_, model.predict_proba([1,2,3])[0])
[('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]

So... this answer, if interpreted correctly, says that the point is probably an 'orange' (with a fairly low confidence, due to the tiny amount of data). But intuitively, this result is obviously incorrect, since the point given was identical to the training data for 'apple'. Just to be sure, I tested the reverse as well:

>>> zip(model.classes_, model.predict_proba([2,3,4])[0])
[('apple', 0.60705475211840931), ('orange', 0.39294524788159074)]

Again, obviously incorrect, but in the other direction.

Finally, I tried it with points that were much further away.

>>> X = [[1,1,1], [20,20,20]] # feature vectors
>>> model.fit(X, Y)
>>> zip(model.classes_, model.predict_proba([1,1,1])[0])
[('apple', 0.33333332048410247), ('orange', 0.66666667951589786)]

Again, the model predicts the wrong probabilities. BUT, the model.predict function gets it right!

>>> model.predict([1,1,1])[0]
'apple'

Now, I remember reading something in the docs about predict_proba being inaccurate for small datasets, though I can't seem to find it again. Is this the expected behaviour, or am I doing something wrong? If this IS the expected behaviour, then why do the predict and predict_proba functions disagree on the output? And importantly, how big does the dataset need to be before I can trust the results from predict_proba?

-------- UPDATE --------

Ok, so I did some more 'experiments' into this: the behaviour of predict_proba is heavily dependent on 'n', but not in any predictable way!

>>> def train_test(n):
...     X = [[1,2,3], [2,3,4]] * n
...     Y = ['apple', 'orange'] * n
...     model.fit(X, Y)
...     print "n =", n, zip(model.classes_, model.predict_proba([1,2,3])[0])
...
>>> train_test(1)
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
>>> for n in range(1,10):
...     train_test(n)
...
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
n = 2 [('apple', 0.98437355278112448), ('orange', 0.015626447218875527)]
n = 3 [('apple', 0.90235408180319321), ('orange', 0.097645918196806694)]
n = 4 [('apple', 0.83333299908143665), ('orange', 0.16666700091856332)]
n = 5 [('apple', 0.85714254878984497), ('orange', 0.14285745121015511)]
n = 6 [('apple', 0.87499969631893626), ('orange', 0.1250003036810636)]
n = 7 [('apple', 0.88888844127886335), ('orange', 0.11111155872113669)]
n = 8 [('apple', 0.89999988018127364), ('orange', 0.10000011981872642)]
n = 9 [('apple', 0.90909082368682159), ('orange', 0.090909176313178491)]

How should I use this function safely in my code? At the very least, is there any value of n for which it will be guaranteed to agree with the result of model.predict?

Alex asked Jun 10 '13 06:06



1 Answer

predict_proba uses the Platt scaling feature of libsvm to calibrate probabilities, see:

  • How does sklearn.svm.svc's function predict_proba() work internally?

So indeed the hyperplane predictions and the proba calibration can disagree, especially if you only have 2 samples in your dataset. It's weird that the internal cross validation done by libsvm for scaling the probabilities does not fail (explicitly) in this case. Maybe this is a bug. One would have to dive into the Platt scaling code of libsvm to understand what's happening.
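To make the disagreement concrete, here is a minimal sketch of the sigmoid that Platt scaling applies to a decision value f, P(positive | f) = 1 / (1 + exp(A*f + B)). The function names and the values of A and B below are hypothetical, picked by hand for illustration; they are not what libsvm would fit on the data in the question. The point is only that when B is not zero, the probability crosses 0.5 at f = -B/A rather than at f = 0, so the sign of the decision value and the calibrated probability can point to different classes.

```python
import math


def platt_probability(decision_value, a, b):
    """Platt's sigmoid: P(positive | f) = 1 / (1 + exp(a * f + b))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def hyperplane_label(decision_value):
    """Label taken from the sign of the decision value alone."""
    return "positive" if decision_value > 0 else "negative"


# Hypothetical sigmoid parameters, chosen by hand (NOT fitted by libsvm).
A, B = -1.0, 0.5

for f in (-1.0, 0.2, 1.0):
    p = platt_probability(f, A, B)
    proba_label = "positive" if p > 0.5 else "negative"
    # For f = 0.2 the two labels differ: the sign says "positive",
    # but the calibrated probability is below 0.5.
    print(f, hyperplane_label(f), round(p, 3), proba_label)
```

With these made-up parameters, any decision value between 0 and 0.5 lands on the positive side of the hyperplane while getting a probability below 0.5. With only a handful of training samples, the fitted A and B are poorly determined, so an offset like this is plausible.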

ogrisel answered Oct 14 '22 17:10