This is a follow-up question to "How to know what classes are represented in return array from predict_proba in Scikit-learn".
In that question, I quoted the following code:
>>> import sklearn
>>> sklearn.__version__
'0.13.1'
>>> from sklearn import svm
>>> model = svm.SVC(probability=True)
>>> X = [[1,2,3], [2,3,4]] # feature vectors
>>> Y = ['apple', 'orange'] # classes
>>> model.fit(X, Y)
>>> model.predict_proba([1,2,3])
array([[ 0.39097541,  0.60902459]])
I discovered in that question that this result represents the probability of the point belonging to each class, in the order given by model.classes_:
>>> zip(model.classes_, model.predict_proba([1,2,3])[0])
[('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
So... this answer, if interpreted correctly, says that the point is probably an 'orange' (with a fairly low confidence, due to the tiny amount of data). But intuitively, this result is obviously incorrect, since the point given was identical to the training data for 'apple'. Just to be sure, I tested the reverse as well:
>>> zip(model.classes_, model.predict_proba([2,3,4])[0])
[('apple', 0.60705475211840931), ('orange', 0.39294524788159074)]
Again, obviously incorrect, but in the other direction.
Finally, I tried it with points that were much further away.
>>> X = [[1,1,1], [20,20,20]] # feature vectors
>>> model.fit(X, Y)
>>> zip(model.classes_, model.predict_proba([1,1,1])[0])
[('apple', 0.33333332048410247), ('orange', 0.66666667951589786)]
Again, the model predicts the wrong probabilities. BUT, the model.predict function gets it right!
>>> model.predict([1,1,1])[0]
'apple'
Now, I remember reading something in the docs about predict_proba being inaccurate for small datasets, though I can't seem to find it again. Is this the expected behaviour, or am I doing something wrong? If this IS the expected behaviour, then why do the predict and predict_proba functions disagree on the output? And importantly, how big does the dataset need to be before I can trust the results from predict_proba?
-------- UPDATE --------
Ok, so I did some more 'experiments' into this: the behaviour of predict_proba is heavily dependent on 'n', but not in any predictable way!
>>> def train_test(n):
...     X = [[1,2,3], [2,3,4]] * n
...     Y = ['apple', 'orange'] * n
...     model.fit(X, Y)
...     print "n =", n, zip(model.classes_, model.predict_proba([1,2,3])[0])
...
>>> train_test(1)
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
>>> for n in range(1,10):
...     train_test(n)
...
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
n = 2 [('apple', 0.98437355278112448), ('orange', 0.015626447218875527)]
n = 3 [('apple', 0.90235408180319321), ('orange', 0.097645918196806694)]
n = 4 [('apple', 0.83333299908143665), ('orange', 0.16666700091856332)]
n = 5 [('apple', 0.85714254878984497), ('orange', 0.14285745121015511)]
n = 6 [('apple', 0.87499969631893626), ('orange', 0.1250003036810636)]
n = 7 [('apple', 0.88888844127886335), ('orange', 0.11111155872113669)]
n = 8 [('apple', 0.89999988018127364), ('orange', 0.10000011981872642)]
n = 9 [('apple', 0.90909082368682159), ('orange', 0.090909176313178491)]
How should I use this function safely in my code? At the very least, is there any value of n for which it will be guaranteed to agree with the result of model.predict?
predict_proba gives you the probabilities for each target class ('apple' and 'orange' in your case) in array form. The number of probabilities in each row equals the number of categories in the target variable (2 in your case).
The predict method returns the actual class, while the predict_proba method returns the class probabilities (i.e. the probability that a particular data point falls into each of the underlying classes).
For predict_proba(X_input), each row of the output consists of 2 columns, one for the probability of each class.
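As a minimal sketch (assuming a recent scikit-learn, where predict and predict_proba expect 2D inputs), this is how the columns of predict_proba line up with model.classes_, and how a label can be recovered from the probabilities with an argmax:

import numpy as np
from sklearn import svm

X = [[1, 2, 3], [2, 3, 4]]   # feature vectors
Y = ['apple', 'orange']      # class labels
model = svm.SVC(probability=True).fit(X, Y)

proba = model.predict_proba([[1, 2, 3]])    # shape (n_samples, n_classes)
print(dict(zip(model.classes_, proba[0])))  # column i corresponds to model.classes_[i]

# Recover a label from the probabilities: argmax over the columns.
label_from_proba = model.classes_[np.argmax(proba[0])]
print(label_from_proba, model.predict([[1, 2, 3]])[0])  # for SVC these two can disagree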
For a random forest, predict_proba() returns the number of votes for each class divided by the number of trees in the forest, so its resolution is exactly 1/n_estimators. If you want to see variation at the 5th digit, you will need 10**5 = 100,000 estimators, which is excessive.
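That 1/n_estimators resolution is easy to see on a toy example. This is only an illustrative sketch; the dataset and parameters below are made up, not taken from the question:

from sklearn.ensemble import RandomForestClassifier

# Tiny, cleanly separable toy data (illustrative only).
X = [[0], [0], [0], [1], [1], [1]]
y = [0, 0, 0, 1, 1, 1]

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# With pure leaves each tree effectively votes 0 or 1, so every probability
# reported here is a multiple of 1/n_estimators = 0.1.
print(clf.predict_proba([[0], [1]]))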
predict_proba
is using the Platt scaling feature of libsvm to calibrate probabilities, see:
So indeed the hyperplane predictions and the probability calibration can disagree, especially if you only have 2 samples in your dataset. It's weird that the internal cross-validation done by libsvm for scaling the probabilities does not fail (explicitly) in this case. Maybe this is a bug. One would have to dive into the Platt scaling code of libsvm to understand what's happening.
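If you want a quick way to detect this kind of disagreement before trusting predict_proba, one option is to compare the argmax of the calibrated probabilities against predict. This is just a sketch; check_calibration_agreement is a hypothetical helper, not part of scikit-learn:

import numpy as np
from sklearn import svm

def check_calibration_agreement(model, X):
    """Fraction of samples where argmax(predict_proba) matches predict."""
    proba_labels = model.classes_[np.argmax(model.predict_proba(X), axis=1)]
    return np.mean(proba_labels == model.predict(X))

X = [[1, 2, 3], [2, 3, 4]] * 10
Y = ['apple', 'orange'] * 10
model = svm.SVC(probability=True).fit(X, Y)
print(check_calibration_agreement(model, X))  # 1.0 only if the two methods never disagree on X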