This is a follow-up question to "How to know what classes are represented in return array from predict_proba in Scikit-learn".
In that question, I quoted the following code:
>>> import sklearn
>>> sklearn.__version__
'0.13.1'
>>> from sklearn import svm
>>> model = svm.SVC(probability=True)
>>> X = [[1,2,3], [2,3,4]] # feature vectors
>>> Y = ['apple', 'orange'] # classes
>>> model.fit(X, Y)
>>> model.predict_proba([1,2,3])
array([[ 0.39097541,  0.60902459]])
I discovered in that question that this result represents the probability of the point belonging to each class, in the order given by model.classes_:
>>> zip(model.classes_, model.predict_proba([1,2,3])[0])
[('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
So... this answer, if interpreted correctly, says that the point is probably an 'orange' (with a fairly low confidence, due to the tiny amount of data). But intuitively, this result is obviously incorrect, since the point given was identical to the training data for 'apple'. Just to be sure, I tested the reverse as well:
>>> zip(model.classes_, model.predict_proba([2,3,4])[0])
[('apple', 0.60705475211840931), ('orange', 0.39294524788159074)]
Again, obviously incorrect, but in the other direction.
Finally, I tried it with points that were much further away.
>>> X = [[1,1,1], [20,20,20]] # feature vectors
>>> model.fit(X, Y)
>>> zip(model.classes_, model.predict_proba([1,1,1])[0])
[('apple', 0.33333332048410247), ('orange', 0.66666667951589786)]
Again, the model predicts the wrong probabilities. BUT, the model.predict function gets it right!
>>> model.predict([1,1,1])[0]
'apple'
Now, I remember reading something in the docs about predict_proba being inaccurate for small datasets, though I can't seem to find it again. Is this the expected behaviour, or am I doing something wrong? If this IS the expected behaviour, then why do the predict and predict_proba functions disagree on the output? And importantly, how big does the dataset need to be before I can trust the results from predict_proba?
-------- UPDATE --------
Ok, so I did some more 'experiments' into this: the behaviour of predict_proba is heavily dependent on 'n', but not in any predictable way!
>>> def train_test(n):
...     X = [[1,2,3], [2,3,4]] * n
...     Y = ['apple', 'orange'] * n
...     model.fit(X, Y)
...     print "n =", n, zip(model.classes_, model.predict_proba([1,2,3])[0])
...
>>> train_test(1)
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
>>> for n in range(1,10):
...     train_test(n)
...
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
n = 2 [('apple', 0.98437355278112448), ('orange', 0.015626447218875527)]
n = 3 [('apple', 0.90235408180319321), ('orange', 0.097645918196806694)]
n = 4 [('apple', 0.83333299908143665), ('orange', 0.16666700091856332)]
n = 5 [('apple', 0.85714254878984497), ('orange', 0.14285745121015511)]
n = 6 [('apple', 0.87499969631893626), ('orange', 0.1250003036810636)]
n = 7 [('apple', 0.88888844127886335), ('orange', 0.11111155872113669)]
n = 8 [('apple', 0.89999988018127364), ('orange', 0.10000011981872642)]
n = 9 [('apple', 0.90909082368682159), ('orange', 0.090909176313178491)]
How should I use this function safely in my code? At the very least, is there any value of n for which it will be guaranteed to agree with the result of model.predict?
predict_proba gives you the probabilities for each target class ('apple' and 'orange' in your case) in array form. The number of probabilities in each row equals the number of categories in the target variable (2 in your case).
The predict method returns the actual class, while the predict_proba method returns the class probabilities (i.e. the probability that a particular data point falls into each of the underlying classes).
For predict_proba(X_input), each row of the output consists of 2 columns, one for the probability of each class.
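As a minimal sketch (assuming a recent scikit-learn, where predict and predict_proba expect 2D inputs), this is how the columns of predict_proba line up with model.classes_, and how a label can be recovered from the probabilities with an argmax:

import numpy as np
from sklearn import svm

X = [[1, 2, 3], [2, 3, 4]]   # feature vectors
Y = ['apple', 'orange']      # class labels
model = svm.SVC(probability=True).fit(X, Y)

proba = model.predict_proba([[1, 2, 3]])    # shape (n_samples, n_classes)
print(dict(zip(model.classes_, proba[0])))  # column i corresponds to model.classes_[i]

# Recover a label from the probabilities: argmax over the columns.
label_from_proba = model.classes_[np.argmax(proba[0])]
print(label_from_proba, model.predict([[1, 2, 3]])[0])  # for SVC these two can disagree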
For a random forest, predict_proba() returns the number of votes for each class divided by the number of trees in the forest, so its resolution is exactly 1/n_estimators. If you want to see variation at the 5th digit, you will need 10**5 = 100,000 estimators, which is excessive.
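That 1/n_estimators resolution is easy to see on a toy example. This is only an illustrative sketch; the dataset and parameters below are made up, not taken from the question:

from sklearn.ensemble import RandomForestClassifier

# Tiny, cleanly separable toy data (illustrative only).
X = [[0], [0], [0], [1], [1], [1]]
y = [0, 0, 0, 1, 1, 1]

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# With pure leaves each tree effectively votes 0 or 1, so every probability
# reported here is a multiple of 1/n_estimators = 0.1.
print(clf.predict_proba([[0], [1]]))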
predict_proba
is using the Platt scaling feature of libsvm to calibrate probabilities, see:
So indeed the hyperplane predictions and the probability calibration can disagree, especially if you only have 2 samples in your dataset. It's weird that the internal cross-validation done by libsvm for scaling the probabilities does not fail (explicitly) in this case. Maybe this is a bug. One would have to dive into the Platt scaling code of libsvm to understand what's happening.
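If you want a quick way to detect this kind of disagreement before trusting predict_proba, one option is to compare the argmax of the calibrated probabilities against predict. This is just a sketch; check_calibration_agreement is a hypothetical helper, not part of scikit-learn:

import numpy as np
from sklearn import svm

def check_calibration_agreement(model, X):
    """Fraction of samples where argmax(predict_proba) matches predict."""
    proba_labels = model.classes_[np.argmax(model.predict_proba(X), axis=1)]
    return np.mean(proba_labels == model.predict(X))

X = [[1, 2, 3], [2, 3, 4]] * 10
Y = ['apple', 'orange'] * 10
model = svm.SVC(probability=True).fit(X, Y)
print(check_calibration_agreement(model, X))  # 1.0 only if the two methods never disagree on X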