Using the predict_proba() function of RandomForestClassifier safely and correctly

I'm using scikit-learn. Sometimes I need the probabilities of the labels/classes instead of the labels/classes themselves. For example, instead of the label Spam/Not Spam for an email, I want something like: the probability that a given email is Spam is 0.78.

For that purpose, I'm using predict_proba() with RandomForestClassifier as follows:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                             min_samples_split=2,  # must be >= 2 in current scikit-learn
                             random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())

classifier = clf.fit(X, y)
predictions = classifier.predict_proba(Xtest)
print(predictions)

And I got these results:

 [ 0.4  0.6]
 [ 0.1  0.9]
 [ 0.2  0.8]
 [ 0.7  0.3]
 [ 0.3  0.7]
 [ 0.3  0.7]
 [ 0.7  0.3]
 [ 0.4  0.6]

Where the second column is the probability of class Spam. However, I have two main concerns about these results. First, do they represent true probabilities of the labels, unaffected by the size of my data? Second, they show only one decimal digit, which is not specific enough in cases where a probability of 0.701 is very different from 0.708. Is there any way to get, say, the next five digits?
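A quick check can rule out a display-only cause for the second concern (a minimal sketch, assuming predictions is the array printed above):

import numpy as np

np.set_printoptions(precision=5)  # show up to 5 decimal places
print(predictions)        # if the values really are 0.4, 0.6, ... no extra digits will appear
print(predictions[0, 0])  # a single element prints at full float precision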

asked Jun 13 '15 by Clinical


3 Answers

A RandomForestClassifier is a collection of DecisionTreeClassifiers. No matter how big your training set, a decision tree simply returns a decision: one class gets probability 1, the other classes get probability 0.

The RandomForest simply votes among the results. predict_proba() returns the number of votes for each class (each tree in the forest makes its own decision and chooses exactly one class), divided by the number of trees in the forest. Hence, your precision is exactly 1/n_estimators. Want more "precision"? Add more estimators. If you want to see variation at the 5th digit, you will need 10**5 = 100,000 estimators, which is excessive. You normally don't want more than 100 estimators, and often not that many.
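Here is a quick way to see that effect (a minimal sketch on synthetic data; make_classification and the parameter values are illustrative, not from the question):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

for n in (10, 1000):
    clf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X, y)
    # With 10 trees the probabilities move in coarse steps;
    # with 1000 trees they are much finer-grained.
    print(n, np.round(clf.predict_proba(X[:3]), 5))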

answered Sep 28 '22 by Andreus


  1. I get more than one digit in my results; are you sure it is not due to your dataset? (For example, a very small dataset would yield simple decision trees and hence 'simple' probabilities.) Otherwise, it may only be the display that shows one digit; try printing predictions[0,0].

  2. I am not sure I understand what you mean by "the probabilities aren't affected by the size of my data". If your concern is that you don't want to predict, e.g., too many spams, what is usually done is to use a threshold t such that you predict 1 if proba(label==1) > t. This way you can use the threshold to balance your predictions, for example to limit the overall proportion of predicted spams. And to analyse your model globally, one usually computes the area under the curve (AUC) of the receiver operating characteristic (ROC) curve (see the Wikipedia article on ROC curves). Basically, the ROC curve describes how your predictions behave as the threshold t varies; see the sketch after this list.
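A minimal sketch of both ideas, with placeholder data standing in for real inputs (the probas values mirror the question's second column; y_true is hypothetical):

import numpy as np
from sklearn.metrics import roc_auc_score

probas = np.array([0.6, 0.9, 0.8, 0.3, 0.7, 0.7, 0.3, 0.6])  # P(spam), e.g. predict_proba(Xtest)[:, 1]
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 1])                  # hypothetical true labels

t = 0.65                            # decision threshold
y_pred = (probas > t).astype(int)   # raise t to predict spam less often
print(y_pred)

print(roc_auc_score(y_true, probas))  # threshold-free summary of ranking quality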

Hope it helps!

answered Sep 28 '22 by Sebastien


I am afraid the top-voted answer isn't correct (at least for the latest sklearn implementation).

According to the docs, the predicted class probabilities are computed as the mean predicted class probabilities of the trees in the forest, where the class probability of a single tree is the fraction of samples of that class in the leaf.

In other words, since a Random Forest is a collection of decision trees, it predicts the probability of a new sample by averaging over its trees, and a single tree calculates the probability from the distribution of classes within the leaf the sample lands in. Look at this image of a single decision tree to understand what it means to have different classes within a leaf: the right leaf in the second split contains 75% yellow samples, so the predicted probability of class yellow there is 75%.

[Image: a single decision tree whose leaves contain mixtures of classes]

The scenario described in the top-voted answer only occurs when every leaf of every tree contains data points of a single class, i.e. when all leaves are pure.
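This averaging is easy to verify: the forest's predict_proba matches the mean of the individual trees' predict_proba (a minimal sketch on synthetic data; make_classification is just a stand-in dataset):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

forest_proba = rf.predict_proba(X[:5])
tree_proba = np.mean([t.predict_proba(X[:5]) for t in rf.estimators_], axis=0)
print(np.allclose(forest_proba, tree_proba))  # True: the forest averages its trees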

References:

  • https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
  • Image taken from https://www.displayr.com/how-is-splitting-decided-for-decision-trees/

answered Sep 28 '22 by pyronic