How to get classes labels from cross_val_predict used with predict_proba in scikit-learn

Tags: python, scikit-learn, cross-validation

I need to train a Random Forest classifier using 3-fold cross-validation. For each sample, I need to retrieve the prediction probability from the fold in which it falls in the test set.

I am using scikit-learn version 0.18.dev0.

This new version lets cross_val_predict() take an additional parameter, method, which defines the kind of prediction to request from the estimator.

In my case I want to use the predict_proba() method, which returns the probability for each class, in a multiclass scenario.

However, when I run it, the result is a matrix of prediction probabilities, where each row represents a sample and each column represents the predicted probability for a specific class.

The problem is that the method does not indicate which class corresponds to each column.

The value I need is the one held in the classes_ attribute (in my case, of a RandomForestClassifier), which is documented as:

classes_ : array of shape = [n_classes] or a list of such arrays The classes labels (single output problem), or a list of arrays of class labels (multi-output problem).

This attribute is needed to interpret the output of predict_proba(), because its documentation states:

The order of the classes corresponds to that in the attribute classes_.

A minimal example is the following:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

clf = RandomForestClassifier()

X = np.random.randn(10, 10)
y = np.array([1] * 4 + [0] * 3 + [2] * 3)

# how to get classes from here?
# cv=3 is set explicitly to match the 3-fold setup described above
proba = cross_val_predict(estimator=clf, X=X, y=y, cv=3, method="predict_proba")

# using the classifier without cross-validation
# it is possible to get the classes in this way:
clf.fit(X, y)
proba = clf.predict_proba(X)
classes = clf.classes_


asked Aug 31 '16 18:08

gc5

1 Answer

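Yes, the columns will be in sorted order of the class labels. This is because DecisionTreeClassifier (which is the default base_estimator for RandomForestClassifier) builds its classes_ attribute with np.unique, and np.unique returns the sorted unique values of the input array. So, provided every training fold contains all of the classes, column i of the result corresponds to np.unique(y)[i].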
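As a rough illustration of the point above, the sketch below recovers the column labels with np.unique(y) and checks them against classes_ from a separately fitted classifier. The larger dataset (30 samples), n_estimators=10, the fixed random seed, and the name proba_by_class are assumptions made for this example and are not part of the original question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Illustrative data: 30 samples, 10 per class, so every fold sees all classes
rng = np.random.RandomState(0)
X = rng.randn(30, 10)
y = np.array([1] * 10 + [0] * 10 + [2] * 10)

clf = RandomForestClassifier(n_estimators=10, random_state=0)

# Out-of-fold class probabilities, one row per sample, one column per class
proba = cross_val_predict(clf, X, y, cv=3, method="predict_proba")

# np.unique returns the sorted unique labels, the same order used by classes_
classes = np.unique(y)

# Attach a label to each column (proba_by_class is a hypothetical helper name)
proba_by_class = {label: proba[:, i] for i, label in enumerate(classes)}

# Sanity check against a classifier fitted on the full data
clf.fit(X, y)
assert np.array_equal(clf.classes_, classes)
```

With an integer cv and a classifier, scikit-learn uses stratified folds, which is what keeps every class present in each training fold here. On very small or heavily imbalanced data that may not hold, so it is worth checking the shape of the returned matrix against len(np.unique(y)).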


answered Oct 12 '22 08:10

maxymoo
