
How to get class labels from cross_val_predict used with predict_proba in scikit-learn

I need to train a Random Forest classifier using 3-fold cross-validation. For each sample, I need to retrieve the prediction probability obtained when that sample is in the test set.

I am using scikit-learn version 0.18.dev0.

This new version lets cross_val_predict() take an additional parameter, method, which defines the kind of prediction required from the estimator.

In my case I want to use the predict_proba() method, which returns the probability for each class, in a multiclass scenario.

However, when I run the method, the result is a matrix of prediction probabilities in which each row represents a sample and each column represents the prediction probability for a specific class.

The problem is that the method does not indicate which class corresponds to each column.

What I need is the same information that a fitted estimator (in my case a RandomForestClassifier) exposes in its classes_ attribute, documented as:

classes_ : array of shape = [n_classes] or a list of such arrays
    The classes labels (single output problem), or a list of arrays of class labels (multi-output problem).

This attribute is needed to interpret the output of predict_proba(), whose documentation states:

The order of the classes corresponds to that in the attribute classes_.

A minimal example is the following:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

clf = RandomForestClassifier()

X = np.random.randn(10, 10)
y = np.array([1] * 4 + [0] * 3 + [2] * 3)

# how to get classes from here?
proba = cross_val_predict(estimator=clf, X=X, y=y, method="predict_proba")

# using the classifier without cross-validation
# it is possible to get the classes in this way:
clf.fit(X, y)
proba = clf.predict_proba(X)
classes = clf.classes_
gc5 asked Aug 31 '16 18:08


1 Answer

Yes, the columns will be in sorted order of the class labels. DecisionTreeClassifier (the default base_estimator for RandomForestClassifier) builds its classes_ attribute with np.unique, which returns the sorted unique values of the input array.
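
As a minimal sketch of this, assuming the sorted-order behaviour described above holds for the installed scikit-learn version and that every class appears in each training fold, the column labels can be recovered with np.unique(y). The data, sizes, and variable names below are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Illustrative data: 30 samples, 3 classes listed in unsorted order
rng = np.random.RandomState(0)
X = rng.randn(30, 5)
y = np.array([1] * 10 + [0] * 10 + [2] * 10)

clf = RandomForestClassifier(n_estimators=10, random_state=0)

# With an integer cv and a classifier, the folds are stratified,
# so every class is present in every training fold here
proba = cross_val_predict(clf, X, y, cv=3, method="predict_proba")

# Assumption: columns follow the sorted unique labels, as classes_ would
classes = np.unique(y)

# Cross-check against a classifier fitted on the full data
clf.fit(X, y)
same_order = np.array_equal(clf.classes_, classes)

# Map each row's most probable column back to a class label
predicted_labels = classes[np.argmax(proba, axis=1)]
```

Here column j of proba is read as the probability of classes[j]. If a class could be missing from some training fold, this mapping should be checked rather than assumed.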

maxymoo answered Oct 12 '22 08:10