Sklearn - How to predict probability for all target labels

I have a data set with a target variable that can have 7 different labels. Each sample in my training set has only one label for the target variable.

For each sample, I want to calculate the probability for each of the target labels. So my prediction would consist of 7 probabilities for each row.

On the sklearn website I read about multi-label classification, but this doesn't seem to be what I want.

I tried the following code, but this only gives me one classification per sample.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier

clf = OneVsRestClassifier(DecisionTreeClassifier())
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

Does anyone have some advice on this? Thanks!

asked Jul 15 '16 by Bert Carremans



3 Answers

You can do that by simply removing the OneVsRestClassifier and using the predict_proba method of the DecisionTreeClassifier directly:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)

This will give you a probability for each of your 7 possible classes.
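As a minimal, self-contained sketch (synthetic data standing in for the asker's X_train/y_train, with 7 hypothetical labels 0..6), the result has one probability column per class and each row sums to 1:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 100 samples, 5 features, labels 0..6
rng = np.random.RandomState(0)
X_train = rng.rand(100, 5)
y_train = np.tile(np.arange(7), 15)[:100]  # ensures all 7 labels appear
X_test = rng.rand(10, 5)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)

print(pred.shape)        # (10, 7): one probability per class per sample
print(pred.sum(axis=1))  # each row sums to 1
```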

Hope that helps!

answered Oct 18 '22 by Abhinav Arora


You can try using scikit-multilearn - an extension of sklearn that handles multilabel classification. If your labels are not overly correlated you can train one classifier per label and get all predictions - try (after pip install scikit-multilearn):

from sklearn.tree import DecisionTreeClassifier
from skmultilearn.problem_transform import BinaryRelevance

classifier = BinaryRelevance(classifier=DecisionTreeClassifier())

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)

predictions will be a sparse matrix of shape (n_samples, n_labels); in your case n_labels = 7, and each column holds the prediction for one label across all samples.
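The sparse result can be densified with toarray() for inspection. A minimal sketch using a hand-built scipy sparse matrix as a stand-in for the BinaryRelevance output (which is the same sparse type):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in for the (n_samples, n_labels) sparse matrix that
# BinaryRelevance.predict returns; here 3 samples, 7 labels.
predictions = csr_matrix(np.array([
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0],
]))

dense = predictions.toarray()  # plain numpy array, one row per sample
print(dense.shape)             # (3, 7)
print(dense[0])                # label indicator vector for sample 0
```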

In case your labels are correlated you might need more sophisticated methods for multi-label classification.

Disclaimer: I'm the author of scikit-multilearn, feel free to ask more questions.

answered Oct 18 '22 by niedakh


If you insist on using the OneVsRestClassifier, you can also call predict_proba(X_test), as it is supported by OneVsRestClassifier as well.

For example:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier

clf = OneVsRestClassifier(DecisionTreeClassifier())
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)

The order of the labels in the resulting probability columns can be found in:

clf.classes_
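A small sketch (synthetic stand-in data, hypothetical labels 0..6) showing how to pair each probability column with its class label via clf.classes_:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 60 samples, 4 features, labels 0..6
rng = np.random.RandomState(0)
X_train = rng.rand(60, 4)
y_train = np.tile(np.arange(7), 9)[:60]  # all 7 labels present
X_test = rng.rand(2, 4)

clf = OneVsRestClassifier(DecisionTreeClassifier(max_depth=3, random_state=0))
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # shape (2, 7)

# Column i of proba corresponds to clf.classes_[i]
for label, p in zip(clf.classes_, proba[0]):
    print(label, p)
```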
answered Oct 18 '22 by SA1T