I have a data set with a target variable that can have 7 different labels. Each sample in my training set has only one label for the target variable.
For each sample, I want to calculate the probability for each of the target labels. So my prediction would consist of 7 probabilities for each row.
On the sklearn website I read about multi-label classification, but this doesn't seem to be what I want.
I tried the following code, but this only gives me one classification per sample.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier
clf = OneVsRestClassifier(DecisionTreeClassifier())
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
Does anyone have some advice on this? Thanks!
The sklearn predict method predicts an output. scikit-learn provides a set of tools for training and evaluating machine learning models, and it also has tools to predict an output value once the model is trained (for ML techniques that actually make predictions).
The sklearn library has the predict_proba() method. For a binary problem it returns a two-column array, the first column being the probability that the outcome will be 0 and the second being the probability that the outcome will be 1. More generally, it returns one column per class, and each row sums to one.
The predict method is used to predict the actual class, while the predict_proba method can be used to infer the class probabilities (i.e. the probability that a particular data point falls into each of the underlying classes).
model.predict_proba(): For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is the one the model returns as its prediction.
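To make the difference between predict and predict_proba concrete, here is a minimal sketch on a binary problem. The synthetic data from make_classification and the max_depth=3 setting are assumptions for illustration only, not part of the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data, used only as a stand-in for real data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# max_depth=3 is an illustrative choice so that leaves are not all pure.
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

labels = model.predict(X)        # one class label per sample
proba = model.predict_proba(X)   # one probability per class per sample

# The column order of proba follows model.classes_, so the most likely
# label is the class at the index of the largest probability in each row.
most_likely = model.classes_[np.argmax(proba, axis=1)]

print(proba.shape)
print(np.array_equal(labels, most_likely))
```

For this estimator, the label returned by predict should match the class with the highest predict_proba value in each row.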
You can do that by simply removing the OneVsRestClassifier and using the predict_proba method of the DecisionTreeClassifier. You can do the following:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)
This will give you a probability for each of your 7 possible classes.
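If you want to check this end to end, here is a rough, self-contained sketch. The data is synthetic (make_classification with 7 classes), and max_depth=4 is my own illustrative choice, so treat both as assumptions rather than a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the asker's data: 7 classes, one label per sample.
X, y = make_classification(
    n_samples=700, n_features=10, n_informative=5,
    n_classes=7, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# max_depth is an illustrative choice: a fully grown tree usually has pure
# leaves, so its probabilities would typically be only 0.0 or 1.0.
clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)
print(proba.shape)           # (n_test_samples, 7)
print(proba[0])              # the 7 class probabilities for the first row
print(proba.sum(axis=1)[:5]) # each row sums to 1
```

Note that with the default settings a decision tree tends to return hard 0/1 probabilities, so limiting the depth (or using an ensemble such as a random forest) is one way to get smoother estimates.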
Hope that helps!
You can try using scikit-multilearn, an extension of sklearn that handles multilabel classification. If your labels are not overly correlated, you can train one classifier per label and get all predictions. Try the following (after pip install scikit-multilearn):
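from skmultilearn.problem_transform import BinaryRelevance
from sklearn.tree import DecisionTreeClassifier
classifier = BinaryRelevance(classifier = DecisionTreeClassifier())
# train (y_train is expected to be a binary indicator matrix, one column per label)
classifier.fit(X_train, y_train)
# predict
predictions = classifier.predict(X_test)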
predictions will contain a sparse matrix of size (n_samples, n_labels); in your case n_labels = 7. Each column contains the prediction for one label across all samples.
In case your labels are correlated you might need more sophisticated methods for multi-label classification.
Disclaimer: I'm the author of scikit-multilearn, feel free to ask more questions.
If you insist on using the OneVsRestClassifier, then you could also call predict_proba(X_test), as it is supported by OneVsRestClassifier as well.
For example:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier
clf = OneVsRestClassifier(DecisionTreeClassifier())
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)
The order of the labels for which you get the result can be found in:
clf.classes_
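As a small sketch of how classes_ lines up with the probability columns, the example below pairs each column with its label. The synthetic data, the label names "A" to "G", and max_depth=4 are all made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 7-class data; the integer targets are mapped to hypothetical
# string labels "A".."G" so the column order is easy to see.
X, y_int = make_classification(
    n_samples=700, n_features=10, n_informative=5,
    n_classes=7, random_state=0,
)
y = np.array(list("ABCDEFG"))[y_int]

# max_depth=4 is an illustrative choice, not a tuned value.
clf = OneVsRestClassifier(DecisionTreeClassifier(max_depth=4, random_state=0))
clf.fit(X, y)

proba = clf.predict_proba(X[:5])

# Column j of proba corresponds to clf.classes_[j].
first_row = dict(zip(clf.classes_, proba[0]))
print(clf.classes_)
print(first_row)
```

Zipping classes_ with a row of the result gives a label-to-probability mapping, which is usually easier to read than the raw array.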