I am trying to use scikit-learn 0.12.1 to:
Sklearn makes all of this very easy except for one peculiarity. There is no guarantee that every possible label will occur in the data used to fit my classifier. There are hundreds of possible labels and some of them have not occurred in the training data available.
This results in 2 problems:
My question is, what's the best way to force the classifier to recognize the full set of possible classes, even when some of them don't occur in the training data? Obviously it will have trouble learning about labels it has never seen data for, but 0's are perfectly useable in my situation.
Here's a workaround. Make sure you have a list of all classes called all_classes
. Then, if clf
is your LogisticRegression
classifier,
from itertools import repeat
# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)
# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
for row in prob:
prob_per_class = (zip(clf.classes_, prob)
+ zip(classes_not_trained, repeat(0.)))
produces a list of (cls, prob)
pairs.
If what you want is an array like that returned by predict_proba
, but with columns corresponding to sorted all_classes
, how about:
all_classes = numpy.array(sorted(all_classes))
# Get the probabilities for learnt classes
prob = clf.predict_proba(test_samples)
# Create the result matrix, where all values are initially zero
new_prob = numpy.zeros((prob.shape[0], all_classes.size))
# Set the columns corresponding to clf.classes_
new_prob[:, all_classes.searchsorted(clf.classes_)] = prob
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With