I have a program that uses the SVC class from sklearn. Really, I'm using the OneVsRestClassifier class, which wraps the SVC class. My problem is that the predict_proba() method sometimes returns a vector that's too short. This is because the classes_ attribute is missing a class, which happens when a label isn't present during training.
Consider the following example (code shown below). Suppose all possible classes are 1, 2, 3, and 4. Now suppose training data just happens to not contain any data labeled with class 3. This is fine, except when I call predict_proba() I want a vector of length 4. Instead, I get a vector of length 3. That is, predict_proba() returns [p(1) p(2) p(4)], but I want [p(1) p(2) p(3) p(4)], where p(3) = 0.
I guess clf.classes_ is implicitly defined by the labels seen during training, which is incomplete in this case. Is there any way I can explicitly set the possible class labels? I know a simple workaround is to just take the predict_proba() output and manually build the array I want. However, this is inconvenient and might slow my program down quite a bit.
# Python 2.7.6
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
import numpy as np
X_train = [[1], [2], [4]] * 10
y = [1, 2, 4] * 10
X_test = [[1]]
clf = OneVsRestClassifier(SVC(probability=True, kernel="linear"))
clf.fit(X_train, y)
# calling predict_proba() gives: [p(1) p(2) p(4)]
# I want: [p(1) p(2) p(3) p(4)], where p(3) = 0
print clf.predict_proba(X_test)
The workaround I had in mind creates a new list of probabilities, building it one element at a time with repeated append() calls (see code below). This seems like it would be slow compared to having predict_proba() return what I want directly. I don't know whether it will noticeably slow my program because I haven't tried it yet. Regardless, I wanted to know if there was a better way.
def workAround(probs, classes_, all_classes):
    """
    probs: list of probabilities, output of predict_proba (but 1D)
    classes_: clf.classes_
    all_classes: all possible classes; superset of classes_
    """
    all_probs = []
    i = 0  # index into probs and classes_
    for cls in all_classes:
        # Bounds check avoids an IndexError once every trained class has been consumed
        # (e.g. when the missing label comes after all the trained labels).
        if i < len(classes_) and cls == classes_[i]:
            all_probs.append(probs[i])
            i += 1
        else:
            all_probs.append(0.0)  # class was never seen during training
    return np.asarray(all_probs)
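For reference, here's roughly how I'd call it (a hypothetical usage sketch on the toy data above; I haven't benchmarked it):
probs_row = clf.predict_proba(X_test)[0]  # 1D row, e.g. [p(1) p(2) p(4)]
all_probs = workAround(probs_row, clf.classes_, [1, 2, 3, 4])
print(all_probs)  # [p(1) p(2) p(3) p(4)] with p(3) = 0.0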
As said in the comments, scikit-learn provides no way to explicitly set the possible class labels.
I NumPyfied your workaround:
import sklearn
import sklearn.svm
import numpy as np

np.random.seed(3)  # for reproducibility

def predict_proba_ordered(probs, classes_, all_classes):
    """
    probs: list of probabilities, output of predict_proba
    classes_: clf.classes_
    all_classes: all possible classes (superset of classes_)
    """
    proba_ordered = np.zeros((probs.shape[0], all_classes.size), dtype=float)
    sorter = np.argsort(all_classes)  # http://stackoverflow.com/a/32191125/395857
    # Map each trained class in classes_ to its column position in all_classes
    idx = sorter[np.searchsorted(all_classes, classes_, sorter=sorter)]
    proba_ordered[:, idx] = probs
    return proba_ordered
# Prepare the data set
all_classes = np.array([1,2,3,4]) # explicitly set the possible class labels.
X_train = [[1], [2], [4]] * 3
print('X_train: {0}'.format(X_train))
y = [1, 2, 4] * 3 # Label 3 is missing.
print('y: {0}'.format(y))
X_test = [[1], [2], [3]]
print('X_test: {0}'.format(X_test))
# Train
clf = sklearn.svm.SVC(probability=True, kernel="linear")
clf.fit(X_train, y)
print('clf.classes_: {0}'.format(clf.classes_))
# Predict
probs = clf.predict_proba(X_test)  # As label 3 isn't in the training set, probs has 3 columns, not 4
proba_ordered = predict_proba_ordered(probs, clf.classes_, all_classes)
print('proba_ordered: {0}'.format(proba_ordered))
Output:
X_train: [[1], [2], [4], [1], [2], [4], [1], [2], [4]]
y: [1, 2, 4, 1, 2, 4, 1, 2, 4]
X_test: [[1], [2], [3]]
clf.classes_: [1 2 4]
proba_ordered: [[ 0.81499201  0.08640176  0.          0.09860622]
 [ 0.21105955  0.63893181  0.          0.15000863]
 [ 0.08965731  0.49640147  0.          0.41394122]]
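As an optional sanity check (a minimal sketch), each row of proba_ordered should still sum to 1 and the column for the unseen class 3 should be all zeros:
assert np.allclose(proba_ordered.sum(axis=1), 1.0)  # probabilities still sum to 1 per sample
assert np.all(proba_ordered[:, np.where(all_classes == 3)[0]] == 0.0)  # unseen class gets probability 0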
Note that you can explicitly set the possible class labels in sklearn.metrics (e.g. sklearn.metrics.f1_score) using the labels parameter:
labels : array
    Integer array of labels.
Example:
# Score
y_pred = clf.predict(X_test)
y_true = np.array([1,2,3])
precision = sklearn.metrics.precision_score(y_true, y_pred, labels=all_classes, average=None)
print('precision: {0}'.format(precision))
recall = sklearn.metrics.recall_score(y_true, y_pred, labels=all_classes, average=None)
print('recall: {0}'.format(recall))
f1_score = sklearn.metrics.f1_score(y_true, y_pred, labels=all_classes, average=None)
print('f1_score: {0}'.format(f1_score))
Note that, as of now, you will run into an issue if you try to use sklearn.metrics.roc_auc_score() when no positive example is present in the ground truth for a given label.
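One way around that (a minimal sketch, assuming you binarize the ground truth with sklearn.preprocessing.label_binarize and reuse proba_ordered from above) is to only score the labels that actually have a positive example in y_true:
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score
y_true_bin = label_binarize(y_true, classes=all_classes)  # shape: (n_samples, n_classes)
auc_per_class = {}
for i, cls in enumerate(all_classes):
    if y_true_bin[:, i].sum() > 0:  # skip labels with no positive ground-truth example
        auc_per_class[cls] = roc_auc_score(y_true_bin[:, i], proba_ordered[:, i])
print('ROC AUC per class: {0}'.format(auc_per_class))
In this toy example, class 3 is scored (it appears in y_true) while class 4 is skipped, since roc_auc_score raises an error for a label with no positive ground-truth sample.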