
Can I explicitly set the list of possible classes for an sklearn SVM?

I have a program that uses the SVC class from sklearn. More precisely, I'm using the OneVsRestClassifier class, which wraps the SVC class. My problem is that the predict_proba() method sometimes returns a vector that's too short. This is because the classes_ attribute is missing a class, which happens when a label isn't present during training.

Consider the following example (code shown below). Suppose all possible classes are 1, 2, 3, and 4. Now suppose training data just happens to not contain any data labeled with class 3. This is fine, except when I call predict_proba() I want a vector of length 4. Instead, I get a vector of length 3. That is, predict_proba() returns [p(1) p(2) p(4)], but I want [p(1) p(2) p(3) p(4)], where p(3) = 0.

I guess clf.classes_ is implicitly defined by the labels seen during training, which is incomplete in this case. Is there any way I can explicitly set the possible class labels? I know a simple workaround is to take the predict_proba() output and manually build the array I want. However, this is inconvenient and might slow my program down quite a bit.

# Python 2.7.6

from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
import numpy as np

X_train = [[1], [2], [4]] * 10
y = [1, 2, 4] * 10
X_test = [[1]]

clf = OneVsRestClassifier(SVC(probability=True, kernel="linear"))
clf.fit(X_train, y)

# calling predict_proba() gives: [p(1) p(2) p(4)]
# I want: [p(1) p(2) p(3) p(4)], where p(3) = 0
print(clf.predict_proba(X_test))

The workaround I had in mind creates a new list of probabilities and builds it one element at a time with multiple append() calls (see code below). This seems like it would be slow compared to having predict_proba() return what I want automatically. I haven't tried it yet, so I don't know whether it will significantly slow my program. Regardless, I wanted to know if there is a better way.

def workAround(probs, classes_, all_classes):
    """
    probs: list of probabilities, output of predict_proba (but 1D)
    classes_: clf.classes_
    all_classes: all possible classes; superset of classes_
    """
    all_probs = []
    i = 0  # index into probs and classes_

    for cls in all_classes:
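        # Guard against running past the end of classes_ when the
        # trailing entries of all_classes were never seen in training.
        if i < len(classes_) and cls == classes_[i]: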
            all_probs.append(probs[i])
            i += 1
        else:
            all_probs.append(0.0)

    return np.asarray(all_probs)
Josh Kelle asked Oct 19 '22 12:10


1 Answer

As said in the comments, scikit-learn provides no way to explicitly set the possible class labels.

I NumPyfied your workaround:

import sklearn
import sklearn.metrics  # needed by the scoring example further down
import sklearn.svm
import numpy as np
np.random.seed(3) # for reproducibility

def predict_proba_ordered(probs, classes_, all_classes):
    """
    probs: 2D array of probabilities, output of predict_proba (one row per sample)
    classes_: clf.classes_
    all_classes: all possible classes (superset of classes_)
    """
    # np.float was removed in recent NumPy releases; the builtin float is equivalent.
    proba_ordered = np.zeros((probs.shape[0], all_classes.size), dtype=float)
    sorter = np.argsort(all_classes) # http://stackoverflow.com/a/32191125/395857
    idx = sorter[np.searchsorted(all_classes, classes_, sorter=sorter)]
    proba_ordered[:, idx] = probs
    return proba_ordered

# Prepare the data set
all_classes = np.array([1,2,3,4]) # explicitly set the possible class labels.
X_train = [[1], [2], [4]] * 3
print('X_train: {0}'.format(X_train))
y = [1, 2, 4] * 3 # Label 3 is missing.
print('y: {0}'.format(y))
X_test = [[1], [2], [3]]
print('X_test: {0}'.format(X_test))

# Train
clf = sklearn.svm.SVC(probability=True, kernel="linear")
clf.fit(X_train, y)
print('clf.classes_: {0}'.format(clf.classes_))

# Predict
probs = clf.predict_proba(X_test)  # Label 3 isn't in the training set, so probs has 3 columns, not 4
proba_ordered = predict_proba_ordered(probs, clf.classes_, all_classes)
print('proba_ordered: {0}'.format(proba_ordered))

Output:

X_train: [[1], [2], [4], [1], [2], [4], [1], [2], [4]]
y: [1, 2, 4, 1, 2, 4, 1, 2, 4]
X_test: [[1], [2], [3]]
clf.classes_: [1 2 4]
proba_ordered: [[ 0.81499201  0.08640176  0.          0.09860622]
                [ 0.21105955  0.63893181  0.          0.15000863]
                [ 0.08965731  0.49640147  0.          0.41394122]]
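
If the class labels are not sortable, or you simply prefer something easier to read than the argsort/searchsorted combination, the same column placement can be sketched with a plain dictionary lookup. This is only an illustrative alternative under my own assumptions: the helper name expand_proba and the probability values below are made up for the example, and it assumes every label in seen_classes also appears in all_classes.

```python
import numpy as np

def expand_proba(probs, seen_classes, all_classes):
    """Place each column of probs at the position of its label in all_classes.

    Hypothetical helper for illustration; assumes seen_classes is a
    subset of all_classes. Columns for unseen labels stay at 0.
    """
    probs = np.asarray(probs, dtype=float)
    # Map every possible label to its target column index.
    column_of = {label: j for j, label in enumerate(all_classes)}
    target_cols = [column_of[label] for label in seen_classes]
    expanded = np.zeros((probs.shape[0], len(all_classes)), dtype=float)
    expanded[:, target_cols] = probs
    return expanded

# Made-up probabilities for two samples over the seen classes [1, 2, 4].
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
expanded = expand_proba(probs, seen_classes=[1, 2, 4], all_classes=[1, 2, 3, 4])
print(expanded)
```

Since only zero columns are added, each row still sums to 1, which is a cheap sanity check to run after the expansion.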

Note that you can explicitly set the possible class labels in sklearn.metrics functions (e.g. sklearn.metrics.f1_score) using the labels parameter:

labels : array
Integer array of labels.

Example:

# Score
y_pred = clf.predict(X_test)
y_true = np.array([1,2,3])
precision = sklearn.metrics.precision_score(y_true, y_pred, labels=all_classes, average=None)
print('precision: {0}'.format(precision))
recall = sklearn.metrics.recall_score(y_true, y_pred, labels=all_classes, average=None)
print('recall: {0}'.format(recall))
f1_score = sklearn.metrics.f1_score(y_true, y_pred, labels=all_classes, average=None)
print('f1_score: {0}'.format(f1_score))

Note that, as of now, you'll run into an issue if you try to use sklearn.metrics.roc_auc_score() when the ground truth contains no positive example for a given label.
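
One possible way around this, sketched under my own assumptions rather than taken from scikit-learn itself, is to compute a one-vs-rest AUC per label and skip any label for which the ground truth has no positives (or no negatives). The helper name per_label_auc and the probability values are hypothetical; proba is assumed to have one column per entry of all_classes, for example the output of predict_proba_ordered above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_label_auc(y_true, proba, all_classes):
    """One-vs-rest AUC per label.

    Hypothetical helper: returns None for a label when y_true has no
    positive (or no negative) example of it, since AUC is undefined there.
    """
    y_true = np.asarray(y_true)
    scores = {}
    for j, label in enumerate(all_classes):
        is_positive = (y_true == label)
        if is_positive.all() or not is_positive.any():
            scores[label] = None  # skip instead of calling roc_auc_score
            continue
        scores[label] = roc_auc_score(is_positive.astype(int), proba[:, j])
    return scores

# Made-up ground truth and probabilities; label 4 never occurs in y_true.
y_true = np.array([1, 2, 3])
proba = np.array([[0.8, 0.1, 0.0, 0.1],
                  [0.2, 0.6, 0.0, 0.2],
                  [0.1, 0.5, 0.0, 0.4]])
scores = per_label_auc(y_true, proba, all_classes=[1, 2, 3, 4])
print(scores)
```

Returning None rather than 0 keeps undefined scores distinguishable from genuinely poor ones when you aggregate the results later.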

Franck Dernoncourt answered Oct 30 '22 02:10