I'm doing binary classification on imbalanced data, and I've used SVM class weights to try to mitigate the imbalance. As you can see, I've calculated and plotted the ROC curve for each class and got the following plot. It looks like the two curves sum up to one, and I'm not sure whether I'm doing the right thing, because this is the first time I've drawn my own ROC curve. I'm using scikit-learn for the plot. Is it right to plot each class on its own, and is the classifier failing to classify the blue class?
This is the code I used to get the plot:
from sklearn import metrics
import matplotlib.pyplot as plt

y_pred = clf.predict_proba(X_test)[:, 0]   # probability of the first class
y_pred2 = clf.predict_proba(X_test)[:, 1]  # probability of the second class

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)
auc = metrics.auc(fpr, tpr)
print("auc for the first class", auc)

fpr2, tpr2, thresholds2 = metrics.roc_curve(y_test, y_pred2)
auc2 = metrics.auc(fpr2, tpr2)
print("auc for the second class", auc2)

# plotting the ROC curves
plt.plot(fpr, tpr, label='first class')
plt.plot(fpr2, tpr2, label='second class')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc="lower right")
plt.show()
I know there is a better way to write this, with a dictionary for example, but I was just trying to see the curves first.
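For reference, here is a minimal sketch of that dictionary-based version. The classifier setup (an SVC with class_weight='balanced' and probability=True) and the training variables X_train/y_train are assumptions, since they are not shown in the question:

from sklearn import metrics
from sklearn.svm import SVC
import matplotlib.pyplot as plt

# Assumed setup: class_weight='balanced' addresses the imbalance,
# probability=True makes predict_proba available.
clf = SVC(class_weight='balanced', probability=True).fit(X_train, y_train)

proba = clf.predict_proba(X_test)
fpr, tpr, roc_auc = {}, {}, {}
for i, cls in enumerate(clf.classes_):
    # pos_label tells roc_curve which class counts as "positive"
    fpr[i], tpr[i], _ = metrics.roc_curve(y_test, proba[:, i], pos_label=cls)
    roc_auc[i] = metrics.auc(fpr[i], tpr[i])
    plt.plot(fpr[i], tpr[i], label='class %s (AUC = %0.2f)' % (cls, roc_auc[i]))
plt.legend(loc='lower right')
plt.show()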
See the Wikipedia entry for all your ROC curve needs :)
predict_proba returns class probabilities for each class: the first column contains the probability of the first class and the second column contains the probability of the second class. Note that the two curves are rotated versions of each other. That is because the class probabilities add up to 1.
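A quick way to verify this, reusing clf and X_test from the question:

import numpy as np

proba = clf.predict_proba(X_test)
# Each row sums to 1, so column 0 is exactly 1 - column 1 ...
assert np.allclose(proba.sum(axis=1), 1.0)
# ... which is why the ROC curve built from column 0 is the column-1
# curve with every point (fpr, tpr) mapped to (1 - fpr, 1 - tpr),
# i.e. a 180-degree rotation about (0.5, 0.5).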
The documentation of roc_curve states that the second parameter must contain "Target scores, can either be probability estimates of the positive class or confidence values." This means you have to pass the probabilities that correspond to class 1, which is most likely the second column.
You get the blue curve because you passed the probabilities of the wrong class (first column). Only the green curve is correct.
It does not make sense to compute ROC curves for each class, because the ROC curve describes the ability of the classifier to distinguish two classes. You have only one curve per classifier.
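In code, the fix looks like this (assuming the positive class is labeled 1, so its probabilities sit in the second column):

from sklearn import metrics
import matplotlib.pyplot as plt

# Probability of the positive class; for labels {0, 1} this is column 1.
y_score = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_score)
roc_auc = metrics.auc(fpr, tpr)

plt.plot(fpr, tpr, label='ROC (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], linestyle='--', label='chance')  # diagonal reference
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc='lower right')
plt.show()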
The specific problem is a coding mistake.
predict_proba returns class probabilities (1 if a sample certainly belongs to the class, 0 if it definitely does not; usually something in between).
metrics.roc_curve(y_test, y_pred) now compares class labels against probabilities, which is like comparing pears against apple juice.
You should use predict instead of predict_proba to get class labels rather than probabilities. These can be compared against the true class labels for computing the ROC curve. Incidentally, this also removes the option of plotting a second curve: you only get one curve for the classifier, not one for each class.
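For completeness, a sketch of what this answer suggests. Note that with hard 0/1 predictions roc_curve has only one real threshold to sweep, so the resulting "curve" consists of just three points:

from sklearn import metrics

# Hard class labels instead of probabilities, as suggested above.
y_pred_labels = clf.predict(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_labels)
# fpr/tpr each have length 3: the points (0, 0), (FPR, TPR), (1, 1).
print("AUC:", metrics.auc(fpr, tpr))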