I am using the roc_auc_score function from scikit-learn to evaluate my model's performance. However, I get different values depending on whether I use predict() or predict_proba():
from sklearn.metrics import roc_curve, auc, roc_auc_score

p_pred = forest.predict_proba(x_test)
y_test_predicted = forest.predict(x_test)

fpr, tpr, _ = roc_curve(y_test, p_pred[:, 1])
roc_auc = auc(fpr, tpr)

roc_auc_score(y_test, y_test_predicted)  # = 0.68
roc_auc_score(y_test, p_pred[:, 1])      # = 0.93
Could you advise on that, please?
Thanks in advance
First, look at the difference between predict and predict_proba. The former predicts the class for each sample, whereas the latter predicts the probabilities of the various classes.
You are seeing the effect of the rounding that is implicit in the binary format of y_test_predicted: y_test_predicted consists of 1's and 0's, whereas p_pred consists of floating-point values between 0 and 1. The roc_auc_score routine sweeps the decision threshold over the scores it is given and computes the true positive rate and false positive rate at each threshold, so feeding it hard labels instead of probabilities produces quite a different score.
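To see this concretely: for a binary classifier, predict() is in effect predict_proba() thresholded at 0.5. A minimal sketch using the names from your question (assuming forest is already fitted):

import numpy as np

proba = forest.predict_proba(x_test)[:, 1]   # positive-class probabilities, floats in [0, 1]
labels = forest.predict(x_test)              # hard 0/1 class predictions

# For most scikit-learn binary classifiers the hard labels are equivalent to
# thresholding the positive-class probability at 0.5 (up to tie-breaking):
print(np.array_equal(labels, (proba > 0.5).astype(int)))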
Consider the case where:
y_test = [ 1, 0, 0, 1, 0, 1, 1]
p_pred = [.6,.4,.6,.9,.2,.7,.4]
y_test_predicted = [ 1, 0, 1, 1, 0, 1, 0]
Note that the ROC curve is generated by considering all cutoff thresholds. Now consider a threshold of 0.65...
The p_pred case gives:
TPR = 0.5, FPR = 0,
and the y_test_predicted case gives:
TPR = 0.75, FPR ≈ 0.33 (1 false positive out of 3 negatives).
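As a quick sanity check, you can recompute those rates with a few lines of NumPy:

import numpy as np

y_test           = np.array([1, 0, 0, 1, 0, 1, 1])
p_pred           = np.array([.6, .4, .6, .9, .2, .7, .4])
y_test_predicted = np.array([1, 0, 1, 1, 0, 1, 0])

def rates(y_true, scores, threshold):
    # Predict positive wherever the score exceeds the threshold,
    # then compute TPR = TP / P and FPR = FP / N.
    pred = scores > threshold
    tpr = np.sum(pred & (y_true == 1)) / np.sum(y_true == 1)
    fpr = np.sum(pred & (y_true == 0)) / np.sum(y_true == 0)
    return float(tpr), float(fpr)

print(rates(y_test, p_pred, 0.65))            # (0.5, 0.0)
print(rates(y_test, y_test_predicted, 0.65))  # (0.75, 0.333...)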
You can probably see that if these two points are different, then the area under the two curves will be quite different too.
But to really see what is going on, I suggest plotting the ROC curves themselves.
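For example, something along these lines will draw both curves for the toy arrays above (matplotlib assumed; the AUC values in the comments are approximate):

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

y_test           = [1, 0, 0, 1, 0, 1, 1]
p_pred           = [.6, .4, .6, .9, .2, .7, .4]
y_test_predicted = [1, 0, 1, 1, 0, 1, 0]

for name, scores in [("probabilities", p_pred), ("hard labels", y_test_predicted)]:
    fpr, tpr, _ = roc_curve(y_test, scores)
    # The probability-based curve has several intermediate points (one per useful
    # threshold); the hard-label curve has only one point between (0, 0) and (1, 1).
    plt.plot(fpr, tpr, marker="o",
             label=f"{name}, AUC = {roc_auc_score(y_test, scores):.2f}")
    # probabilities -> AUC ~ 0.83, hard labels -> AUC ~ 0.71

plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()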
Hope this helps!