I am working on a multiclass, highly imbalanced classification problem, using a random forest as the base classifier.
I have to report model performance on the evaluation set using several metrics: precision, recall, the confusion matrix, and roc_auc.
Model training:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

rf = RandomForestClassifier()
rf.fit(train_X, train_y)
To obtain precision, recall, and the confusion matrix, I do:
pred = rf.predict(test_X)
precision = metrics.precision_score(test_y, pred, average='macro')  # an averaging strategy is required for multiclass
recall = metrics.recall_score(test_y, pred, average='macro')
f1_score = metrics.f1_score(test_y, pred, average='macro')
confusion_matrix = metrics.confusion_matrix(test_y, pred)
That is fine, but computing roc_auc requires the predicted class probabilities rather than the class labels, so I additionally have to do:
y_prob = rf.predict_proba(test_X)
roc_auc = metrics.roc_auc_score(test_y, y_prob, multi_class='ovr')  # multi_class must be set for multiclass targets
But I'm worried that the output produced by rf.predict() may not be consistent with the output of rf.predict_proba(), and therefore with the roc_auc score I'm reporting. I know that calling predict() several times produces exactly the same result, but I'm concerned that calling predict() and then predict_proba() might produce slightly different results, which would make it inappropriate to report the roc_auc alongside the metrics above.
If that is the case, is there a way to control this, i.e. to make sure the class probabilities that predict() uses to decide the predicted labels are exactly the same ones returned when I then call predict_proba()?
predict_proba() and predict() are consistent with each other. In fact, predict() uses predict_proba() internally: it simply returns the class with the highest predicted probability, as can be seen in the scikit-learn source code.
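If you want to verify this on your own data, you can check that the labels from predict() equal the argmax of predict_proba() mapped back through classes_. A minimal sketch, using a synthetic imbalanced dataset as a stand-in for your train_X/train_y/test_X:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative 3-class imbalanced data; substitute your own splits here.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.8, 0.15, 0.05], random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(train_X, train_y)

pred = rf.predict(test_X)
y_prob = rf.predict_proba(test_X)

# predict() returns the class (from rf.classes_) with the highest probability
# in predict_proba(), so the two are consistent by construction.
pred_from_proba = rf.classes_[np.argmax(y_prob, axis=1)]
print(np.array_equal(pred, pred_from_proba))  # True

Because the labels are derived from the same probability matrix, the roc_auc you compute from predict_proba() is fully consistent with the precision, recall, F1, and confusion matrix you compute from predict().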