I am using stratified 10-fold cross-validation to find the model that predicts y (binary outcome) from X (X has 34 labels) with the highest AUC. I set up GridSearchCV:
log_reg = LogisticRegression()
parameter_grid = {'penalty': ['l1', 'l2'], 'C': np.arange(0.1, 3, 0.1)}
cross_validation = StratifiedKFold(n_splits=10, shuffle=True, random_state=100)
grid_search = GridSearchCV(log_reg, param_grid=parameter_grid, scoring='roc_auc',
                           cv=cross_validation)
And then I do the cross-validation:
grid_search.fit(X, y)
y_pr = grid_search.predict(X)
I do not understand why grid_search.score(X, y) and roc_auc_score(y, y_pr) give different results (the former is 0.74 and the latter is 0.63). Why don't these two calls do the same thing in my case?
AUC means area under the curve, so to talk about the ROC AUC score we first need to define the ROC curve. It is a chart that visualizes the trade-off between the true positive rate (TPR) and the false positive rate (FPR): for every threshold, we calculate TPR and FPR and plot the resulting point on one chart.
The area under the curve (AUC) measures the ability of a classifier to distinguish between classes and serves as a summary of the ROC curve. The higher the AUC, the better the model is at separating the positive and negative classes.
In Python, the AUC of the ROC curve can be calculated with the roc_auc_score() function. Like the roc_curve() function, it takes both the true outcomes (0, 1) and the predicted probabilities (or scores) for the positive class.
In other words, roc_auc_score is defined as the area under the ROC curve, the curve that plots the false positive rate on the x-axis against the true positive rate on the y-axis across all classification thresholds. Because it sweeps over thresholds, it needs continuous scores rather than hard class labels.
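As a quick illustration (a minimal sketch on made-up toy data, not the asker's dataset), roc_auc_score returns different values depending on whether it is given continuous scores or hard 0/1 labels:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])                # true binary outcomes
y_score = np.array([0.2, 0.4, 0.35, 0.8, 0.1, 0.7])  # predicted probabilities for class 1
y_label = (y_score >= 0.5).astype(int)                # hard predictions at a 0.5 threshold

print(roc_auc_score(y_true, y_score))  # AUC computed over all thresholds
print(roc_auc_score(y_true, y_label))  # AUC of a single fixed-threshold prediction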
This is due to how the roc_auc scorer is initialized when it is used inside GridSearchCV. Look at how the scorer is defined in the scikit-learn source code:
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True,
                             needs_threshold=True)
Observe the third parameter, needs_threshold. When it is True, the scorer requires continuous values for y_pred, such as probabilities or confidence scores, which GridSearchCV obtains from log_reg.decision_function().
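To make this concrete, here is a minimal sketch (reusing grid_search, X, and y from the question, and assuming grid_search has already been fitted) showing that the built-in 'roc_auc' scorer works on continuous scores, so grid_search.score(X, y) matches roc_auc_score applied to the decision_function output:

from sklearn.metrics import get_scorer, roc_auc_score

# The built-in 'roc_auc' scorer is a callable that takes the fitted estimator
# and pulls continuous scores from decision_function() (or predict_proba()).
roc_auc_scorer = get_scorer('roc_auc')

# These three lines compute the same quantity in three ways:
grid_search.score(X, y)
roc_auc_scorer(grid_search.best_estimator_, X, y)
roc_auc_score(y, grid_search.decision_function(X))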
When you explicitly call roc_auc_score with y_pr, you are using .predict(), which outputs the predicted class labels of the data rather than probabilities or scores. That accounts for the difference.
Try:

y_pr = grid_search.decision_function(X)
roc_auc_score(y, y_pr)
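Equivalently (an alternative not in the original suggestion, but valid because ROC AUC depends only on the ranking of the scores), you could pass the predicted probabilities of the positive class:

y_proba = grid_search.predict_proba(X)[:, 1]  # probability of class 1
roc_auc_score(y, y_proba)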
If you still do not get the same result, please update the question with the complete code and some sample data.