 

Why, when I use GridSearchCV with roc_auc scoring, is the score different for grid_search.score(X, y) and roc_auc_score(y, y_predict)?

I am using stratified 10-fold cross-validation to find the model that predicts y (a binary outcome) from X (which has 34 features) with the highest AUC. I set up GridSearchCV like this:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

log_reg = LogisticRegression(solver='liblinear')  # liblinear handles both the l1 and l2 penalties
parameter_grid = {'penalty': ['l1', 'l2'], 'C': np.arange(0.1, 3, 0.1)}
cross_validation = StratifiedKFold(n_splits=10, shuffle=True, random_state=100)
grid_search = GridSearchCV(log_reg, param_grid=parameter_grid, scoring='roc_auc',
                           cv=cross_validation)

And then I run the cross-validation and predict:

grid_search.fit(X, y)
y_pr = grid_search.predict(X)

What I do not understand is why grid_search.score(X, y) and roc_auc_score(y, y_pr) give different results (the former is 0.74 and the latter is 0.63). Why don't these two calls do the same thing in my case?

asked Mar 02 '18 by huda95x

People also ask

What is the ROC AUC score?

AUC means area under the curve, so to talk about the ROC AUC score we first need to define the ROC curve. It is a chart that visualizes the trade-off between the true positive rate (TPR) and the false positive rate (FPR): for every threshold we calculate the TPR and FPR and plot them as one point on the chart.
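As a rough sketch (the labels and scores below are toy values invented for illustration), scikit-learn's roc_curve shows how each threshold produces one (FPR, TPR) point on that chart:

import numpy as np
from sklearn.metrics import roc_curve

# Toy ground-truth labels and continuous scores (e.g. predicted probabilities)
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.7])

# roc_curve sweeps the thresholds and returns the (FPR, TPR) pairs forming the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for t, f, r in zip(thresholds, fpr, tpr):
    print(f"threshold={t:.2f}  FPR={f:.2f}  TPR={r:.2f}")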

How do you interpret ROC AUC scores?

The Area Under the Curve (AUC) measures a classifier's ability to distinguish between classes and is used as a summary of the ROC curve. The higher the AUC, the better the model is at distinguishing between the positive and negative classes.

How do you calculate AUC score in Python?

The AUC for the ROC curve can be calculated with the roc_auc_score() function. Like the roc_curve() function, it takes both the true outcomes (0, 1) from the test set and the predicted probabilities for the 1 class.
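For example (a toy dataset, not the data from the question), the important part is passing the class-1 probabilities rather than the predicted labels:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Toy dataset: one feature, binary target
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)
y_proba = clf.predict_proba(X_toy)[:, 1]   # probability of the positive class, not hard labels
print(roc_auc_score(y_toy, y_proba))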

What is roc_auc_score in Python?

roc_auc_score is defined as the area under the ROC curve, which plots the False Positive Rate on the x-axis against the True Positive Rate on the y-axis across all classification thresholds. FPR and TPR cannot be calculated for regression outputs, so this metric does not apply there.


1 Answer

This is due to how the roc_auc scorer is constructed when it is used inside GridSearchCV.

Look at the scorer definition in the scikit-learn source code:

roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True,
                             needs_threshold=True)

Note the third parameter, needs_threshold. When it is true, the scorer requires continuous values for y_pred, such as probabilities or confidence scores, which GridSearchCV obtains from log_reg.decision_function().

When you call roc_auc_score with y_pr explicitly, you are passing the output of .predict(), which returns the predicted class labels rather than probabilities. That accounts for the difference.

Try:

y_pr = grid_search.decision_function(X)
roc_auc_score(y, y_pr)
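
Equivalently (a sketch that reuses the grid_search, X and y from the question), you can score with the probability of the positive class; for logistic regression it is a monotonic transform of the decision function, so the AUC comes out the same:

# Probabilities of the positive class rank the samples the same way as the
# decision function, so the resulting AUC is identical
y_proba = grid_search.predict_proba(X)[:, 1]
roc_auc_score(y, y_proba)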

If the results are still not the same, please update the question with the complete code and some sample data.

answered Sep 27 '22 by Vivek Kumar