I'm using the sklearn package to build a logistic regression model and then evaluate it. Specifically, I want to do so using cross validation, but can't figure out the right way to do it with the cross_val_score function.
According to the documentation and some examples I saw, I need to pass the function the model, the features, the outcome, and a scoring method. However, AUC doesn't need hard predictions, it needs probabilities, so that it can try different threshold values and calculate the ROC curve from them. So what's the right approach here? This function has 'roc_auc' as a possible scoring method, so I'm assuming it's compatible with it; I'm just not sure about the right way to use it. Sample code snippet below.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # the old sklearn.cross_validation module was removed in newer versions

features = ['a', 'b', 'c']
outcome = 'd'

X = df[features]
y = df[outcome]  # a 1-D Series of true class labels

crossval_scores = cross_val_score(LogisticRegression(), X, y, scoring='roc_auc', cv=10)
Basically, I don't understand why I need to pass y to my cross_val_score function here, instead of probabilities calculated using X in a logistic regression model. Does it just do that part on its own?
roc_auc_score is defined as the area under the ROC curve, which plots the False Positive Rate on the x-axis against the True Positive Rate on the y-axis across all classification thresholds. FPR and TPR can't be computed for a plain regression model with a continuous outcome; they only make sense for classifiers that produce class scores or probabilities.
The Area Under the ROC curve (AUC) is an aggregated metric that evaluates how well a logistic regression model separates positive and negative outcomes at all possible cutoffs. It ranges from 0 to 1, where 0.5 corresponds to random guessing and 1 to perfect separation, so larger is better.
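For intuition, here is a minimal sketch (with made-up labels and probabilities, not taken from the question) showing that roc_auc_score is computed from probability scores rather than from hard class predictions:
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                 # true class labels
y_proba = [0.1, 0.4, 0.8, 0.65, 0.3, 0.9]   # predicted probabilities for the positive class

# Scores every possible threshold at once; here it is 1.0 because
# every positive example outranks every negative one.
print(roc_auc_score(y_true, y_proba))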
All supervised learning methods (including logistic regression) need the true y values to fit a model. After fitting a model, we generally want to make predictions with it and then score those predictions against the true values.
cross_val_score gives you cross-validated scores of a model's predictions. But to score the predictions it first needs to make the predictions, and to make the predictions it first needs to fit the model, which requires both X and (true) y.
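As a rough sketch (reusing X and y from your snippet and assuming the outcome 'd' is binary), the call is approximately equivalent to this manual loop, which is why y has to be passed in:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    model = LogisticRegression().fit(X.iloc[train_idx], y.iloc[train_idx])  # needs true y to fit
    proba = model.predict_proba(X.iloc[test_idx])[:, 1]                     # probabilities for the held-out fold
    scores.append(roc_auc_score(y.iloc[test_idx], proba))                   # needs true y again to score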
cross_val_score, as you note, accepts different scoring metrics. So if you chose 'f1', for example, the model predictions generated during cross_val_score would be class predictions (from the model's predict() method). And if you chose 'roc_auc' as your metric, the model predictions used to score the model would be probability predictions (from the model's predict_proba() method).
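A small sketch of that difference (model, X_test, and y_test here are placeholder names for an already-fitted binary classifier and a held-out split, not objects from the question):
from sklearn.metrics import f1_score, roc_auc_score

class_preds = model.predict(X_test)              # hard 0/1 labels, the kind of output scored by 'f1'
proba_preds = model.predict_proba(X_test)[:, 1]  # positive-class probabilities, the kind scored by 'roc_auc'

f1 = f1_score(y_test, class_preds)
auc = roc_auc_score(y_test, proba_preds)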