I am trying to estimate a logistic regression using 10-fold cross-validation.
```
#import libraries
library(car); library(caret); library(e1071); library(verification)

#data import and preparation
data(Chile)
chile <- na.omit(Chile)                                  #remove NA's
chile <- chile[chile$vote == "Y" | chile$vote == "N", ]  #only "Y" and "N" required
chile$vote <- factor(chile$vote)                         #required to remove unwanted levels
chile$income <- factor(chile$income)                     #treat income as a factor
```
The goal is to estimate a glm model that predicts the outcome of vote ("Y" or "N") depending on relevant explanatory variables and, based on the final model, to compute a confusion matrix and ROC curve to grasp the model's behaviour for different threshold levels.
Model selection leads to:
```
res.chileIII <- glm(vote ~ sex + education + statusquo, family = binomial(), data = chile)

#prediction
chile.pred <- predict.glm(res.chileIII, type = "response")
```
which generates:
```
> head(chile.pred)
          1           2           3           4           5           6 
0.974317861 0.008376988 0.992720134 0.095014139 0.040348115 0.090947144 
```
To compare the observed values with the estimates:
```
chile.v <- ifelse(chile$vote == "Y", 1, 0)              #to compare the two arrays
chile.predt <- function(t) ifelse(chile.pred > t, 1, 0) #t is the threshold for which the confusion matrix shall be computed
```
The confusion matrix for t = 0.3:
```
> confusionMatrix(chile.predt(0.3), chile.v)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 773  44
         1  94 792

               Accuracy : 0.919          
                 95% CI : (0.905, 0.9315)
    No Information Rate : 0.5091         
    P-Value [Acc > NIR] : < 2.2e-16      
```
and the ROC curve:
```
roc.plot(chile.v, chile.pred)
```
which seems like a reasonable model.
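As an aside, a quick way to see the model's behaviour over a whole range of threshold levels (the goal stated above) is to sweep t over a grid with the chile.predt() helper defined earlier. A minimal sketch:

```
#in-sample accuracy for a grid of thresholds, using chile.predt() from above
thresholds <- seq(0.1, 0.9, by = 0.1)
sapply(thresholds, function(t) mean(chile.predt(t) == chile.v))
```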
Now, instead of using the "normal" predict.glm() function, I want to test the performance difference against a 10-fold cross-validation estimate.
```
tc <- trainControl("cv", 10, savePredictions = TRUE)  #"cv" = cross-validation, 10-fold
fit <- train(chile$vote ~ chile$sex + chile$education + chile$statusquo,
             data = chile, method = "glm", family = binomial, trControl = tc)

> summary(fit)$coef
                      Estimate Std. Error   z value      Pr(>|z|)
(Intercept)          1.0152702  0.1889646  5.372805  7.752101e-08
`chile$sexM`        -0.5742442  0.2022308 -2.839549  4.517738e-03
`chile$educationPS` -1.1074079  0.2914253 -3.799971  1.447128e-04
`chile$educationS`  -0.6827546  0.2217459 -3.078996  2.076993e-03
`chile$statusquo`    3.1689305  0.1447911 21.886224 3.514468e-106
```
All parameters are significant.
```
fitpred <- ifelse(fit$pred$pred == "Y", 1, 0)  #to compare with chile.v

> confusionMatrix(fitpred, chile.v)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 445 429
         1 422 407

               Accuracy : 0.5003          
                 95% CI : (0.4763, 0.5243)
    No Information Rate : 0.5091          
    P-Value [Acc > NIR] : 0.7738          
```
which is obviously very different from the previous confusion matrix. My expectation was that the cross-validated results should not perform much worse than the first model. However, the results show something else.
My assumption is that there is a mistake in the settings of the train() parameters, but I can't figure out what it is.
I would really appreciate some help. Thank you in advance.
You are trying to get an idea of the in-sample fit using a confusion matrix. Your first approach using the glm() function is fine.

The problem with the second approach using train() lies in the returned object. You are trying to extract the in-sample fitted values from it via fit$pred$pred. However, fit$pred does not contain fitted values that are aligned to chile.v or chile$vote. It contains the observations and fitted values of the different (10) folds:
```
> head(fit$pred)
  pred obs rowIndex parameter Resample
1    N   N        2      none   Fold01
2    Y   Y       20      none   Fold01
3    Y   Y       28      none   Fold01
4    N   N       38      none   Fold01
5    N   N       55      none   Fold01
6    N   N       66      none   Fold01
> tail(fit$pred)
     pred obs rowIndex parameter Resample
1698    Y   Y     1592      none   Fold10
1699    Y   N     1594      none   Fold10
1700    N   N     1621      none   Fold10
1701    N   N     1656      none   Fold10
1702    N   N     1671      none   Fold10
1703    Y   Y     1689      none   Fold10
```
So the rows of fit$pred are ordered by fold, not by the original row order. Comparing them element-wise with chile.v pairs predictions with the wrong observations, which amounts to an essentially random matching, and because you are predicting 0 or 1 you get an accuracy of roughly 50%.
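If you actually want the out-of-fold (cross-validated) predictions compared against the observed classes, you can realign fit$pred to the original row order via its rowIndex column. A minimal sketch, assuming the fit object from above:

```
#realign the out-of-fold predictions to the original row order
cv.ord  <- order(fit$pred$rowIndex)
cv.pred <- ifelse(fit$pred$pred[cv.ord] == "Y", 1, 0)
confusionMatrix(cv.pred, chile.v)  #newer caret versions may require factor() around both arguments
```

This yields a genuinely out-of-sample confusion matrix, so its accuracy should be close to the resampled accuracy reported by fit itself.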
The in-sample fitted values you are looking for are in fit$finalModel$fitted.values. Using those:
```
fitpred <- fit$finalModel$fitted.values
fitpredt <- function(t) ifelse(fitpred > t, 1, 0)

> confusionMatrix(fitpredt(0.3), chile.v)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 773  44
         1  94 792

               Accuracy : 0.919          
                 95% CI : (0.905, 0.9315)
    No Information Rate : 0.5091         
    P-Value [Acc > NIR] : < 2.2e-16      
                  Kappa : 0.8381         
 Mcnemar's Test P-Value : 3.031e-05      

            Sensitivity : 0.8916         
            Specificity : 0.9474         
         Pos Pred Value : 0.9461         
         Neg Pred Value : 0.8939         
             Prevalence : 0.5091         
         Detection Rate : 0.4539         
   Detection Prevalence : 0.4797         
      Balanced Accuracy : 0.9195         

       'Positive' Class : 0              
```
Now the accuracy is around the expected value. Setting the threshold to 0.5 yields about the same accuracy as the estimate from the 10-fold cross-validation:
```
> confusionMatrix(fitpredt(0.5), chile.v)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 809  64
         1  58 772

               Accuracy : 0.9284          
                 95% CI : (0.9151, 0.9402)
[rest of the output omitted]

> fit
Generalized Linear Model 

1703 samples
   7 predictors
   2 classes: 'N', 'Y' 

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 1533, 1532, 1532, 1533, 1532, 1533, ... 

Resampling results

  Accuracy  Kappa  Accuracy SD  Kappa SD
  0.927     0.854  0.0134       0.0267  
```
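The resampling estimates in that printout are also available programmatically; fit$results holds them as a data frame:

```
#the cross-validated performance estimates as a data frame
fit$results
#the mean CV accuracy (about 0.927) matches the in-sample accuracy at t = 0.5
fit$results$Accuracy
```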
Additionally, regarding your expectation "that the cross-validated results should not perform much worse than the first model," please check summary(res.chileIII) and summary(fit). The fitted models and coefficients are exactly the same, so they will give the same results.
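A quick way to verify this is to compare the two coefficient vectors directly; unname() is used because only the term labels differ (your train() formula used chile$-prefixed variables):

```
#compare the coefficient values of the two fits; unname() drops the
#differing term labels so only the numeric values are compared
all.equal(unname(coef(res.chileIII)), unname(coef(fit$finalModel)))
```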
P.S. I know my answer to this question is late--i.e. this is quite an old question. Is it OK to answer these questions anyway? I am new here and did not find anything about "late answers" in the help.