Cross validation for glm() models

I'm trying to do a 10-fold cross validation for some glm models that I have built earlier in R. I'm a little confused about the cv.glm() function in the boot package, even though I've read a lot of help files. When I make the following call:

library(boot)
cv.glm(data, glmfit, K=10)

Does the "data" argument here refer to the whole dataset or only to the test set?

The examples I have seen so far pass the test set as the "data" argument, but that doesn't really make sense to me: why run 10 folds on the same test set? They are all going to give exactly the same result (I assume!).

Unfortunately ?cv.glm explains it in a foggy way:

data: A matrix or data frame containing the data. The rows should be cases and the columns correspond to variables, one of which is the response

My other question would be about the $delta[1] result. Is this the average prediction error over the 10 trials? What if I want to get the error for each fold?
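For reference, here is the kind of call I am experimenting with, using mtcars and a simple logistic model as stand-ins for my real data and variables:

```r
library(boot)

# Stand-in example: mtcars and am ~ wt replace my real data and formula
set.seed(1)
fit <- glm(am ~ wt, family = "binomial", data = mtcars)
cv  <- cv.glm(mtcars, fit, K = 10)
cv$delta  # two numbers -- is delta[1] the average prediction error over the folds?
```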

Here's what my script looks like:

##data partitioning
sub <- sample(nrow(data), floor(nrow(data) * 0.9))
training <- data[sub, ]
testing <- data[-sub, ]

##model building
model <- glm(formula = groupcol ~ var1 + var2 + var3,
        family = "binomial", data = training)

##cross-validation
cv.glm(testing, model, K=10)
asked Jan 27 '14 by Error404


1 Answer

I am always a little cautious about using various packages' built-in 10-fold cross-validation methods. I have my own simple script to create the test and training partitions manually for any machine learning package:

#Randomly shuffle the data
yourData <- yourData[sample(nrow(yourData)), ]

#Create 10 equally sized folds
folds <- cut(seq(1, nrow(yourData)), breaks = 10, labels = FALSE)

#Perform 10-fold cross validation
for(i in 1:10){
    #Segment your data by fold using the which() function
    testIndexes <- which(folds == i, arr.ind = TRUE)
    testData <- yourData[testIndexes, ]
    trainData <- yourData[-testIndexes, ]
    #Use the test and train data partitions however you desire...
}
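To get an error for each fold (as asked above), the loop can be extended to fit a model on the training partition and score the held-out fold. A sketch, using mtcars and a simple logistic model as stand-ins for the question's data and variables:

```r
# Stand-in data and formula: mtcars and am ~ wt replace the question's
# dataset and groupcol ~ var1 + var2 + var3
set.seed(42)
yourData <- mtcars[sample(nrow(mtcars)), ]
folds <- cut(seq(1, nrow(yourData)), breaks = 10, labels = FALSE)

foldErrors <- numeric(10)
for (i in 1:10) {
    testIndexes <- which(folds == i)
    testData  <- yourData[testIndexes, ]
    trainData <- yourData[-testIndexes, ]
    # Fit on the training partition, predict on the held-out fold
    fit <- glm(am ~ wt, family = "binomial", data = trainData)
    p <- predict(fit, newdata = testData, type = "response")
    # Mean squared error of predicted probabilities for this fold
    foldErrors[i] <- mean((testData$am - p)^2)
}

foldErrors        # one error per fold
mean(foldErrors)  # average over folds, comparable to cv.glm()$delta[1]
```

Averaging the per-fold errors gives you the overall estimate, while keeping the vector lets you inspect fold-to-fold variability.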
answered Sep 24 '22 by Jake Drew