I'm trying to do a 10-fold cross validation for some glm models that I have built earlier in R. I'm a little confused about the cv.glm()
function in the boot
package, although I've read a lot of help files. When I provide the following formula:
library(boot)
cv.glm(data, glmfit, K=10)
Does the "data" argument here refer to the whole dataset or only to the test set?
The examples I have seen so far provide the "data" argument as the test set but that did not really make sense, such as why do 10-folds on the same test set? They are all going to give exactly the same result (I assume!).
Unfortunately ?cv.glm
explains it in a foggy way:
data: A matrix or data frame containing the data. The rows should be cases and the columns correspond to variables, one of which is the response
My other question would be about the $delta[1]
result. Is this the average prediction error over the 10 trials? What if I want to get the error for each fold?
Here's what my script looks like:
##data partitioning
sub <- sample(nrow(data), floor(nrow(x) * 0.9))
training <- data[sub, ]
testing <- data[-sub, ]
##model building
model <- glm(formula = groupcol ~ var1 + var2 + var3,
family = "binomial", data = training)
##cross-validation
cv.glm(testing, model, K=10)
The cv. glm() function produces a list with several components. The two numbers in the delta vector contain the cross-validation results. In this case the numbers are identical (up to two decimal places) and correspond to the LOOCV statistic: our cross-validation estimate for the test error is approximately 24.23.
This method, also known as Monte Carlo cross-validation, creates multiple random splits of the dataset into training and validation data. For each such split, the model is fit to the training data, and predictive accuracy is assessed using the validation data. The results are then averaged over the splits.
Cross-validating is repeated model fitting. Each fit is done on a (major) portion of the data and is tested on the portion of the data that was left out during fitting. This is repeated until every observation is used for testing.
GLM Function Syntax: glm (formula, family, data, weights, subset, Start=null, model=TRUE,method=””…) Here Family types (include model types) includes binomial, Poisson, Gaussian, gamma, quasi. Each distribution performs a different usage and can be used in either classification and prediction.
I am always a little cautious about using various packages 10-fold cross validation methods. I have my own simple script to create the test and training partitions manually for any machine learning package:
#Randomly shuffle the data
yourData<-yourData[sample(nrow(yourData)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(yourData)),breaks=10,labels=FALSE)
#Perform 10 fold cross validation
for(i in 1:10){
#Segement your data by fold using the which() function
testIndexes <- which(folds==i,arr.ind=TRUE)
testData <- yourData[testIndexes, ]
trainData <- yourData[-testIndexes, ]
#Use test and train data partitions however you desire...
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With