Cross validation for glm() models

Tags: r, partitioning, glm, prediction, cross-validation

I'm trying to do a 10-fold cross-validation for some glm models that I built earlier in R. I'm a little confused about the cv.glm() function in the boot package, although I've read a lot of help files. When I make the following call:

library(boot)
cv.glm(data, glmfit, K=10)

Does the "data" argument here refer to the whole dataset or only to the test set?

The examples I have seen so far pass the test set as the "data" argument, but that did not really make sense to me: why run 10 folds on the same test set? They are all going to give exactly the same result (I assume!).

Unfortunately ?cv.glm explains it in a foggy way:

data: A matrix or data frame containing the data. The rows should be cases and the columns correspond to variables, one of which is the response

My other question would be about the $delta[1] result. Is this the average prediction error over the 10 trials? What if I want to get the error for each fold?

Here's what my script looks like:

##data partitioning
sub <- sample(nrow(data), floor(nrow(data) * 0.9))
training <- data[sub, ]
testing <- data[-sub, ]

##model building
model <- glm(formula = groupcol ~ var1 + var2 + var3,
        family = "binomial", data = training)

##cross-validation
cv.glm(testing, model, K=10)

615

asked Jan 27 '14 11:01

Error404

1 Answer

I am always a little cautious about using various packages' 10-fold cross-validation methods. I have my own simple script to create the test and training partitions manually for any machine learning package:

#Randomly shuffle the data
yourData<-yourData[sample(nrow(yourData)),]

#Create 10 equally sized folds
folds <- cut(seq(1,nrow(yourData)),breaks=10,labels=FALSE)

#Perform 10 fold cross validation
for(i in 1:10){
    #Segment your data by fold using the which() function
    testIndexes <- which(folds==i,arr.ind=TRUE)
    testData <- yourData[testIndexes, ]
    trainData <- yourData[-testIndexes, ]
    #Use test and train data partitions however you desire...
}
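
As a rough illustration of how the loop above might be filled in for the logistic model from the question, here is a minimal sketch on simulated data. The data frame simData, the simulated coefficients, and the vector foldError are placeholders made up for this example; mean squared error between the 0/1 outcome and the predicted probability is just one possible error measure. The final lines assume the boot package is installed and are skipped otherwise.

```r
# Minimal sketch: per-fold error for a binomial glm on simulated data.
# simData, the coefficients and foldError are illustrative placeholders.
set.seed(42)
n <- 200
simData <- data.frame(var1 = rnorm(n), var2 = rnorm(n), var3 = rnorm(n))
simData$groupcol <- rbinom(n, 1, plogis(0.8 * simData$var1 - 0.5 * simData$var2))

# Randomly shuffle the rows, then assign each row to one of 10 folds
simData <- simData[sample(nrow(simData)), ]
folds <- cut(seq_len(nrow(simData)), breaks = 10, labels = FALSE)

foldError <- numeric(10)
for (i in 1:10) {
  testIndexes <- which(folds == i)
  testData <- simData[testIndexes, ]
  trainData <- simData[-testIndexes, ]

  # Refit the model on the 9 training folds only
  fit <- glm(groupcol ~ var1 + var2 + var3, family = binomial, data = trainData)

  # Predicted probabilities for the held-out fold
  p <- predict(fit, newdata = testData, type = "response")

  # Mean squared error between the 0/1 outcome and the predicted probability
  foldError[i] <- mean((testData$groupcol - p)^2)
}

foldError        # error for each fold
mean(foldError)  # average over the 10 folds

# Optional comparison with cv.glm(), given the same full dataset
if (requireNamespace("boot", quietly = TRUE)) {
  fullFit <- glm(groupcol ~ var1 + var2 + var3, family = binomial, data = simData)
  cvRes <- boot::cv.glm(simData, fullFit, K = 10)
  cvRes$delta[1]  # raw cross-validation estimate of prediction error
}
```

Note that cv.glm() is given the same data the model was fitted on, because it refits the model internally for each fold. Its default cost is also average squared error, so delta[1] should land in the same range as mean(foldError), although not exactly equal because the random fold assignment differs.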

128

answered Sep 24 '22 11:09

Jake Drew
