I am interested in using cross-validation (leave-one-out or K-fold) to test several different negative binomial GLMs that I have created. I am using the glm.nb() function from MASS to run the negative binomial regressions.
My question is whether I can use cv.glm() from boot to test these models. I am leaning towards no, but wondered if anyone knew a function that would let me perform K-fold (or leave-one-out) validation. Or perhaps cv.glm() is perfectly valid for negative binomial models.
Here is some data from an online example to help out. I had thought that the cross-validation results ($delta) should be bounded between 0 and 1, but this is not the case below (which is possibly indicative of something gone awry):
http://www.ats.ucla.edu/stat/r/dae/nbreg.htm
I've found a few questions about interpreting the output of cv.glm(), but nothing specifically about how to do cross-validation with negative binomial models in R.
require(foreign)
require(ggplot2)
require(MASS)
require(boot)
dat <- read.dta("http://www.ats.ucla.edu/stat/stata/dae/nb_data.dta")
dat <- within(dat, {
prog <- factor(prog, levels = 1:3, labels = c("General", "Academic","Vocational"))
id <- factor(id)
})
summary(m1 <- glm.nb(daysabs ~ math + prog, data = dat))
# This code will run, but is it valid for negative binomial GLM models?
cv.glm(dat, m1, K = 10)
[I'm not sure if this question belongs here or on Cross Validated...]
Using cv.glm() is valid for any glm object. It does not use any updating formula to compute the cross-validation score; it simply drops a group, fits the model to the reduced dataset, and computes a prediction error. This is repeated many times and the results are averaged. For this reason, leave-one-out cross-validation (the default) is much more costly than K-fold cross-validation.
Why is it valid for any glm object? How does it know what model it should fit? Well, you tell it, by passing in your fitted model m1. Have a look at:
m1$call
#glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156,
# link = log)
When cv.glm() drops data and refits the model on the reduced dataset, it reuses this call. So you are definitely fitting a negative binomial model every time.
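You can check this behaviour yourself: refitting from the stored call via update(), which is effectively what cv.glm() does internally, still produces a negative binomial fit. A small sketch with simulated data (the names df, x, and y are my own; rnegbin() is from MASS):

```r
library(MASS)

set.seed(2)
df <- data.frame(x = rnorm(100))
df$y <- rnegbin(100, mu = exp(1 + 0.5 * df$x), theta = 2)  # simulated NB response

fit   <- glm.nb(y ~ x, data = df)
refit <- update(fit, data = df[1:80, ])  # drop rows and refit from the stored call

class(refit)  # still a "negbin" object, i.e. a negative binomial GLM
```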
I am trying to interpret the results from cv.glm()$delta. In the example above, the model is predicting "days absent from school", and the output of $delta is ~42.1. Would I interpret this as "on average, this model has a prediction error of 42 days"?
By default, cv.glm() returns the MSE, as defined by the cost argument:
cost = function(y, yhat) mean((y - yhat)^2)
So 42.1 is really a prediction variance. You need the RMSE, i.e., sqrt(42.1) = 6.5, for the prediction error / standard deviation. This has the same units as your response variable.
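Putting it together, extracting the RMSE from cv.glm() output looks like this. This is a self-contained sketch with simulated data rather than the UCLA dataset from the question (the names df, x, and y are my own; rnegbin() is from MASS):

```r
library(MASS)
library(boot)

set.seed(3)
df <- data.frame(x = rnorm(150))
df$y <- rnegbin(150, mu = exp(1 + 0.5 * df$x), theta = 2)  # simulated NB response

fit <- glm.nb(y ~ x, data = df)
cv  <- cv.glm(df, fit, K = 10)

cv$delta[1]        # cross-validated MSE (raw estimate; delta[2] is bias-adjusted)
sqrt(cv$delta[1])  # RMSE, in the units of the response variable
```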