I was tasked with developing a regression model looking at student enrollment in different programs. This is a very nice, clean data set where the enrollment counts follow a Poisson distribution well. I fit models in R (both a GLM and a zero-inflated Poisson model). The resulting residuals seemed reasonable.
However, I was then instructed to change the count of students to a "rate", calculated as students / school_population (each school has its own population). This is no longer a count variable but a proportion between 0 and 1, considered the "proportion of enrollment" in a program.
This "rate" (students/population) is no longer Poisson, but is certainly not normal either. So, I'm a bit lost as to the appropriate distribution, and subsequent model to represent it.
A log-normal distribution seems to fit this rate well; however, I have many 0 values, and the log-normal is only defined for strictly positive values, so it won't actually fit.
Any suggestions on the best form of distribution for this new parameter, and how to model it in R?
Thanks!
Poisson regression models are best suited to outcomes that are counts: discrete, non-negative integer values that count something, such as the number of times an event occurs during a given timeframe or the number of people in line at the grocery store.
There is a fundamental difference between a classical linear regression model and the specification for the conditional mean in the Poisson regression model, in that the latter does not contain a random error term (in its “pure” form).
The predict() function in R returns predicted values from a fitted model, optionally for new input data. It is a generic function: each model class supplies its own method (for example, predict.glm for GLMs), but the way you call predict() stays the same regardless of the model. For a GLM, predictions are on the scale of the linear predictor by default; use type="response" to get them on the scale of the response.
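As a minimal sketch of how predict() behaves for a Poisson GLM, the example below uses simulated toy data (the data frame `d`, its columns, and the coefficient values are made up for illustration). It compares the default link-scale prediction with type="response":

```r
set.seed(1)

# Hypothetical toy data: one predictor and a Poisson count response
d <- data.frame(x = runif(50))
d$y <- rpois(50, lambda = exp(0.5 + 1.2 * d$x))

fit <- glm(y ~ x, family = poisson, data = d)

new_d <- data.frame(x = c(0.1, 0.5, 0.9))

# Default: predictions on the linear predictor (log) scale
eta <- predict(fit, newdata = new_d)

# type = "response": predictions as expected counts
mu <- predict(fit, newdata = new_d, type = "response")
```

With a log link, exp(eta) and mu should agree, which is a quick way to check which scale a prediction is on.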
As suggested in the comments you could keep the Poisson model and do it with an offset:
glm(response ~ predictor1 + predictor2 + predictor3 + ... + offset(log(population)),
    family=poisson, data=...)
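To make the offset idea concrete, here is a small sketch on simulated data. The data frame `dat`, the single predictor, and the true coefficient values are assumptions for illustration only; the column names just mirror the placeholders in the call above. Because log(population) enters with a fixed coefficient of 1, the remaining coefficients describe the log of the rate (students per unit of population) rather than the log of the raw count:

```r
set.seed(42)
n <- 200

# Hypothetical data: one predictor and a school population per row
dat <- data.frame(
  predictor1 = rnorm(n),
  population = sample(200:2000, n, replace = TRUE)
)

# Assumed true enrollment rate; expected counts scale with population
true_rate <- exp(-3 + 0.4 * dat$predictor1)
dat$response <- rpois(n, lambda = dat$population * true_rate)

fit_rate <- glm(response ~ predictor1 + offset(log(population)),
                family = poisson, data = dat)

# Predicted rate per student: population = 1 makes the offset log(1) = 0
new_dat <- data.frame(predictor1 = c(-1, 0, 1), population = 1)
pred_rate <- predict(fit_rate, newdata = new_dat, type = "response")
```

The response stays an integer count, so the Poisson assumptions from the original count model still apply, and the zeros cause no trouble.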
Or you could use a binomial GLM, either
glm(cbind(response,pop_size-response) ~ predictor1 + ... , family=binomial,
data=...)
or
glm(response/pop_size ~ predictor1 + ... , family=binomial,
weights=pop_size,
data=...)
The latter form is sometimes more convenient, although less widely used.
Be aware that in general switching from Poisson to binomial will change the link function from log to logit, although you can use family=binomial(link="log") if you prefer.
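The two binomial specifications above should give the same fit. A quick sketch on simulated data (the data frame `dat`, the single predictor, and the true coefficients are hypothetical) checks this with the default logit link:

```r
set.seed(7)
n <- 150

# Hypothetical data: a predictor and a population size per school
dat <- data.frame(
  predictor1 = rnorm(n),
  pop_size   = sample(100:1000, n, replace = TRUE)
)
dat$response <- rbinom(n, size = dat$pop_size,
                       prob = plogis(-2.5 + 0.5 * dat$predictor1))

# Form 1: two-column response of (successes, failures)
fit_cbind <- glm(cbind(response, pop_size - response) ~ predictor1,
                 family = binomial, data = dat)

# Form 2: observed proportion, weighted by the number of trials
fit_prop <- glm(response / pop_size ~ predictor1,
                family = binomial, weights = pop_size, data = dat)

# Largest absolute difference between the two sets of coefficients
coef_diff <- max(abs(coef(fit_cbind) - coef(fit_prop)))
```

Here coef_diff should be zero up to numerical tolerance. Note that these coefficients are on the log-odds scale, so they are not directly comparable to those from the Poisson + offset model unless the log link is used.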
Zero-inflation might be easier to model with the Poisson + offset combination, which will be more commonly available than a zero-inflated binomial model. (I'm not sure if the pscl package, the most common approach to ZIP, handles offsets, but I think it does.) I think glmmADMB will do a zero-inflated binomial model, but I haven't tested it.