
Regression for a Rate variable in R

I was tasked with developing a regression model of student enrollment in different programs. This is a very nice, clean data set in which the enrollment counts follow a Poisson distribution well. I fit models in R (both a GLM and a zero-inflated Poisson model), and the resulting residuals seemed reasonable.

However, I was then instructed to change the count of students to a "rate", calculated as students / school_population (each school has its own population). This is no longer a count variable but a proportion between 0 and 1, and is considered the "proportion of enrollment" in a program.

This "rate" (students/population) is no longer Poisson, but is certainly not normal either. So, I'm a bit lost as to the appropriate distribution, and subsequent model to represent it.

A log-normal distribution seems to fit this rate well, but I have many 0 values, which a log-normal distribution cannot accommodate, so it won't actually fit.

Any suggestions on the best form of distribution for this new parameter, and how to model it in R?

Thanks!

asked Apr 16 '13 20:04 by Noah


1 Answer

As suggested in the comments, you could keep the Poisson model and add an offset:

glm(response ~ predictor1 + predictor2 + predictor3 + ... + offset(log(population)),
     family=poisson, data=...)
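
To make the offset form concrete, here is a minimal sketch on simulated data. The data frame `schools`, the single predictor, and the "true" coefficients are all made up for illustration; only `response`, `predictor1`, and `population` echo the names used above.

```r
# Simulated data; every value here is an illustrative assumption.
set.seed(1)
n <- 200
schools <- data.frame(
  population = sample(200:2000, n, replace = TRUE),
  predictor1 = rnorm(n)
)

# Assumed true enrollment rate, log-linear in predictor1.
true_rate <- exp(-3 + 0.5 * schools$predictor1)
schools$response <- rpois(n, lambda = schools$population * true_rate)

# offset(log(population)) fixes the coefficient of log(population) at 1,
# so the other coefficients describe log(expected response / population).
fit_pois <- glm(response ~ predictor1 + offset(log(population)),
                family = poisson, data = schools)
print(coef(fit_pois))

# Fitted rate per school: fitted count divided by population.
pred_rate <- predict(fit_pois, type = "response") / schools$population
```

The response stays an integer count, so the Poisson likelihood still applies, while the coefficients are interpreted on the rate scale.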

Or you could use a binomial GLM, either

glm(cbind(response,pop_size-response) ~ predictor1 + ... , family=binomial,
        data=...)

or

glm(response/pop_size ~ predictor1 + ... , family=binomial,
        weights=pop_size,
        data=...)

The latter form is sometimes more convenient, although less widely used. Be aware that, in general, switching from Poisson to binomial will change the link function from log to logit, although you can use family=binomial(link="log") if you prefer.
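
As a quick check that the two binomial specifications describe the same model, the sketch below fits both to simulated data and compares the coefficients. The data frame `schools` and the simulation settings are assumptions for illustration only.

```r
# Simulated data; every value here is an illustrative assumption.
set.seed(2)
n <- 200
schools <- data.frame(
  pop_size   = sample(200:2000, n, replace = TRUE),
  predictor1 = rnorm(n)
)

# Assumed true enrollment probability, logit-linear in predictor1.
p <- plogis(-3 + 0.5 * schools$predictor1)
schools$response <- rbinom(n, size = schools$pop_size, prob = p)

# Form 1: two-column response of (successes, failures).
fit_cbind <- glm(cbind(response, pop_size - response) ~ predictor1,
                 family = binomial, data = schools)

# Form 2: proportion as the response, with the denominators as weights.
fit_prop <- glm(response / pop_size ~ predictor1,
                family = binomial, weights = pop_size, data = schools)

# The two sets of coefficients should agree up to numerical tolerance.
print(all.equal(coef(fit_cbind), coef(fit_prop)))
```

Note that the proportion form only matches the two-column form when `weights` holds the denominators. Without the weights, every school would be treated as a single trial.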

Zero-inflation might be easier to model with the Poisson + offset combination, which will be more commonly available than a zero-inflated binomial model. (I'm not sure whether the pscl package, the most common approach to ZIP, handles offsets, but I think it does.)
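
A minimal sketch of the zero-inflated Poisson + offset idea, assuming the pscl package is installed and that `zeroinfl()` accepts `offset()` terms in the count part of the formula, as its documentation describes. The simulated data, the 30% structural-zero share, and the intercept-only zero model (`| 1`) are illustrative assumptions.

```r
# Simulated data; every value here is an illustrative assumption.
set.seed(3)
n <- 500
schools <- data.frame(
  population = sample(200:2000, n, replace = TRUE),
  predictor1 = rnorm(n)
)

# Assumed true rate, plus roughly 30% structural zeros
# (schools that never enroll anyone in the program).
rate <- exp(-5 + 0.5 * schools$predictor1)
structural_zero <- rbinom(n, size = 1, prob = 0.3) == 1
schools$response <- ifelse(structural_zero, 0L,
                           rpois(n, lambda = schools$population * rate))

# Fit only if pscl is available. The part before "|" is the count model
# (with the offset); the part after "|" is the zero-inflation model.
if (requireNamespace("pscl", quietly = TRUE)) {
  fit_zip <- pscl::zeroinfl(
    response ~ predictor1 + offset(log(population)) | 1,
    data = schools, dist = "poisson"
  )
  print(coef(fit_zip))
}
```

Predictors can be added after the `|` if the probability of a structural zero is expected to vary between schools.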

I think glmmADMB will do a zero-inflated binomial model, but I haven't tested it.

answered Oct 18 '22 11:10 by Ben Bolker