Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to set a weighted least-squares in r for heteroscedastic data?

I'm running a regression on census data where my dependent variable is life expectancy and I have eight independent variables. The data is aggregated be cities, so I have many thousand observations.

My model is somewhat heteroscedastic though. I want to run a weighted least-squares where each observation is weighted by the city’s population. In this case, it would mean that I want to weight the observations by the inverse of the square root of the population. It’s unclear to me, however, what would be the best syntax. Currently, I have:

Model=lm(…,weights=(1/population))

Is that correct? Or should it be:

Model=lm(…,weights=(1/sqrt(population)))

(I found this question here: Weighted Least Squares - R but it does not clarify how R interprets the weights argument.)

like image 299
Lucas De Abreu Maia Avatar asked Aug 15 '13 19:08

Lucas De Abreu Maia


2 Answers

From ?lm: "weights: an optional vector of weights to be used in the fitting process. Should be NULL or a numeric vector. If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used." R doesn't do any further interpretation of the weights argument.

So, if what you want to minimize is the sum of (the squared distance from each point to the fit line * 1/sqrt(population) then you want ...weights=(1/sqrt(population)). If you want to minimize the sum of (the squared distance from each point to the fit line * 1/population) then you want ...weights=1/population.

As to which of those is most appropriate... that's a question for CrossValidated!

like image 189
Drew Steen Avatar answered Oct 15 '22 20:10

Drew Steen


To answer your question, Lucas, I think you want weights=(1/population). R parameterizes the weights as inversely proportional to the variances, so specifying the weights this way amounts to assuming that the variance of the error term is proportional to the population of the city, which is a common assumption in this setting.

But check the assumption! If the variance of the error term is indeed proportional to the population size, then if you divide each residual by the square root of its corresponding sample size, the residuals should have constant variance. Remember, dividing a random variable by a constant results in the variance being divided by the square of that constant.

Here's how you can check this: Obtain residuals from the regression by

residuals = lm(..., weights = 1/population)$residuals

Then divide the residuals by the square roots of the population variances:

standardized_residuals = residuals/sqrt(population)

Then compare the sample variance among the residuals corresponding to the bottom half of population sizes:

variance1 = var(standardized_residuals[population < median(population)])

to the sample variance among the residuals corresponding to the upper half of population sizes:

variance2 = var(standardized_residuals[population > median(population)])

If these two numbers, variance1 and variance2 are similar, then you're doing something right. If they are drastically different, then maybe your assumption is violated.

like image 35
Scott Powers Avatar answered Oct 15 '22 20:10

Scott Powers