Screening (multi)collinearity in a regression model

I hope this one is not going to be an "ask-and-answer" question... here goes: (multi)collinearity refers to extremely high correlations between predictors in a regression model. How to cure it... well, sometimes you don't need to "cure" collinearity, since it doesn't affect the regression model itself, only the interpretation of the effects of the individual predictors.

One way to spot collinearity is to take each predictor in turn as the dependent variable, regress it on the remaining predictors, and determine R^2; if R^2 is larger than .9 (or .95), we can consider that predictor redundant. This is one "method"... what about other approaches? Some of them are time consuming, like excluding predictors from the model one at a time and watching for changes in the b-coefficients - when collinearity is present, they should change noticeably.
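To make that concrete, here is a minimal R sketch of the screening idea (the data frame d and predictors x1-x3 are made up purely for illustration):

# regress each predictor on the remaining ones and flag those with high R^2
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$x3 <- d$x1 + 2 * d$x2 + rnorm(100, sd = 0.01)   # x3 is nearly redundant
r2 <- sapply(names(d), function(v) {
  f <- reformulate(setdiff(names(d), v), response = v)
  summary(lm(f, data = d))$r.squared
})
r2[r2 > 0.9]                                      # predictors flagged as redundant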

Of course, we must always bear in mind the specific context/goal of the analysis... Sometimes the only remedy is to repeat the research, but right now I'm interested in various ways of screening redundant predictors when (multi)collinearity occurs in a regression model.

asked Jun 15 '10 by aL3xa


3 Answers

The kappa() function can help. Here is a simulated example:

> set.seed(42)
> x1 <- rnorm(100)
> x2 <- rnorm(100)
> x3 <- x1 + 2*x2 + rnorm(100)*0.0001    # so x3 approx a linear comb. of x1+x2
> mm12 <- model.matrix(~ x1 + x2)        # normal model, two indep. regressors
> mm123 <- model.matrix(~ x1 + x2 + x3)  # bad model with near collinearity
> kappa(mm12)                            # a 'low' kappa is good
[1] 1.166029
> kappa(mm123)                           # a 'high' kappa indicates trouble
[1] 121530.7

and we go further by making the third regressor more and more collinear:

> x4 <- x1 + 2*x2 + rnorm(100)*0.000001  # even more collinear
> mm124 <- model.matrix(~ x1 + x2 + x4)
> kappa(mm124)
[1] 13955982
> x5 <- x1 + 2*x2                        # now x5 is linear comb of x1,x2
> mm125 <- model.matrix(~ x1 + x2 + x5)
> kappa(mm125)
[1] 1.067568e+16

This uses approximations by default; see help(kappa) for details.
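For comparison, a small sketch of the exact computation (both lines simply reuse mm123 from above; kappa()'s exact argument is documented on the same help page):

> kappa(mm123, exact = TRUE)             # exact 2-norm condition number via the SVD
> max(svd(mm123)$d) / min(svd(mm123)$d)  # the same ratio computed by hand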

answered by Dirk Eddelbuettel


Just to add to what Dirk said about the Condition Number method, a rule of thumb is that values of CN > 30 indicate severe collinearity. Other methods, apart from the condition number, include:

1) The determinant of the correlation matrix, which ranges from 0 (perfect collinearity) to 1 (no collinearity). The determinant of the covariance matrix, used in the example below, shrinks towards 0 in the same way as collinearity increases, although it is not bounded above by 1 in general.

# using Dirk's example
> det(cov(mm12[,-1]))
[1] 0.8856818
> det(cov(mm123[,-1]))
[1] 8.916092e-09

2) Using the fact that the determinant of a matrix equals the product of its eigenvalues => the presence of one or more eigenvalues close to zero indicates collinearity

> eigen(cov(mm12[,-1]))$values
[1] 1.0876357 0.8143184

> eigen(cov(mm123[,-1]))$values
[1] 5.388022e+00 9.862794e-01 1.677819e-09

3) The value of the Variance Inflation Factor (VIF). The VIF for predictor i is 1/(1 - R_i^2), where R_i^2 is the R^2 from a regression of predictor i on the remaining predictors. Collinearity is present when the VIF for at least one independent variable is large. Rule of thumb: VIF > 10 is cause for concern. For an implementation in R, see the vif() function in the car package (a small sketch follows below). I would also like to comment that the use of R^2 for detecting collinearity should go hand in hand with visual examination of the scatterplots, because a single outlier can "cause" collinearity where it doesn't exist, or can hide collinearity where it does exist.
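A minimal sketch of the VIF computation, reusing Dirk's simulated x1, x2 and x3; the response y below is hypothetical, added only so that car::vif() has a fitted model to work on:

> r2_x3 <- summary(lm(x3 ~ x1 + x2))$r.squared   # R^2 of x3 on the other predictors
> 1 / (1 - r2_x3)                                # VIF for x3: far above 10 here
> library(car)
> y <- x1 + x2 + rnorm(100)                      # hypothetical response
> vif(lm(y ~ x1 + x2 + x3))                      # VIFs for all predictors at once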

answered by George Dontas


You might like Vito Ricci's Reference Card "R Functions For Regression Analysis" http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf

It succinctly lists many useful regression-related functions in R, including diagnostic functions. In particular, it lists the vif function from the car package, which can assess multicollinearity: http://en.wikipedia.org/wiki/Variance_inflation_factor

Consideration of multicollinearity often goes hand in hand with issues of assessing variable importance. If this applies to you, perhaps check out the relaimpo package: http://prof.beuth-hochschule.de/groemping/relaimpo/
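A minimal usage sketch, assuming relaimpo is installed; the fitted model fit is hypothetical:

> library(relaimpo)
> fit <- lm(y ~ x1 + x2 + x3)            # some previously fitted regression model
> calc.relimp(fit, type = "lmg")         # decomposes R^2 into per-predictor shares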

answered by Jeromy Anglim