I hope this one is not going to be an "ask-and-answer" question... here goes: (multi)collinearity refers to extremely high correlations between predictors in a regression model. How to cure it... well, sometimes you don't need to "cure" collinearity, since it doesn't affect the regression model itself, but only the interpretation of the effects of individual predictors.
One way to spot collinearity is to take each predictor in turn as the dependent variable and the remaining predictors as independent variables, determine R^2, and, if it's larger than .9 (or .95), consider that predictor redundant. This is one "method"... what about other approaches? Some of them are time consuming, like excluding predictors from the model and watching for changes in the b-coefficients - they should be noticeably different.
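In R, that screening could be sketched like this (the function name and the .9 cutoff are just illustrative choices):

redundant_r2 <- function(X, cutoff = 0.9) {
  # regress each predictor on all the others and collect the R^2 values
  r2 <- sapply(seq_len(ncol(X)), function(i)
    summary(lm(X[, i] ~ X[, -i]))$r.squared)
  names(r2) <- colnames(X)
  r2[r2 > cutoff]   # predictors that look redundant at this cutoff
}
# e.g. redundant_r2(model.matrix(fit)[, -1]) for some fitted lm() model 'fit'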
Of course, we must always bear in mind the specific context/goal of the analysis... Sometimes the only remedy is to repeat the research, but right now I'm interested in various ways of screening redundant predictors when (multi)collinearity occurs in a regression model.
Statistical tools often used to test for multicollinearity are the variance inflation factor (VIF), Pearson correlations between the independent variables, or inspection of the eigenvalues.
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with one another. This means that one independent variable can be predicted from the others in the model.
Among all these, Pearson's correlation coefficient and the VIF are the most commonly used tests for examining the presence of multicollinearity.
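In R, the first two checks can be sketched as follows (assuming a numeric matrix of predictors X):

> cor(X)                 # pairwise Pearson correlations between the predictors
> eigen(cor(X))$values   # one or more very small eigenvalues indicate collinearity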
The kappa() function can help. Here is a simulated example:

> set.seed(42)
> x1 <- rnorm(100)
> x2 <- rnorm(100)
> x3 <- x1 + 2*x2 + rnorm(100)*0.0001    # so x3 approx a linear comb. of x1+x2
> mm12 <- model.matrix(~ x1 + x2)        # normal model, two indep. regressors
> mm123 <- model.matrix(~ x1 + x2 + x3)  # bad model with near collinearity
> kappa(mm12)                            # a 'low' kappa is good
[1] 1.166029
> kappa(mm123)                           # a 'high' kappa indicates trouble
[1] 121530.7
and we go further by making the third regressor more and more collinear:
> x4 <- x1 + 2*x2 + rnorm(100)*0.000001  # even more collinear
> mm124 <- model.matrix(~ x1 + x2 + x4)
> kappa(mm124)
[1] 13955982
> x5 <- x1 + 2*x2                        # now x5 is a linear comb. of x1 and x2
> mm125 <- model.matrix(~ x1 + x2 + x5)
> kappa(mm125)
[1] 1.067568e+16
This used approximations; see help(kappa) for details.
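By default kappa() returns a fast estimate; assuming the model matrices above are still in the workspace, the exact condition number can be requested like this:

> kappa(mm12, exact = TRUE)
> kappa(mm123, exact = TRUE)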
Just to add to what Dirk said about the Condition Number method, a rule of thumb is that values of CN > 30 indicate severe collinearity. Other methods, apart from the condition number, include:
1) the determinant of the correlation matrix, which ranges from 0 (perfect collinearity) to 1 (no collinearity); since x1 and x2 in Dirk's example have variance close to 1, the covariance matrix behaves nearly the same here
# using Dirk's example
> det(cov(mm12[,-1]))
[1] 0.8856818
> det(cov(mm123[,-1]))
[1] 8.916092e-09
2) Using the fact that the determinant of a matrix is the product of its eigenvalues => the presence of one or more small eigenvalues indicates collinearity
> eigen(cov(mm12[,-1]))$values
[1] 1.0876357 0.8143184
> eigen(cov(mm123[,-1]))$values
[1] 5.388022e+00 9.862794e-01 1.677819e-09
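These eigenvalues also connect back to the condition number: for the cross-product matrix X'X, the 2-norm condition number of X is the square root of the ratio of the largest to the smallest eigenvalue. A quick sketch, reusing Dirk's mm123:

> ev <- eigen(crossprod(mm123))$values
> sqrt(max(ev) / min(ev))   # should be close to kappa(mm123, exact = TRUE)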
3) The value of the Variance Inflation Factor (VIF). The VIF for predictor i is 1/(1 - R_i^2), where R_i^2 is the R^2 from a regression of predictor i against the remaining predictors. Collinearity is present when the VIF for at least one independent variable is large. Rule of thumb: VIF > 10 is of concern. For an implementation in R see here. I would also like to comment that the use of R^2 for determining collinearity should go hand in hand with visual examination of the scatterplots, because a single outlier can "cause" collinearity where it doesn't exist, or can hide collinearity where it exists.
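To illustrate, here is a hedged sketch of both routes, reusing Dirk's x1, x2 and x3 with a made-up response y (vif() from the car package is one common implementation):

> y <- x1 + x2 + rnorm(100)     # hypothetical response, just so lm() has something to fit
> fit <- lm(y ~ x1 + x2 + x3)
> library(car)
> vif(fit)                      # expect very large values for all three predictors here
> 1/(1 - summary(lm(x3 ~ x1 + x2))$r.squared)   # VIF for x3 straight from the definition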
You might like Vito Ricci's Reference Card "R Functions For Regression Analysis" http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf
It succinctly lists many useful regression-related functions in R, including diagnostic functions.
In particular, it lists the vif function from the car package, which can assess multicollinearity.
http://en.wikipedia.org/wiki/Variance_inflation_factor
Consideration of multicollinearity often goes hand in hand with issues of assessing variable importance. If this applies to you, perhaps check out the relaimpo package: http://prof.beuth-hochschule.de/groemping/relaimpo/
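A minimal sketch of what that might look like (the model and data are purely illustrative; calc.relimp() decomposes the model R^2 into per-predictor shares):

> library(relaimpo)
> fit <- lm(mpg ~ wt + hp + disp, data = mtcars)   # wt and disp are quite correlated
> calc.relimp(fit, type = "lmg", rela = TRUE)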