Remove highly correlated variables

I have a huge data frame (5600 x 6592) and I want to remove any variables that are correlated with each other by more than 0.99. I know how to do this the long way, step by step: form a correlation matrix, round the values, remove the duplicated ones, and use the indexing to get my "reduced" data again.

cor(mydata)
mydata <- round(mydata,2)
mydata <- mydata[,!duplicated (mydata)]
## then do the indexing...

I would like to know if this could be done with a short command or some more advanced function. I'm learning how to make use of the powerful tools in the R language that avoid such long, unnecessary commands.

I was thinking of something like

mydata <- mydata[, which(apply(mydata, 2, function(x) !duplicated(round(cor(x),2))))] 

Sorry, I know the above command doesn't work, but I hope something along these lines is possible.

Some toy data that applies to the question:

mydata <- structure(list(V1 = c(1L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L,  78L, 687L, 378L, 378L, 34L, 53L, 43L), V2 = c(2L, 2L, 5L, 4L,  366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L,  41L), V3 = c(10L, 20L, 10L, 20L, 10L, 20L, 1L, 0L, 1L, 2010L,  20L, 10L, 10L, 10L, 10L, 10L), V4 = c(2L, 10L, 31L, 2L, 2L, 5L,  2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 1L), V5 = c(4L, 10L, 31L,  2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 3L)), .Names = c("V1",  "V2", "V3", "V4", "V5"), class = "data.frame", row.names = c(NA,  -16L)) 

Many thanks

Error404 asked Aug 16 '13 14:08



1 Answer

I'm sure there are many ways to do this, and certainly some better than this, but this should work. I basically set the upper triangle and the diagonal of the correlation matrix to zero and then remove any columns that have absolute values over 0.99.

data <- mydata  # the data frame from the question
tmp <- cor(data)
tmp[upper.tri(tmp)] <- 0
diag(tmp) <- 0
# Above two commands can be replaced with
# tmp[!lower.tri(tmp)] <- 0
data.new <-
  data[, !apply(tmp, 2, function(x) any(abs(x) > 0.99, na.rm = TRUE))]
head(data.new)
   V2 V3 V5
1   2 10  4
2   2 20 10
3   5 10 31
4   4 20  2
5 366 10  2
6  65 20  5
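The approach above could also be wrapped in a small helper so that the cutoff becomes a parameter. This is only a sketch in base R: the function name `drop_correlated` and its `cutoff` argument are made up here for illustration and do not come from any package, and it assumes every column of the data frame is numeric.

```r
# Hypothetical helper (not from any package): for each pair of columns whose
# absolute correlation exceeds `cutoff`, drop the earlier column of the pair.
drop_correlated <- function(df, cutoff = 0.99) {
  cm <- cor(df)                 # assumes all columns are numeric
  cm[!lower.tri(cm)] <- 0       # zero the upper triangle and the diagonal
  # A column is flagged if it is highly correlated with any later column
  too_similar <- apply(cm, 2, function(x) any(abs(x) > cutoff, na.rm = TRUE))
  df[, !too_similar, drop = FALSE]  # drop = FALSE keeps a data frame
}

# Toy data from the question
mydata <- data.frame(
  V1 = c(1L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L,
         378L, 34L, 53L, 43L),
  V2 = c(2L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L,
         378L, 34L, 53L, 41L),
  V3 = c(10L, 20L, 10L, 20L, 10L, 20L, 1L, 0L, 1L, 2010L, 20L, 10L,
         10L, 10L, 10L, 10L),
  V4 = c(2L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L,
         2L, 1L),
  V5 = c(4L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L,
         2L, 3L)
)

reduced <- drop_correlated(mydata)
names(reduced)
```

On the toy data this keeps V2, V3 and V5, the same columns as the output shown above. Note that `cor()` on 6592 columns builds a 6592 x 6592 matrix, so memory use on the full data is worth checking before relying on this.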
David answered Oct 07 '22 12:10