I have a huge data frame (5600 x 6592) and I want to remove any variables that are correlated with each other at more than 0.99. I know how to do this the long way, step by step: form a correlation matrix, round the values, remove the similar ones, and use indexing to get my "reduced" data again.
cor(mydata)
mydata <- round(mydata,2)
mydata <- mydata[,!duplicated (mydata)]
## then do the indexing...
I would like to know whether this can be done in a short command or with some more advanced function. I'm learning how to use the powerful tools in the R language that avoid such long, unnecessary commands.
I was thinking of something like
mydata <- mydata[, which(apply(mydata, 2, function(x) !duplicated(round(cor(x),2))))]
Sorry, I know the above command doesn't work, but I hope something like it is possible.
Some toy data that applies to the question:
mydata <- structure(list(V1 = c(1L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 43L), V2 = c(2L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 41L), V3 = c(10L, 20L, 10L, 20L, 10L, 20L, 1L, 0L, 1L, 2010L, 20L, 10L, 10L, 10L, 10L, 10L), V4 = c(2L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 1L), V5 = c(4L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 3L)), .Names = c("V1", "V2", "V3", "V4", "V5"), class = "data.frame", row.names = c(NA, -16L))
Many thanks
In a more general situation, when you have two independent variables that are very highly correlated, you should remove one of them. Otherwise you run into multicollinearity, and the regression coefficients for the two highly correlated variables will be unreliable.
Hi Yong, PCA is a way to deal with highly correlated variables, so there is no need to remove them. If N variables are highly correlated, then they will all load on the SAME principal component (eigenvector), not different ones.
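To make the PCA comment concrete, here is a minimal sketch using the toy data from the question. It assumes base R's `prcomp` with scaling turned on; the object names `pca` and `loadings` are my own. It only prints the loadings so you can check for yourself whether the near-duplicate columns (V1/V2 and V4/V5) get nearly identical weights; it is not a claim about what PCA will do on the full 5600 x 6592 data.

```r
# Toy data copied from the question
mydata <- data.frame(
  V1 = c(1, 2, 5, 4, 366, 65, 43, 456, 876, 78, 687, 378, 378, 34, 53, 43),
  V2 = c(2, 2, 5, 4, 366, 65, 43, 456, 876, 78, 687, 378, 378, 34, 53, 41),
  V3 = c(10, 20, 10, 20, 10, 20, 1, 0, 1, 2010, 20, 10, 10, 10, 10, 10),
  V4 = c(2, 10, 31, 2, 2, 5, 2, 5, 1, 52, 1, 2, 52, 6, 2, 1),
  V5 = c(4, 10, 31, 2, 2, 5, 2, 5, 1, 52, 1, 2, 52, 6, 2, 3)
)

# PCA on standardized columns (equivalent to working on the correlation matrix)
pca <- prcomp(mydata, scale. = TRUE)

# Rows are the original variables, columns are the principal components
loadings <- pca$rotation

# Compare the V1 and V2 rows (and the V4 and V5 rows) in the leading components
print(round(loadings[, 1:2], 3))

# Share of variance explained by each component
print(summary(pca)$importance)
```

Whether you then keep the components or go back and drop columns depends on whether you need the original variables to stay interpretable.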
You can't "remove" a correlation. That's like saying your data analytic plan will remove the relationship between sunrise and the lightening of the sky. I think your problem is that you are using predictors that are highly correlated with one another.
I'm sure there are many ways to do this, and certainly some better than this, but this should work. I basically just set the upper triangle and the diagonal of the correlation matrix to zero and then remove any columns that have values over 0.99.
tmp <- cor(data)
tmp[upper.tri(tmp)] <- 0
diag(tmp) <- 0
# Above two commands can be replaced with
# tmp[!lower.tri(tmp)] <- 0
data.new <- data[, !apply(tmp, 2, function(x) any(abs(x) > 0.99, na.rm = TRUE))]
head(data.new)
   V2 V3 V5
1   2 10  4
2   2 20 10
3   5 10 31
4   4 20  2
5 366 10  2
6  65 20  5
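If you want this as a single call, the same steps can be wrapped in a small helper. This is only a sketch of the approach above: the function name `drop_correlated` and its `cutoff` argument are made up for illustration, and it assumes every column is numeric so that `cor()` works. It is run here on the toy data from the question.

```r
# Toy data copied from the question
mydata <- data.frame(
  V1 = c(1, 2, 5, 4, 366, 65, 43, 456, 876, 78, 687, 378, 378, 34, 53, 43),
  V2 = c(2, 2, 5, 4, 366, 65, 43, 456, 876, 78, 687, 378, 378, 34, 53, 41),
  V3 = c(10, 20, 10, 20, 10, 20, 1, 0, 1, 2010, 20, 10, 10, 10, 10, 10),
  V4 = c(2, 10, 31, 2, 2, 5, 2, 5, 1, 52, 1, 2, 52, 6, 2, 1),
  V5 = c(4, 10, 31, 2, 2, 5, 2, 5, 1, 52, 1, 2, 52, 6, 2, 3)
)

# Hypothetical helper (not from any package): drops each column whose absolute
# correlation with a LATER column exceeds `cutoff`, so the last column of each
# highly correlated group is the one that survives.
drop_correlated <- function(df, cutoff = 0.99) {
  cm <- cor(df)
  # Keep only the strict lower triangle so each pair is looked at once
  cm[!lower.tri(cm)] <- 0
  keep <- !apply(cm, 2, function(x) any(abs(x) > cutoff, na.rm = TRUE))
  # drop = FALSE keeps a data frame even if only one column survives
  df[, keep, drop = FALSE]
}

reduced <- drop_correlated(mydata, cutoff = 0.99)
print(names(reduced))
```

Note that which member of a correlated pair gets dropped is just a side effect of column order; if you care which one is kept, reorder the columns first.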