Remove highly correlated variables

Tags: function, r, subset, correlation

I have a huge data frame (5600 x 6592) and I want to remove any variables that are correlated with each other by more than 0.99. I know how to do this the long way, step by step: form a correlation matrix, round the values, remove the similar ones, and use indexing to get my "reduced" data back.

    cor(mydata)
    mydata <- round(mydata, 2)
    mydata <- mydata[, !duplicated(mydata)]
    ## then do the indexing...

I would like to know whether this can be done with a short command or some more advanced function. I'm learning how to make use of the powerful tools in the R language, which avoid such long, unnecessary commands.

I was thinking of something like:

mydata <- mydata[, which(apply(mydata, 2, function(x) !duplicated(round(cor(x),2))))]

Sorry, I know the above command doesn't work, but I hope something along these lines is possible.

Some toy data that applies to the question:

    mydata <- structure(list(
      V1 = c(1L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 43L),
      V2 = c(2L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 41L),
      V3 = c(10L, 20L, 10L, 20L, 10L, 20L, 1L, 0L, 1L, 2010L, 20L, 10L, 10L, 10L, 10L, 10L),
      V4 = c(2L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 1L),
      V5 = c(4L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 3L)
    ), .Names = c("V1", "V2", "V3", "V4", "V5"), class = "data.frame", row.names = c(NA, -16L))

Many thanks

883

asked Aug 16 '13 14:08

Error404

1 Answer

I'm sure there are many ways to do this, and certainly some better than this, but this should work. I set the upper triangle and the diagonal of the correlation matrix to zero and then remove any columns that contain a value over 0.99 in absolute value.

    # mydata is the toy data from the question
    tmp <- cor(mydata)
    tmp[upper.tri(tmp)] <- 0
    diag(tmp) <- 0
    # The two commands above can be replaced with
    # tmp[!lower.tri(tmp)] <- 0

    data.new <- mydata[, !apply(tmp, 2, function(x) any(abs(x) > 0.99, na.rm = TRUE))]
    head(data.new)
    #    V2 V3 V5
    # 1   2 10  4
    # 2   2 20 10
    # 3   5 10 31
    # 4   4 20  2
    # 5 366 10  2
    # 6  65 20  5
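If the same filtering is needed more than once, the lower-triangle idea can be wrapped in a small function. The following is only a sketch in base R: the name `drop_correlated` and its `cutoff` argument are made up for illustration and are not part of the original answer. It rebuilds the toy data from the question so that it runs on its own.

```r
# Hypothetical helper (not from the original answer): wraps the
# lower-triangle approach and makes the threshold adjustable.
drop_correlated <- function(df, cutoff = 0.99) {
  cm <- cor(df)                # pairwise Pearson correlations between columns
  cm[!lower.tri(cm)] <- 0      # zero the diagonal and the upper triangle
  # Flag a column if it correlates above the cutoff with any LATER column,
  # so the last column of each highly correlated group is the one kept.
  flagged <- apply(cm, 2, function(x) any(abs(x) > cutoff, na.rm = TRUE))
  df[, !flagged, drop = FALSE] # drop = FALSE keeps a data frame even if one column remains
}

# Toy data from the question
mydata <- data.frame(
  V1 = c(1L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 43L),
  V2 = c(2L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 41L),
  V3 = c(10L, 20L, 10L, 20L, 10L, 20L, 1L, 0L, 1L, 2010L, 20L, 10L, 10L, 10L, 10L, 10L),
  V4 = c(2L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 1L),
  V5 = c(4L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 3L)
)

reduced <- drop_correlated(mydata)
print(names(reduced))  # V2, V3 and V5 remain, as in the output shown above
```

Because only the lower triangle is inspected, each pair is checked once and the earlier column of a highly correlated pair is the one removed (V1 in favour of V2, V4 in favour of V5). For a data frame with 6592 columns, `cor()` returns a 6592 x 6592 matrix of doubles, roughly 350 MB, so memory is worth keeping in mind at that scale.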

179

answered Oct 07 '22 12:10

David
