How to compute correlations between all columns in R and detect highly correlated variables

Tags: r, correlation
I have a big dataset with 100 variables and 3000 observations. I want to detect the variables (columns) that are highly correlated or redundant, so that I can reduce the dimensionality of the data frame. I tried the code below, but it only computes the correlation between each column and one fixed column (column 2), and I always get a warning message:

for(i in 1:ncol(predicteurs)){
correlations <- cor(predicteurs[,i],predicteurs[,2])
names(correlations[which.max(abs(correlations))])
}

Warning messages:
1: In cor(predicteurs[, i], predicteurs[, 2]) :
  the standard deviation is zero
2: In cor(predicteurs[, i], predicteurs[, 2]) :
  the standard deviation is zero

Can anyone help me?

Charlotte asked Mar 09 '14 13:03



1 Answer

Updated for newer tidyverse packages.

I would compute the full correlation matrix in one call and gather it into long format.

# install.packages(c('tibble', 'dplyr', 'tidyr', 'purrr'))
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)  # provides map_chr(), used in the de-duplication step below

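# example data: three independent random columns
d <- data.frame(x1=rnorm(10),
                x2=rnorm(10),
                x3=rnorm(10))

# correlation matrix reshaped to long format: one row per (var1, var2) pair
d2 <- d %>%
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1)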

  var1 var2       value
1   x1   x1  1.00000000
2   x1   x2 -0.05936703
3   x1   x3 -0.37479619
4   x2   x1 -0.05936703
5   x2   x2  1.00000000
6   x2   x3  0.43716004
7   x3   x1 -0.37479619
8   x3   x2  0.43716004
9   x3   x3  1.00000000

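# .5 is an arbitrary cutoff; abs() also catches strong negative correlations,
# and var1 != var2 skips each variable's correlation with itself
filter(d2, var1 != var2, abs(value) > .5)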

# remove duplicates: (x1, x2) and (x2, x1) are the same pair, so build an
# order-independent key and keep only the first row seen for each key
d2 %>%
  mutate(var_order = paste(var1, var2) %>%
           strsplit(split = ' ') %>%
           purrr::map_chr( ~ sort(.x) %>%
                      paste(collapse = ' '))) %>%
  mutate(cnt = 1) %>%
  group_by(var_order) %>%
  mutate(cumsum = cumsum(cnt)) %>%
  filter(cumsum != 2) %>%
  ungroup %>%
  select(-var_order, -cnt, -cumsum)

  var1  var2   value
1 x1    x1     1     
2 x1    x2    -0.0594
3 x1    x3    -0.375 
4 x2    x2     1     
5 x2    x3     0.437 
6 x3    x3     1     
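
For comparison, here is a rough base R sketch of the same idea that also deals with the "standard deviation is zero" warning from the question. That warning usually means at least one column is constant, so its correlation with anything is undefined. The toy data, the `const` column, and the 0.9 cutoff are all made up for illustration; adjust them to your own data.

```r
# Toy data (assumed for illustration): x4 is nearly a copy of x1,
# and `const` never varies.
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
d$x4 <- d$x1 + rnorm(50, sd = 0.1)
d$const <- 5

# Drop columns whose standard deviation is zero or undefined; these are
# what make cor() warn "the standard deviation is zero".
sds <- vapply(d, sd, numeric(1), na.rm = TRUE)
keep <- !is.na(sds) & sds > 0
d_var <- d[, keep, drop = FALSE]

# Full correlation matrix in one call, no loop needed.
cm <- cor(d_var)

# Look only above the diagonal so each pair appears once and the
# self-correlations are skipped.
cutoff <- 0.9  # arbitrary threshold, chosen for this example
idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
high <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                   var2 = colnames(cm)[idx[, "col"]],
                   value = cm[idx])
high
```

Each row of `high` is one pair of columns whose absolute correlation exceeds the cutoff; which member of a pair to drop is still a judgment call. If the caret package is an option, its `findCorrelation()` function takes a correlation matrix and a cutoff and suggests columns to remove, which may save some of that work.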
maloneypatr answered Oct 25 '22 06:10