Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find significant correlations in a large dataset

Tags:

r

correlation

I'm using R. My dataset has about 40 different Variables/Vektors and each has about 80 entries. I'm trying to find significant correlations, that means I want to pick one variable and let R calculate all the correlations of that variable to the other 39 variables.

I tried to do this by using a linear modell with one explaining variable that means: Y=a*X+b. Then the lm() command gives me an estimator for a and p-value of that estimator for a. I would then go on and use one of the other variables I have for X and try again until I find a p-value thats really small.

I'm sure this is a common problem, is there some sort of package or function that can try all these possibilities (Brute force),show them and then maybe even sorts them by p-value?

like image 583
user3279779 Avatar asked Feb 06 '14 13:02

user3279779


People also ask

How do you determine significant correlation?

To determine whether the correlation between variables is significant, compare the p-value to your significance level. Usually, a significance level (denoted as α or alpha) of 0.05 works well. An α of 0.05 indicates that the risk of concluding that a correlation exists—when, actually, no correlation exists—is 5%.

How large should a correlation be statistically significant?

In most research the threshold to what we consider statistically significant is a p-value of 0.05 or below and it's called the significance level α. So we can set our significance level to 0.05 (α =0.05) and find the P-value.

Is 0.5 A large correlation?

Correlation coefficients whose magnitude are between 0.5 and 0.7 indicate variables which can be considered moderately correlated. Correlation coefficients whose magnitude are between 0.3 and 0.5 indicate variables which have a low correlation.


2 Answers

You can use the function rcorr from the package Hmisc.

Using the same demo data from Richie:

m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))

Then:

library(Hmisc)
correlations <- rcorr(as.matrix(the_data))

To access the p-values:

correlations$P

To visualize you can use the package corrgram

library(corrgram)
corrgram(the_data)

Which will produce: enter image description here

like image 138
Carlos Cinelli Avatar answered Nov 03 '22 23:11

Carlos Cinelli


In order to print a list of the significant correlations (p < 0.05), you can use the following.

  1. Using the same demo data from @Richie:

    m <- 40
    n <- 80
    the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
    colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
    
  2. Install Hmisc

    install.packages("Hmisc")
    
  3. Import library and find the correlations (@Carlos)

    library(Hmisc)
    correlations <- rcorr(as.matrix(the_data))
    
  4. Loop over the values printing the significant correlations

    for (i in 1:m){
      for (j in 1:m){
        if ( !is.na(correlations$P[i,j])){
          if ( correlations$P[i,j] < 0.05 ) {
            print(paste(rownames(correlations$P)[i], "-" , colnames(correlations$P)[j], ": ", correlations$P[i,j]))
          }
        }
      }
    }
    

Warning

You should not use this for drawing any serious conclusion; only useful for some exploratory analysis and formulate hypothesis. If you run enough tests, you increase the probability of finding some significant p-values by random chance: https://www.xkcd.com/882/. There are statistical methods that are more suitable for this and that do do some adjustments to compensate for running multiple tests, e.g. https://en.wikipedia.org/wiki/Bonferroni_correction.

like image 25
toto_tico Avatar answered Nov 04 '22 01:11

toto_tico