I'm using R. My dataset has about 40 different variables/vectors, each with about 80 entries. I'm trying to find significant correlations: I want to pick one variable and let R calculate the correlations of that variable with each of the other 39 variables.
I tried to do this with a linear model with one explanatory variable, i.e. Y = a*X + b. The lm() command then gives me an estimate of a and a p-value for that estimate. I would then use another of my variables as X and try again, until I find a p-value that is really small.
I'm sure this is a common problem. Is there a package or function that can try all these possibilities (brute force), show them, and maybe even sort them by p-value?
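A minimal sketch of what I mean, assuming my data sits in a data frame called dat with the response in a column named y (both are placeholder names):
# regress y on each other column separately and collect the slope p-values
predictors <- setdiff(names(dat), "y")
pvals <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "y"), data = dat)
  summary(fit)$coefficients[2, "Pr(>|t|)"]  # p-value of the slope estimate
})
sort(pvals)  # smallest p-values first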
To determine whether a correlation between variables is significant, compare its p-value to your significance level. In most research the threshold for statistical significance is a p-value of 0.05 or below; this threshold is called the significance level α (alpha). An α of 0.05 means that the risk of concluding that a correlation exists when, actually, no correlation exists is 5%. So set α = 0.05 and compare each p-value against it.
Correlation coefficients whose magnitude is between 0.5 and 0.7 indicate variables that can be considered moderately correlated; magnitudes between 0.3 and 0.5 indicate a low correlation.
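For a single pair of variables you can get both the coefficient and its p-value from cor.test() and compare the p-value to α. A sketch, where the x and y vectors are made-up demo data:
# test one pair of variables at the 5% significance level
set.seed(1)
x <- rnorm(80)
y <- 0.4 * x + rnorm(80)   # made-up data with a real correlation
result <- cor.test(x, y)   # Pearson correlation test
result$estimate            # correlation coefficient r
result$p.value             # compare this to alpha = 0.05
result$p.value < 0.05      # TRUE means significant at the 5% level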
You can use the function rcorr from the package Hmisc.
Using the same demo data from @Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Then:
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
To access the p-values:
correlations$P
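If you only care about one variable against all the others (as in the question), you can, for example, take just the row for y and sort it. A sketch using the correlations object from above:
# p-values of y against every other variable, smallest first
p_y <- correlations$P["y", colnames(correlations$P) != "y"]
sort(p_y)
# the matching correlation coefficients, in the same order
correlations$r["y", names(sort(p_y))]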
To visualize the correlations, you can use the package corrgram:
library(corrgram)
corrgram(the_data)
This produces a correlogram of all pairwise correlations.
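corrgram() also accepts panel functions; the sketch below uses panel.shade and panel.pie, which to my knowledge ship with the corrgram package, plus order = TRUE to group correlated variables together:
library(corrgram)
corrgram(the_data,
         order = TRUE,              # reorder variables by correlation structure
         lower.panel = panel.shade, # shaded boxes in the lower triangle
         upper.panel = panel.pie,   # pie glyphs in the upper triangle
         main = "Correlogram of the_data")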
To print a list of the significant correlations (p < 0.05), you can use the following.
Using the same demo data from @Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Install Hmisc
install.packages("Hmisc")
Import the library and find the correlations (@Carlos):
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
Loop over the values, printing the significant correlations:
for (i in 1:(m - 1)) {
  for (j in (i + 1):m) {       # upper triangle only, so each pair is printed once
    p <- correlations$P[i, j]
    if (!is.na(p) && p < 0.05) {
      print(paste(rownames(correlations$P)[i], "-",
                  colnames(correlations$P)[j], ":", p))
    }
  }
}
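Alternatively, here is a sketch that collects the same information into a data frame sorted by p-value instead of printing it (using the same correlations object as above):
# significant pairs as a data frame, sorted by p-value
idx <- which(correlations$P < 0.05, arr.ind = TRUE)   # indices of small p-values
idx <- idx[idx[, 1] < idx[, 2], , drop = FALSE]       # keep each pair once
sig <- data.frame(
  var1 = rownames(correlations$P)[idx[, 1]],
  var2 = colnames(correlations$P)[idx[, 2]],
  r    = correlations$r[idx],
  p    = correlations$P[idx]
)
sig[order(sig$p), ]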
Warning
You should not use this to draw any serious conclusions; it is only useful for exploratory analysis and for formulating hypotheses. If you run enough tests, you increase the probability of finding some significant p-values by random chance: https://www.xkcd.com/882/. There are statistical methods that are more suitable for this and that make adjustments to compensate for running multiple tests, e.g. https://en.wikipedia.org/wiki/Bonferroni_correction.
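As a sketch, base R's p.adjust() can apply such a correction to the rcorr() p-values before you declare anything significant:
# multiple-testing correction on the p-values from rcorr()
raw_p <- correlations$P[upper.tri(correlations$P)]  # each pair counted once
adj_p <- p.adjust(raw_p, method = "bonferroni")     # or "holm", "BH", ...
sum(adj_p < 0.05, na.rm = TRUE)                     # how many survive correction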