Speeding up correlation matrix calculation in R

Tags: r, correlation
I have a data frame with 49 variables and 4M rows, and I want to calculate its 49 x 49 correlation matrix. All columns are of class numeric.

Here's a sample:

df <- data.frame(replicate(49, sample(0:50, 4000000, replace = TRUE)))

I used the standard cor function.

cor_matrix <- cor(df, use = "pairwise.complete.obs")

This is taking a really long time. I have 16 GB of RAM and a single-core 2.60 GHz i5.

Is there a way to make this calculation faster on my desktop?

asked Mar 21 '16 16:03 by vagabond


1 Answer

There's a faster version of the cor function in the WGCNA package (used for inferring gene networks based on correlations). On my 3.1 GHz i7 with 16 GB of RAM it computes the same 49 x 49 matrix roughly 18x faster (about 2.3 s versus 40.4 s elapsed):

mat <- replicate(49, as.numeric(sample(0:50, 4000000, replace = TRUE)))

system.time(
    cor_matrix <- cor(mat, use = "pairwise.complete.obs")
)
   user  system elapsed 
 40.391   0.017  40.396 

system.time(
    cor_matrix_w <- WGCNA::cor(mat, use = "pairwise.complete.obs")
)
   user  system elapsed 
  1.822   0.468   2.290 

all.equal(cor_matrix, cor_matrix_w)
[1] TRUE

Check the help file for WGCNA::cor for details on how its results can differ from those of the base cor function when your data contain many missing observations.
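If the data have no missing values at all, as in the simulated sample above, another option is to compute the Pearson correlation matrix in base R with a single matrix cross-product. The sketch below is not what WGCNA does internally; fast_cor is a hypothetical helper name, and it assumes a numeric matrix with no NA values and no constant columns. It is checked against cor on a small matrix rather than the full 4M rows.

```r
# Hypothetical helper (not from any package): Pearson correlation matrix via
# one cross-product. Assumes numeric input, no NA values, no constant columns.
fast_cor <- function(x) {
  x <- as.matrix(x)
  stopifnot(is.numeric(x), !anyNA(x))
  # Centre each column on its mean
  xc <- sweep(x, 2, colMeans(x))
  # Scale each centred column to unit Euclidean length
  xc <- sweep(xc, 2, sqrt(colSums(xc^2)), "/")
  # t(xc) %*% xc now holds the pairwise correlations
  crossprod(xc)
}

set.seed(1)
small <- replicate(5, as.numeric(sample(0:50, 1000, replace = TRUE)))

# Compare against the base implementation
all.equal(fast_cor(small), cor(small))
```

How much this helps depends on the BLAS library R is linked against, and the centring and scaling steps make temporary copies of the matrix, so memory use is higher than for cor.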

answered Nov 05 '22 11:11 by Lorenz D