I have a data frame with 49 variables and 4M rows, and I want to compute its 49 x 49 correlation matrix. All columns are of class numeric.
Here's a sample:
df <- data.frame(replicate(49, sample(0:50, 4000000, replace = TRUE)))
I used the standard cor function.
cor_matrix <- cor(df, use = "pairwise.complete.obs")
This takes a really long time. My machine has 16 GB of RAM and a single-core 2.60 GHz i5.
Is there a way to make this calculation faster on my desktop?
There's a faster version of the cor function in the WGCNA package (used for inferring gene networks based on correlations). On my 3.1 GHz i7 with 16 GB of RAM it computes the same 49 x 49 matrix roughly 18x faster (about 2.3 s elapsed versus 40.4 s):
mat <- replicate(49, as.numeric(sample(0:50, 4000000, replace = TRUE)))
system.time(
cor_matrix <- cor(mat, use = "pairwise.complete.obs")
)
user system elapsed
40.391 0.017 40.396
system.time(
cor_matrix_w <- WGCNA::cor(mat, use = "pairwise.complete.obs")
)
user system elapsed
1.822 0.468 2.290
all.equal(cor_matrix, cor_matrix_w)
[1] TRUE
See the help page for WGCNA::cor (?WGCNA::cor) for details on how the two versions differ when your data contain more missing observations.
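If the data have no missing values at all, as in the sample above, a dependency-free alternative is to compute the Pearson correlations with a single matrix product. The sketch below is not what WGCNA does internally; `fast_cor` is a hypothetical helper name, and it assumes a numeric matrix with no NAs and no constant columns. Whether it beats either cor on a given machine depends largely on the BLAS library R is linked against, and it was not timed here. It also makes a few full-size copies of the data, so watch memory at 4M rows.

```r
# Sketch only: Pearson correlation matrix via one matrix product.
# Assumes no NA values and no constant columns (hypothetical helper).
fast_cor <- function(m) {
  m <- as.matrix(m)
  stopifnot(is.numeric(m), !anyNA(m))
  centered <- sweep(m, 2, colMeans(m))        # subtract each column's mean
  norms <- sqrt(colSums(centered^2))          # Euclidean norm of each centered column
  stopifnot(all(norms > 0))                   # correlation is undefined for constant columns
  crossprod(sweep(centered, 2, norms, "/"))   # t(Z) %*% Z on unit-norm columns
}

# Check against base cor on a small matrix before trusting it on the full data
set.seed(42)
small <- replicate(5, as.numeric(sample(0:50, 1000, replace = TRUE)))
all.equal(fast_cor(small), cor(small))
```

The result should agree with cor up to floating-point tolerance. With missing values present this shortcut does not apply, since pairwise.complete.obs uses a different set of rows for each pair of columns.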