
How to calculate correlation of two variables in a huge data set in R?

Tags: r, correlation

I've got a huge data set with six columns (call them A, B, C, D, E, F) and about 450,000 rows. I simply tried to find the correlation between columns A and B:

cor(A, B)

and I got

[1] NA

as a result. What can I do to fix this problem?

asked Sep 26 '11 06:09 by vieplivee


2 Answers

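Try cor(A, B, use = "pairwise.complete.obs"). By default, cor() returns NA if either variable contains a missing value; this option computes the correlation from only those rows where both A and B are present.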
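A minimal sketch of the behaviour, using small made-up vectors as stand-ins for the real columns A and B (the values are purely illustrative). With the default use = "everything", a single NA in either vector is enough to make the result NA:

```r
# Toy stand-ins for columns A and B (made-up values, not the asker's data)
A <- c(1, 2, 3, 4, NA, 6)
B <- c(2, 4, 5, NA, 10, 13)

# Default behaviour (use = "everything"): any NA propagates to the result
default_result <- cor(A, B)
print(default_result)

# Drop every pair in which A or B is missing, then correlate the rest
pairwise_result <- cor(A, B, use = "pairwise.complete.obs")
print(pairwise_result)

# The same thing by hand: keep only positions where both values are present
keep <- complete.cases(A, B)
manual_result <- cor(A[keep], B[keep])
print(manual_result)
```

For two plain vectors, the pairwise option and the manual filtering should agree, since both use exactly the rows where neither value is missing.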

To be statistically rigorous, you should also check how many entries are missing from your data and whether the missing-at-random assumption holds.
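One way to get a first look at the missing entries is sketched below. The data frame df and its values are hypothetical; substitute your own data. Note that this only describes the missingness and does not by itself test the missing-at-random assumption:

```r
# Toy data frame standing in for the real one (hypothetical values)
df <- data.frame(
  A = c(1, 2, NA, 4, 5, NA),
  B = c(2, NA, 6, 8, 10, 12)
)

# Number of missing entries per column
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Number of rows a pairwise-complete correlation of A and B would use
n_used <- sum(complete.cases(df$A, df$B))
print(n_used)

# Cross-tabulate missingness: do NAs in A and B tend to occur together?
missing_table <- table(A_missing = is.na(df$A), B_missing = is.na(df$B))
print(missing_table)
```

If n_used is much smaller than the number of rows, or the missing values cluster in particular parts of the data, the pairwise-complete correlation may not be representative of the full data set.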

Edit 1: Take a look at ?cor to see other options for the use parameter.

answered Oct 05 '22 20:10 by Iterator


You might consider using the rcorr function in the Hmisc package.

It is very fast and uses only pairwise complete observations. The returned object contains three matrices:

  1. the correlation coefficients
  2. the number of observations used for each correlation
  3. the p-value for each correlation
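As a rough illustration of those three matrices, here is a small sketch on made-up data. It assumes the Hmisc package is installed and is skipped otherwise; the matrix m and its column names are invented for the example:

```r
# Requires the Hmisc package; the sketch is skipped if it is not installed
if (requireNamespace("Hmisc", quietly = TRUE)) {
  # Toy numeric matrix with a few missing values (made-up numbers)
  set.seed(1)
  m <- matrix(rnorm(60), ncol = 3, dimnames = list(NULL, c("A", "B", "C")))
  m[c(2, 7), "A"] <- NA
  m[5, "B"] <- NA

  res <- Hmisc::rcorr(m, type = "pearson")

  print(res$r)  # matrix of correlation coefficients
  print(res$n)  # number of observations used for each pair
  print(res$P)  # p-value for each correlation (NA on the diagonal)
}
```

The n matrix is worth checking first: it shows how many rows actually went into each coefficient after the pairwise deletion.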

Some example code is available here:

answered Oct 05 '22 20:10 by Iain