How to calculate correlation of two variables in a huge data set in R?

Question

I've got a huge data set with six columns (call them A, B, C, D, E, F) and about 450,000 rows. I simply tried to find the correlation between columns A and B:

cor(A, B)

and I got

[1] NA

as a result. What can I do to fix this problem?

Iterator · Accepted Answer

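Try cor(A, B, use = "pairwise.complete.obs"). By default, cor returns NA as soon as either vector contains a missing value; with this option the correlation is computed only from the observations where both A and B are present.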
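A minimal sketch of the behaviour described above, using simulated vectors (the data, the seed, and the variable names r_default, r_pairwise, and r_manual are made up for illustration and are not from the question):

```r
# Simulated stand-ins for columns A and B (not the asker's data).
set.seed(1)
n <- 1000
A <- rnorm(n)
B <- 0.5 * A + rnorm(n)

# Introduce a few missing values.
A[c(3, 10)] <- NA
B[c(7, 10)] <- NA

# Default (use = "everything"): any NA in the inputs gives an NA result.
r_default <- cor(A, B)

# Drop the pairs where either value is missing.
r_pairwise <- cor(A, B, use = "pairwise.complete.obs")

# Equivalent manual filtering, shown only to make the behaviour explicit.
ok <- !is.na(A) & !is.na(B)
r_manual <- cor(A[ok], B[ok])

print(r_default)
print(r_pairwise)
```

For two vectors, "pairwise.complete.obs" and "complete.obs" select the same observations; they only differ when cor is given a matrix or data frame with more than two columns.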

To be statistically rigorous, you should also check how many entries are missing and whether the missing-at-random assumption holds.
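One way to get a first look at the amount and pattern of missingness, sketched on a tiny made-up data frame (df and its values are hypothetical). This only describes the missingness; it does not by itself test the missing-at-random assumption:

```r
# Hypothetical toy data; in practice df would be the real data set.
df <- data.frame(
  A = c(1.2, NA, 3.1, 4.8, NA, 6.0),
  B = c(2.0, 2.5, NA, 5.1, NA, 6.3)
)

# Number of missing entries in each column.
na_per_column <- colSums(is.na(df))

# Number of rows where both A and B are present (the rows cor would use).
n_complete <- sum(complete.cases(df[, c("A", "B")]))

# Cross-tabulate missingness in A against missingness in B.
miss_table <- table(A_missing = is.na(df$A), B_missing = is.na(df$B))

print(na_per_column)
print(n_complete)
print(miss_table)
```

If the missing values cluster in particular rows or depend on the values of other columns, dropping them may bias the correlation.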

Edit 1: Take a look at ?cor to see other options for the use parameter.
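As a rough illustration of how some of the use options differ, here is a sketch on a small made-up matrix (m and its values are hypothetical). With more than two columns, "complete.obs" drops every row that has an NA anywhere, while "pairwise.complete.obs" drops rows separately for each pair of columns:

```r
# Hypothetical three-column matrix with one NA in column C.
m <- cbind(
  A = c(1, 2, 3, 4, 5, 6),
  B = c(2, 1, 4, 3, 6, 5),
  C = c(NA, 1, 2, 2, 4, 7)
)

# Row 1 is dropped for every pair, because C is missing there.
c_complete <- cor(m, use = "complete.obs")

# Row 1 is dropped only for pairs that involve C.
c_pairwise <- cor(m, use = "pairwise.complete.obs")

# "all.obs" treats the presence of any NA as an error.
all_obs_fails <- inherits(try(cor(m, use = "all.obs"), silent = TRUE), "try-error")

print(c_complete["A", "B"])
print(c_pairwise["A", "B"])
print(all_obs_fails)
```

The A-B correlation differs between the two results because "complete.obs" computes it from five rows and "pairwise.complete.obs" from all six.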

Iain · Answer

You might consider using the rcorr function in the Hmisc package.

It is very fast and uses only pairwise complete observations. The returned object contains three matrices: the correlation coefficients (r), the number of observations used for each coefficient (n), and a p-value for each correlation (P).
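A short sketch of calling rcorr on simulated data, assuming the Hmisc package is installed (the data, the seed, and the name res are made up for illustration; the call is skipped if Hmisc is not available):

```r
# Runs only when Hmisc is installed: install.packages("Hmisc")
if (requireNamespace("Hmisc", quietly = TRUE)) {
  # Simulated stand-ins for columns A and B (not the asker's data).
  set.seed(42)
  n <- 1000
  A <- rnorm(n)
  B <- 0.3 * A + rnorm(n)
  A[sample(n, 50)] <- NA  # 50 missing values in A

  # rcorr expects a numeric matrix; missing values are deleted pairwise.
  res <- Hmisc::rcorr(cbind(A, B), type = "pearson")

  print(res$r["A", "B"])  # correlation coefficient
  print(res$n["A", "B"])  # number of observations used
  print(res$P["A", "B"])  # p-value
}
```

Passing type = "spearman" instead requests rank correlations.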

Some example code is available here:
