
How do I calculate the correlation between corresponding columns of two matrices without getting the other correlations as output?

Tags: r, correlation

I have these data:

> a
     a    b    c
1    1   -1    4
2    2   -2    6
3    3   -3    9
4    4   -4   12
5    5   -5    6

> b
     d    e    f
1    6   -5    7
2    7   -4    4
3    8   -3    3
4    9   -2    3
5   10   -1    9

> cor(a,b)
           d          e          f
a  1.0000000  1.0000000  0.1767767
b -1.0000000 -1.0000000 -0.1767767
c  0.5050763  0.5050763 -0.6964286

The result I want is just:

cor(a,d) = 1
cor(b,e) = -1
cor(c,f) = -0.6964286
rder asked Jul 15 '11 22:07





3 Answers

Using diag(cor(a,b)) calculates all pairwise correlations, which is fine unless the matrices are large, and mapply(cor,a,b) doesn't work on matrices (see the NA output below). As far as I can tell, an efficient computation must be done directly. The following code, borrowed from the arrayMagic Bioconductor package, works efficiently for large matrices:

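> colCors = function(x, y) {
+   sqr = function(x) x*x
+   if(!is.matrix(x)||!is.matrix(y)||any(dim(x)!=dim(y)))
+     stop("Please supply two matrices of equal size.")
+   # center each column by subtracting its mean
+   x   = sweep(x, 2, colMeans(x))
+   y   = sweep(y, 2, colMeans(y))
+   # per-column Pearson correlation: sum of cross-products divided by
+   # the square root of the product of the two sums of squares
+   cor = colSums(x*y) /  sqrt(colSums(sqr(x))*colSums(sqr(y)))
+   return(cor)
+ }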

> set.seed(1)
> a=matrix(rnorm(15),nrow=5)
> b=matrix(rnorm(15),nrow=5)
> diag(cor(a,b))
[1]  0.2491625 -0.5313192  0.5594564
> mapply(cor,a,b)
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
> colCors(a,b)
[1]  0.2491625 -0.5313192  0.5594564
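
The same idea can probably be written more compactly with scale(), which centers each column and divides it by its standard deviation. This is only a sketch, not code from arrayMagic; the name colCorsScaled is made up for illustration, and it assumes two numeric matrices of equal size with no constant columns:

```r
# Hypothetical helper (not from arrayMagic): column-wise Pearson correlation
# computed from standardized columns.
colCorsScaled <- function(x, y) {
  stopifnot(is.matrix(x), is.matrix(y), identical(dim(x), dim(y)))
  # scale() returns z-scores using the n - 1 standard deviation, so the
  # column sums of their products divided by n - 1 are the correlations
  colSums(scale(x) * scale(y)) / (nrow(x) - 1)
}

set.seed(1)
a <- matrix(rnorm(15), nrow = 5)
b <- matrix(rnorm(15), nrow = 5)

# should agree with the diagonal of the full correlation matrix
same_as_diag <- isTRUE(all.equal(colCorsScaled(a, b), diag(cor(a, b))))
print(same_as_diag)
```

Like colCors, this never builds the full matrix of cross-correlations, so memory use grows with the number of columns rather than with its square.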
user1048410 answered Oct 11 '22 12:10



Personally, I would probably just use diag:

> diag(cor(a,b))
[1]  1.0000000 -1.0000000 -0.6964286

But you could also use mapply:

> mapply(cor,a,b)
         a          b          c 
 1.0000000 -1.0000000 -0.6964286
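
If you want something that behaves the same for matrices and data frames, looping over the column indices is another option. The following is a rough sketch rather than a recommendation; the data frames are rebuilt by hand from the values printed in the question, and the variable name pairwise is just illustrative:

```r
# Data from the question, reconstructed as data frames (assumed layout)
a <- data.frame(a = 1:5, b = -(1:5), c = c(4, 6, 9, 12, 6))
b <- data.frame(d = 6:10, e = -5:-1, f = c(7, 4, 3, 3, 9))

# Correlate column i of a with column i of b, one pair at a time.
# a[, i] returns a plain vector for both data frames and matrices.
pairwise <- vapply(seq_len(ncol(a)),
                   function(i) cor(a[, i], b[, i]),
                   numeric(1))
names(pairwise) <- paste(names(a), names(b), sep = ",")
print(pairwise)
```

This gives the three values asked for in the question (1, -1 and about -0.6964286), labelled by the column pair they come from.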
Joshua Ulrich answered Oct 11 '22 12:10



mapply(cor,a,b) works with data frames but not with matrices. That is because mapply iterates over the elements of its arguments: in a data frame each column is an element, while in a matrix each individual entry is an element.
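
A quick way to see the difference, as a small sketch with a made-up 3 x 2 matrix (the names m and df are only for illustration):

```r
m  <- matrix(1:6, nrow = 3)   # toy matrix with 3 rows and 2 columns
df <- as.data.frame(m)

n_matrix_elements <- length(m)    # counts every entry of the matrix
n_frame_elements  <- length(df)   # counts the columns of the data frame
print(c(matrix = n_matrix_elements, data.frame = n_frame_elements))

# On matrices, mapply therefore hands cor() one number from each matrix at a
# time, and a single pair of numbers has no defined correlation.
single_pair <- cor(1, 2)
print(is.na(single_pair))
```

That is why the matrix version returns one missing value per entry instead of one correlation per column.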

With the example from the answer above, mapply(cor,as.data.frame(a),as.data.frame(b)) works just fine:

> set.seed(1)
> a=matrix(rnorm(15),nrow=5)
> b=matrix(rnorm(15),nrow=5)
> diag(cor(a,b))
[1]  0.2491625 -0.5313192  0.5594564
> mapply(cor,as.data.frame(a),as.data.frame(b))
        V1         V2         V3 
 0.2491625 -0.5313192  0.5594564 

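For matrices with many columns, this is much more efficient than diag(cor(a,b)), because it computes only the matching column pairs instead of the full matrix of cross-correlations.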
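
If you want to check this on your own data, something along these lines should do. It is only a sketch: the dimensions n and p are arbitrary values picked for illustration, and the timings depend on your machine and on the shape of the matrices, so no particular result is implied:

```r
set.seed(1)
n <- 50    # rows (observations), arbitrary
p <- 500   # columns (variables), arbitrary
a <- matrix(rnorm(n * p), nrow = n)
b <- matrix(rnorm(n * p), nrow = n)

# Full p x p cross-correlation matrix, then keep only the diagonal
t_full <- system.time(r_full <- diag(cor(a, b)))

# Only the p matching column pairs
t_cols <- system.time(
  r_cols <- mapply(cor, as.data.frame(a), as.data.frame(b))
)

print(t_full)
print(t_cols)

# Both approaches should return the same correlations (names aside)
agree <- isTRUE(all.equal(r_full, unname(r_cols)))
print(agree)
```

Increasing p makes the difference in memory use more visible, since the intermediate matrix in the first approach has p * p entries.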

Cão answered Oct 11 '22 11:10
