Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Distributed cross correlation matrix computation

How can I calculate pearson cross correlation matrix of large (>10TB) data set, possibly in distributed manner ? Any efficient distributed algorithm suggestion will be appreciated.

update: I read the implementation of apache spark mlib correlation

Pearson Computaation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
Covariance Computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala

but for me it looks like all the computation is happening at one node and it is not distributed in real sense.

Please put some light in here. I also tried executing it on a 3 node spark cluster and below are the screenshot:

Entire Computation timeline One the task details

As you can see from 2nd image that data is pulled up at one node and then computation is being done.Am i right in here ?

like image 636
Roshan Mehta Avatar asked Feb 17 '17 17:02

Roshan Mehta


People also ask

How do you calculate cross-correlation matrix?

To compute the cross-correlation of two matrices, compute and sum the element-by-element products for every offset of the second matrix relative to the first. With several caveats, this can be used to calculate the offset required to get 2 matrices of related values to overlap.

What is cross-correlation equation?

Cross-correlation between {Xi } and {Xj } is defined by the ratio of covariance to root-mean variance, ρ i , j = γ i , j σ i 2 σ j 2 .

How do you calculate covariance matrix from correlation matrix?

You can use similar operations to convert a covariance matrix to a correlation matrix. First, use the DIAG function to extract the variances from the diagonal elements of the covariance matrix. Then invert the matrix to form the diagonal matrix with diagonal elements that are the reciprocals of the standard deviations.


1 Answers

To start with, have a look at this to see if things are going right. You may then refer to any of these implementations: MPI/OpenMP: Agomezl or Meismyles, MapReduce: Vangjee or Seawolf42. It'd also be interesting to read this before you proceed. On a different note, James's thesis provides some pointers if you're interested in computing the correlations that are robust to outliers.

like image 181
dangiankit Avatar answered Sep 19 '22 20:09

dangiankit