How can I calculate the Pearson cross-correlation matrix of a large (>10 TB) dataset, ideally in a distributed manner? Any efficient distributed algorithm suggestions would be appreciated.
Update: I read the implementation of Apache Spark MLlib's correlation.
Pearson computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
Covariance computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
But it looks to me as though all the computation is happening at one node, i.e. it is not distributed in the real sense.
Please shed some light on this. I also tried executing it on a 3-node Spark cluster; below are the screenshots:
As you can see from the second image, the data is pulled to one node and the computation is then done there. Am I right about this?
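For context, this is roughly how that code path is exercised from the MLlib API; a minimal sketch assuming an existing `SparkContext` named `sc` and a hypothetical comma-separated input path:

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.stat.Statistics

// Parse each input line into one observation vector (path is hypothetical).
val rows = sc.textFile("hdfs:///path/to/data")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))

// Full Pearson correlation matrix over all columns.
val corr: Matrix = Statistics.corr(rows, "pearson")
```

Note that while the per-row aggregation runs across the cluster, the result is a local numCols x numCols matrix materialized on the driver, so the number of columns must stay modest even when the row count is huge; that driver-side step may be what the screenshot shows.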
To compute the cross-correlation of two matrices, compute and sum the element-by-element products for every offset of the second matrix relative to the first. With several caveats, this can be used to calculate the offset required to make two matrices of related values overlap, as in the sketch below.
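A small single-machine Scala sketch of that procedure (the array layout, names, and offset search window are illustrative assumptions, not anyone's reference implementation):

```scala
// Cross-correlation of matrix `a` against matrix `b` at offset (dr, dc):
// sum the products of all overlapping elements.
def crossCorrelate(a: Array[Array[Double]], b: Array[Array[Double]],
                   dr: Int, dc: Int): Double = {
  var sum = 0.0
  for (r <- a.indices; c <- a(r).indices) {
    val rb = r + dr
    val cb = c + dc
    if (rb >= 0 && rb < b.length && cb >= 0 && cb < b(rb).length)
      sum += a(r)(c) * b(rb)(cb)
  }
  sum
}

// The offset within +/- maxShift where the two matrices align best,
// i.e. where the sum of products is largest.
def bestOffset(a: Array[Array[Double]], b: Array[Array[Double]],
               maxShift: Int): (Int, Int) =
  (for (dr <- -maxShift to maxShift; dc <- -maxShift to maxShift)
    yield ((dr, dc), crossCorrelate(a, b, dr, dc))).maxBy(_._2)._1
```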
Cross-correlation between $\{X_i\}$ and $\{X_j\}$ is defined as the ratio of the covariance to the square root of the product of the variances:

$$\rho_{i,j} = \frac{\gamma_{i,j}}{\sqrt{\sigma_i^2 \, \sigma_j^2}}$$
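As a concrete single-machine illustration of that definition (names and the sample-covariance convention with $n-1$ are my own choices):

```scala
// Pearson correlation straight from the definition above:
// rho = cov(x, y) / sqrt(var(x) * var(y)).
def pearson(x: Array[Double], y: Array[Double]): Double = {
  require(x.length == y.length && x.length > 1)
  val n  = x.length
  val mx = x.sum / n
  val my = y.sum / n
  val cov  = x.indices.map(i => (x(i) - mx) * (y(i) - my)).sum / (n - 1)
  val varX = x.map(a => (a - mx) * (a - mx)).sum / (n - 1)
  val varY = y.map(b => (b - my) * (b - my)).sum / (n - 1)
  cov / math.sqrt(varX * varY)
}
```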
You can use similar operations to convert a covariance matrix to a correlation matrix. First, extract the variances from the diagonal elements of the covariance matrix (e.g., with a DIAG function). Then take their square roots to obtain the standard deviations, form the diagonal matrix whose diagonal elements are the reciprocals of those standard deviations, and pre- and post-multiply the covariance matrix by it.
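In matrix form that is $R = D^{-1/2}\,\Sigma\,D^{-1/2}$ with $D = \mathrm{diag}(\Sigma)$; a minimal element-wise Scala sketch of the same scaling (the array representation is my own assumption):

```scala
// Convert a covariance matrix to a correlation matrix:
// corr(i)(j) = cov(i)(j) / (sd(i) * sd(j)),
// where sd(i) is the square root of the i-th diagonal element.
def covToCorr(cov: Array[Array[Double]]): Array[Array[Double]] = {
  val invSd = cov.indices.map(i => 1.0 / math.sqrt(cov(i)(i)))
  Array.tabulate(cov.length, cov.length) { (i, j) =>
    cov(i)(j) * invSd(i) * invSd(j)
  }
}
```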
To start with, have a look at this to see if things are going right. You may then refer to any of these implementations: MPI/OpenMP: Agomezl or Meismyles, MapReduce: Vangjee or Seawolf42. It'd also be interesting to read this before you proceed. On a different note, James's thesis provides some pointers if you're interested in computing the correlations that are robust to outliers.