 

Basic linear algebra on Spark matrices

I am trying to run some basic linear algebra operations (specifically transpose, dot product, and inverse) on a matrix stored as a Spark RowMatrix, as described here (using the Python API). Following the example in the docs (in my case the matrix will have many more rows, hence the need for Spark), suppose I have something like this:

from pyspark.mllib.linalg.distributed import RowMatrix
# Create an RDD of vectors.
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
# Create a RowMatrix from an RDD of vectors.
mat = RowMatrix(rows)

Given such a distributed matrix, are there existing routines for doing matrix transpose and dot product, e.g.:

dot(mat.T, mat)

or matrix inverse?

inverse(mat)

I can't seem to find anything in the documentation about this. I'm looking for either (a) a pointer to the relevant docs or (b) a method for implementing this myself.

asked Sep 21 '15 by moustachio

2 Answers

As of now (Spark 1.6.0), the pyspark.mllib.linalg.distributed API is limited to basic operations like counting rows/columns and conversions between matrix types.
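For concreteness, here is a short sketch of what the 1.6 Python API does cover, using the mat from the question (the conversion methods named are the ones that exist on the indexed types):

from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# Dimensions are available directly on the RowMatrix:
print(mat.numRows())  # 4
print(mat.numCols())  # 3

# Conversions between distributed matrix types go through the indexed
# forms, e.g. IndexedRowMatrix -> CoordinateMatrix / BlockMatrix:
indexed = IndexedRowMatrix(
    mat.rows.zipWithIndex().map(lambda x: IndexedRow(x[1], x[0]))
)
coord = indexed.toCoordinateMatrix()
block = indexed.toBlockMatrix()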

The Scala API supports a broader set of methods, including multiplication (RowMatrix.multiply, IndexedRowMatrix.multiply), transposition, SVD (IndexedRowMatrix.computeSVD), QR decomposition (RowMatrix.tallSkinnyQR), Gramian matrix computation (computeGramianMatrix), and PCA (RowMatrix.computePrincipalComponents), which can be used to implement more complex linear algebra functions.
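Until those methods reach Python, one workaround for a tall-skinny matrix is to compute the Gramian A^T A by hand: the result is only n_cols x n_cols, so it fits on the driver. A minimal sketch (my own stand-in for the Scala computeGramianMatrix, assuming the column count is small), using the mat from the question:

import numpy as np

# Sum the outer products of the rows with themselves; the result is
# the n_cols x n_cols matrix A^T A, small enough to hold locally.
gram = mat.rows \
    .map(lambda v: np.outer(v.toArray(), v.toArray())) \
    .reduce(lambda a, b: a + b)

From a local A^T A you can then get an inverse or pseudo-inverse with numpy.linalg, which covers the inverse(mat) part of the question for the tall-skinny case.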

answered Sep 28 '22 by zero323


In Spark 1.6 and later, you can do matrix arithmetic by means of the BlockMatrix class. In Spark 1.6 only multiply and add are available; Spark 2.0 adds more. As of this writing you would still have to implement inverse by hand, but dot product and transpose are available: https://github.com/apache/spark/blob/branch-2.0/python/pyspark/mllib/linalg/distributed.py#L811. Here's a Spark 1.6 example.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix, BlockMatrix

sc = SparkContext()
# a SQLContext is needed to generate an IndexedRowMatrix from an RDD
sqlContext = SQLContext(sc)

# pair each row with its index: (row, index)
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]) \
    .zipWithIndex()

# wrap each (row, index) pair in an IndexedRow, then convert to a BlockMatrix
block_mat = IndexedRowMatrix(
    rows.map(lambda row: IndexedRow(row[1], row[0]))
).toBlockMatrix()

mat_product = block_mat.multiply(<SOME OTHER BLOCK MATRIX>)
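
In Spark 2.0, where BlockMatrix.transpose is available in the Python API (that is what the link above points at), the transpose-and-dot-product part of the question becomes a one-liner. A sketch, reusing block_mat from the example:

# Spark 2.0+ only: BlockMatrix gains transpose() in Python,
# so the product block_mat^T * block_mat is simply:
gram = block_mat.transpose().multiply(block_mat)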

answered Sep 28 '22 by Paul Back