Basic linear algebra on spark matrices

Question

I am trying to run some basic linear algebra operations (specifically transpose, dot product, and inverse) on a matrix stored as a Spark RowMatrix, as described here (using the Python API). Following the example in the docs (in my case the matrix will have many more rows, hence the need for Spark), suppose I have something like this:

from pyspark.mllib.linalg.distributed import RowMatrix
# Create an RDD of vectors.
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
# Create a RowMatrix from an RDD of vectors.
mat = RowMatrix(rows)

Given such a distributed matrix, are there existing routines for doing matrix transpose and dot product, e.g.:

dot(mat.T,mat)

or matrix inverse?

inverse(mat)

I can't seem to find anything in the documentation about this. I'm looking for either (a) a pointer to the relevant docs or (b) a method for implementing this myself.

zero323 · Accepted Answer

As of now (Spark 1.6.0), the pyspark.mllib.linalg.distributed API is limited to basic operations like counting rows/columns and transformations between types.

The Scala API supports a broader set of methods, including multiplication (RowMatrix.multiply, IndexedRowMatrix.multiply), transposition, SVD (IndexedRowMatrix.computeSVD), QR decomposition (RowMatrix.tallSkinnyQR), Gramian matrix computation (computeGramianMatrix), and PCA (RowMatrix.computePrincipalComponents), which can be used to implement more complex linear algebra functions.
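
For the dot(mat.T, mat) case in the question, one way to implement it by hand relies on the identity that A^T A equals the sum of the outer products of the rows of A, so it can be written as a map followed by a reduce. The following is a minimal pure-Python sketch of that idea, not Spark's own implementation; the helper names outer, add, and gramian are hypothetical and are not part of any Spark API.

```python
from functools import reduce


def outer(row):
    """Outer product of a row with itself, as an n x n list of lists."""
    return [[a * b for b in row] for a in row]


def add(m1, m2):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def gramian(rows):
    """Compute A^T A as the sum of the per-row outer products."""
    return reduce(add, map(outer, rows))


# Same data as in the question, held in a local list instead of an RDD.
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
gram = gramian(rows)
print(gram)  # [[166, 188, 210], [188, 214, 240], [210, 240, 270]]
```

Assuming rows is an RDD of Python lists as in the question, the same two helpers could presumably be applied as rows.map(outer).reduce(add). The result is only n x n (n being the number of columns), so it fits on the driver when the matrix is tall and skinny.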

Paul Back · Answer

In Spark 1.6 and later, you can do matrix arithmetic with the BlockMatrix class. Only multiply and add are available in Spark 1.6; Spark 2.0 adds more. As of this writing you would have to implement inverse by hand, but dot and transpose are available: https://github.com/apache/spark/blob/branch-2.0/python/pyspark/mllib/linalg/distributed.py#L811. Here's a Spark 1.6 example.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix, BlockMatrix

sc = SparkContext()
# need a SQLContext() to generate an IndexedRowMatrix from RDD
sqlContext = SQLContext(sc)

# pair each row with its position: (row, index)
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]) \
    .zipWithIndex()

# wrap each pair as IndexedRow(index, row), then convert to a BlockMatrix
mat = IndexedRowMatrix(
    rows.map(lambda row: IndexedRow(row[1], row[0]))
).toBlockMatrix()

mat_product = mat.multiply(<SOME OTHER BLOCK MATRIX>)
