I am trying to run some basic linear algebra operations (specifically transpose, dot product, and inverse) on a matrix stored as a Spark RowMatrix as described here (using the Python API). Following the example in the docs (in my case I will have many more rows in the matrix, hence the need for Spark), suppose I have something like this:
from pyspark.mllib.linalg.distributed import RowMatrix
# Create an RDD of vectors.
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
# Create a RowMatrix from an RDD of vectors.
mat = RowMatrix(rows)
Given such a distributed matrix, are there existing routines for doing matrix transpose and dot product, e.g.:
dot(mat.T,mat)
or matrix inverse?
inverse(mat)
I can't seem to find anything in the documentation about this. Looking for either (a) a pointer to the relevant docs or (b) a method for implementing this myself.
As of now (Spark 1.6.0), the pyspark.mllib.linalg.distributed API is limited to basic operations like counting rows/columns and transformations between types.
The Scala API supports a broader set of methods, including multiplication (RowMatrix.multiply, IndexedRowMatrix.multiply), transposition, SVD (IndexedRowMatrix.computeSVD), QR decomposition (RowMatrix.tallSkinnyQR), Gramian matrix computation (computeGramianMatrix), and PCA (RowMatrix.computePrincipalComponents), which can be used to implement more complex linear algebra functions.
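For the specific case of dot(mat.T, mat), the Gramian is the sum of the outer products of the rows, so it can be built with a map followed by a reduce. The following is only a sketch in plain Python: the helper names outer and add are made up for illustration, and the built-in map and functools.reduce stand in for what I assume would be rows.map(outer).reduce(add) on the RDD from the question.

```python
from functools import reduce

def outer(row):
    """Outer product of a row with itself: an n x n list of lists."""
    return [[a * b for b in row] for a in row]

def add(m1, m2):
    """Element-wise sum of two equally sized n x n matrices."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

# Same data as in the question, held in a local list instead of an RDD.
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# Each row contributes one n x n outer product; summing them gives A^T A.
gram = reduce(add, map(outer, rows))
print(gram)
```

The result is only n x n (n being the number of columns), so this approach is reasonable when the matrix is tall and skinny and the Gramian fits in memory on the driver.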
In Spark 1.6 and later, you can do matrix arithmetic operations by means of the BlockMatrix class. Only multiply and add are available in Spark 1.6; Spark 2.0 adds more. As of this writing you would have to implement inverse by hand, but dot and transpose are available: https://github.com/apache/spark/blob/branch-2.0/python/pyspark/mllib/linalg/distributed.py#L811. Here's a Spark 1.6 example.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix, BlockMatrix
sc = SparkContext()
# zipWithIndex() pairs each row with its position: (row, index)
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]) \
    .zipWithIndex()
# need a SQLContext() to generate an IndexedRowMatrix from RDD
sqlContext = SQLContext(sc)
# IndexedRow takes (index, vector), so swap the pair before converting
rows = IndexedRowMatrix(
    rows.map(lambda row: IndexedRow(row[1], row[0]))
).toBlockMatrix()
mat_product = rows.multiply(<SOME OTHER BLOCK MATRIX>)
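On implementing inverse by hand: one possible approach, not something Spark provides, is to invert locally once the matrix is small enough to collect to the driver (for example an n x n product of a tall, skinny matrix with its transpose). Below is a minimal sketch of Gauss-Jordan elimination in plain Python; the function name invert is hypothetical, and exact fractions are used only to keep the example free of rounding error.

```python
from fractions import Fraction

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in this column.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the diagonal entry becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The right half of the augmented matrix now holds the inverse.
    return [row[n:] for row in aug]

small = [[4, 7], [2, 6]]
inv = invert(small)
print(inv)
```

Note that the 4 x 3 example matrix from the question has linearly dependent columns (the second column is the average of the first and third), so the product of its transpose with itself is singular and invert would raise ValueError on it.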