I am trying to perform matrix multiplication using Apache Spark and Java.
I have 2 main questions:

1. How do I represent a matrix as a distributed data structure in Apache Spark?
2. How do I multiply two such matrices?
It all depends on the input data and dimensions, but generally speaking what you want is not an RDD but one of the distributed data structures from org.apache.spark.mllib.linalg.distributed. At the moment it provides four different implementations of DistributedMatrix:

IndexedRowMatrix - can be created directly from an RDD[IndexedRow], where an IndexedRow consists of a row index and an org.apache.spark.mllib.linalg.Vector:
    import org.apache.spark.mllib.linalg.{Vectors, Matrices}
    import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix, IndexedRow}

    val rows = sc.parallelize(Seq(
      (0L, Array(1.0, 0.0, 0.0)),
      (1L, Array(0.0, 1.0, 0.0)),
      (2L, Array(0.0, 0.0, 1.0)))
    ).map{case (i, xs) => IndexedRow(i, Vectors.dense(xs))}

    val indexedRowMatrix = new IndexedRowMatrix(rows)
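If you later need to multiply two distributed matrices, it can help to check the inferred dimensions and convert to a type that supports the operation. A minimal sketch using the indexedRowMatrix built above (the conversions shown are standard IndexedRowMatrix methods, picked only for illustration):

    // Dimensions are inferred from the data: 3 x 3 here
    indexedRowMatrix.numRows()  // 3
    indexedRowMatrix.numCols()  // 3

    // Convert to other distributed representations when needed
    val asBlocks = indexedRowMatrix.toBlockMatrix()        // single block for this tiny example
    val asCoords = indexedRowMatrix.toCoordinateMatrix()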
RowMatrix - similar to IndexedRowMatrix but without meaningful row indices. Can be created directly from an RDD[org.apache.spark.mllib.linalg.Vector]:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rowMatrix = new RowMatrix(rows.map(_.vector))
BlockMatrix - can be created from an RDD[((Int, Int), Matrix)], where the first element of the tuple contains the coordinates of the block and the second one is a local org.apache.spark.mllib.linalg.Matrix:

    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    val eye = Matrices.sparse(
      3, 3, Array(0, 1, 2, 3), Array(0, 1, 2), Array(1, 1, 1))

    val blocks = sc.parallelize(Seq(
      ((0, 0), eye), ((1, 1), eye), ((2, 2), eye)))

    val blockMatrix = new BlockMatrix(blocks, 3, 3, 9, 9)
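Before multiplying, it can be worth sanity-checking the block layout; a small sketch using the blockMatrix defined above:

    // Throws an exception if block sizes or coordinates are inconsistent
    blockMatrix.validate()

    // Overall dimensions of the 3 x 3 grid of 3 x 3 blocks
    blockMatrix.numRows()  // 9
    blockMatrix.numCols()  // 9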
CoordinateMatrix - can be created from an RDD[MatrixEntry], where a MatrixEntry consists of a row index, a column index and a value:

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

    val entries = sc.parallelize(Seq(
      (0, 0, 3.0), (2, 0, -5.0), (3, 2, 1.0),
      (4, 1, 6.0), (6, 2, 2.0), (8, 1, 4.0))
    ).map{case (i, j, v) => MatrixEntry(i, j, v)}

    val coordinateMatrix = new CoordinateMatrix(entries, 9, 3)
The first two implementations support multiplication by a local Matrix:

    val localMatrix = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

    indexedRowMatrix.multiply(localMatrix).rows.collect
    // Array(IndexedRow(0,[1.0,4.0]), IndexedRow(1,[2.0,5.0]),
    //       IndexedRow(2,[3.0,6.0]))
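RowMatrix supports the same operation; a quick sketch reusing localMatrix and the rowMatrix defined above (row order in the collected result is not guaranteed, since the rows carry no indices):

    // Multiplication by a local matrix also works for RowMatrix;
    // the result is another RowMatrix backed by an RDD[Vector]
    rowMatrix.multiply(localMatrix).rows.collect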
The third implementation, BlockMatrix, can be multiplied by another BlockMatrix, as long as the number of columns per block in this matrix matches the number of rows per block of the other matrix. CoordinateMatrix doesn't support multiplication, but it is easy to create and to transform into the other types of distributed matrices:

    blockMatrix.multiply(coordinateMatrix.toBlockMatrix(3, 3))
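The product is itself a BlockMatrix; for a result this small it can be collected to the driver, or converted back to a row-oriented type for further distributed work. A minimal sketch, assuming the 9 x 3 product from the line above:

    val product = blockMatrix.multiply(coordinateMatrix.toBlockMatrix(3, 3))

    // Only safe for small results: materializes the full 9 x 3 matrix on the driver
    val local = product.toLocalMatrix()

    // Or stay distributed, e.g. as an IndexedRowMatrix
    val productRows = product.toIndexedRowMatrix()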
Each type has its own strengths and weaknesses, and there are some additional factors to consider when you use sparse or dense elements (Vectors or block Matrices). Multiplying by a local matrix is usually preferable, since it doesn't require expensive shuffling.
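To illustrate the sparse/dense choice: the same data can be stored either way, and for data with many zeros the sparse form is usually much cheaper to store and ship around. A small sketch (not part of the original answer):

    // Dense: stores every entry, including the zeros
    val dense = Vectors.dense(1.0, 0.0, 0.0)

    // Sparse: stores the size plus non-zero indices and values only
    val sparse = Vectors.sparse(3, Array(0), Array(1.0))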
You can find more details about each type in the MLlib Data Types guide.