I'm new to Spark and Scala and I'm trying to read its documentation on MLlib.
The tutorial on http://spark.apache.org/docs/1.4.0/mllib-data-types.html,
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
does not show how to construct an RDD[Vector] (variable rows) from a list of local vectors.
So for example, I have executed (as part of my exploration) in spark-shell
val v0: Vector = Vectors.dense(1.0, 0.0, 3.0)
val v1: Vector = Vectors.sparse(3, Array(1), Array(2.5))
val v2: Vector = Vectors.sparse(3, Seq((0, 1.5),(1, 1.8)))
which if 'merged' will look like this matrix
1.0 0.0 3.0
0.0 2.5 0.0
1.5 1.8 0.0
So, how do I transform Vectors v0
, v1
, v2
to rows
?
By using the property of Spark Context which parallelize the Sequence, we can achieve the thing you want, Since you have created vectors,now all you required to bring them in sequence and parallelize by the process given below.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val v0 = Vectors.dense(1.0, 0.0, 3.0)
val v1 = Vectors.sparse(3, Array(1), Array(2.5))
val v2 = Vectors.sparse(3, Seq((0, 1.5), (1, 1.8)))
val rows = sc.parallelize(Seq(v0, v1, v2))
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With