MLlib-to-Breeze vector/matrix conversions are private to org.apache.spark.mllib scope?

I have read that MLlib's local vectors and matrices currently wrap the Breeze implementations, but the methods that convert MLlib vectors/matrices to their Breeze counterparts are private to the org.apache.spark.mllib scope. The suggested workaround is to write your code in an org.apache.spark.mllib.something package.

Is there a better way to do this? Can you cite some relevant examples?

Thanks and regards,

asked Oct 30 '14 by learning_spark


People also ask

What is the basic difference between a RowMatrix and an IndexedRowMatrix?

An IndexedRowMatrix is similar to a RowMatrix but with meaningful row indices. It is backed by an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector. An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector).
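
For instance, a minimal sketch of building one (this assumes an existing SparkContext named sc; the values are purely illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// each row pairs an explicit Long index with a local vector
val rows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0)),
  IndexedRow(1L, Vectors.dense(3.0, 4.0))
))

val indexedMatrix = new IndexedRowMatrix(rows)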

What is a distributed matrix?

A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large, distributed matrices.
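
For example, the basic RowMatrix format can be built as in this small sketch (again assuming a SparkContext named sc):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// a RowMatrix is an RDD of local vectors with no meaningful row indices
val mat = new RowMatrix(sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0)
)))

println(s"dimensions: ${mat.numRows()} x ${mat.numCols()}")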


2 Answers

I used the same solution @dlwh suggested. Here is the code that does it:

package org.apache.spark.mllib.linalg

// defined inside the org.apache.spark.mllib.linalg package so it can reach the
// package-private toBreeze / fromBreeze methods and re-expose them publicly
object VectorPub {

  implicit class VectorPublications(val vector: Vector) extends AnyVal {
    def toBreeze: breeze.linalg.Vector[scala.Double] = vector.toBreeze
  }

  implicit class BreezeVectorPublications(val breezeVector: breeze.linalg.Vector[Double]) extends AnyVal {
    def fromBreeze: Vector = Vectors.fromBreeze(breezeVector)
  }
}

Notice that the implicit classes extend AnyVal to avoid allocating a wrapper object when calling these methods.
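
A quick usage sketch (the vector values are just for illustration):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.VectorPub._

val a = Vectors.dense(1.0, 2.0, 3.0)
val b = Vectors.dense(0.5, 0.5, 0.5)

// convert to Breeze, do the arithmetic there, then convert back to an MLlib Vector
val sum = (a.toBreeze + b.toBreeze).fromBreeze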

answered by lev


My solution is a kind of hybrid of @barclar's and @lev's, above. You don't need to put your code in the org.apache.spark.mllib.linalg package if you don't make use of Spark's package-private conversions. You can define your own implicit conversions in your own package, like:

package your.package

import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.ml.linalg.Vector
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}

object BreezeConverters
{
    // Spark ml.linalg vectors -> Breeze vectors
    implicit def toBreeze( dv: DenseVector ): BDV[Double] =
        new BDV[Double](dv.values)

    implicit def toBreeze( sv: SparseVector ): BSV[Double] =
        new BSV[Double](sv.indices, sv.values, sv.size)

    implicit def toBreeze( v: Vector ): BV[Double] =
        v match {
            case dv: DenseVector => toBreeze(dv)
            case sv: SparseVector => toBreeze(sv)
        }

    // Breeze vectors -> Spark ml.linalg vectors
    implicit def fromBreeze( dv: BDV[Double] ): DenseVector =
        new DenseVector(dv.toArray)

    implicit def fromBreeze( sv: BSV[Double] ): SparseVector =
        // a Breeze sparse vector may carry spare capacity in its index/data arrays,
        // so copy only the active entries
        new SparseVector(sv.length, sv.index.slice(0, sv.used), sv.data.slice(0, sv.used))

    implicit def fromBreeze( bv: BV[Double] ): Vector =
        bv match {
            case dv: BDV[Double] => fromBreeze(dv)
            case sv: BSV[Double] => fromBreeze(sv)
        }
}

Then you can import these implicits into your code with:

import your.package.BreezeConverters._
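
For example, a short sketch of the conversions in action (values are illustrative):

import org.apache.spark.ml.linalg.{Vector, Vectors}
import breeze.linalg.{Vector => BV}
import your.package.BreezeConverters._

val v: Vector = Vectors.dense(1.0, 2.0, 3.0)

// toBreeze is applied implicitly where a Breeze vector is expected
val bv: BV[Double] = v

// fromBreeze converts a Breeze result back to a Spark Vector
val back: Vector = bv + bv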
answered by corvi42